搜索
您的当前位置:首页正文

Informing algorithms for efficient scheduling of synchronizing threads on multiprogrammed s

来源:二三娱乐
InformingAlgorithmsforEfficientSchedulingofSynchronizingThreads

onMultiprogrammedSMPs

ChristosD.AntonopoulosDimitriosS.NikolopoulosTheodoreS.PapatheodorouHighPerformanceInformationSystemsLaboratoryComputersandSystemsResearchLabComputerEngineeringandInformaticsDepartmentCoordinatedScienceLaboratory

UniversityofPatrasUniversityofIllinoisatUrbana-Champaign26500Patras,GREECE1308WestMainStr.,Urbana,IL61801,U.S.A.cda,tsp@hpclab.ceid.upatras.grdsn@hpclab.ceid.upatras.gr

Abstract

Wepresentnovelalgorithmsforefficientschedulingof

synchronizingthreadsonmultiprogrammedSMPs.Theal-gorithmsarebasedonintra-applicationprioritycontrolofsynchronizingthreads.Werefertosuchalgorithmswiththeterm”informingalgorithms”.Prerequisiteforinform-ingalgorithmsistheuseofanefficientcommunicationmediumbetweentheuser-andkernel-levelandtheexis-tenceofin-kernelmechanismsthatallowtheapplicationstocooperatewiththeOSscheduler.Theapplicationsaregiventheopportunitytoinfluence,inanon-intrusiveman-ner,theschedulingdecisionsconcerningtheirthreads.Wecomparetheperformanceofourinformingalgorithmswiththeperformanceofcorrespondingscheduler-obliviousal-gorithmsundermultiprogramming.Weexperimentedonasmall-scale,Intelx86-basedSMP,runningLinux,usingbothmicrobenchmarksandapplicationsfromtheSplash-2benchmarksuite.Theresultssubstantiatethesuperiorityofourapproachandindicatethatthephilosophyofinformingalgorithmsmaybeapplicableinawiderangeofalgorithmsandarchitectures.

mainlyfromthepoorschedulingofsynchronizingthreadsonphysicalprocessors[4],[6].Idlingtimeatsynchroniza-tionpointsmayconstituteasignificantfractionofexecu-tiontimeifoneormoreofthefollowingconditionshold:a)athreadispreemptedwhileholdingalock,b)apre-emptedlock-waiterthreadremainspreemptedafterithasbeengrantedthelockbythepreviousholder,orc)athreadactivelywaitsforotherthreadstoreachaspecificpoint(i.e.abarrier),whilesomeofthesethreadsarepreempted.Theperformancedegradationisworseforsynchroniza-tionalgorithmsdesignedtoprovideperformancescalabilityonalargenumberofprocessorsorfairarbitrationbetweencontendingprocessors[6].Thesealgorithmsintroduce-ei-therimplicitlyorexplicitly-astrictorderingofsynchro-nizationoperationsperformedbydifferentthreads.How-ever,thisorderinggenerallydiffersfromtheorderthreadsarescheduledonphysicalprocessorsbytheOSscheduler.Theproblemofschedulingsynchronizingthreadsinamultiprogrammingenvironmenthasnotbeenadequatelyaddressed,ifatall,incontemporarycommercialSMPschedulersforsmall-andmedium-scalesystems.Inthispaper,weintroduceakernel-levelinfrastructureofgeneralapplicability.Thisinfrastructurearmsmultithreadedappli-cationswithmechanismsthatallowthemtocontrol,inanon-intrusivemanner,theexecutionoftheirownthreads.Inparticular,eachapplicationassignsuser-levelprioritiestoitsthreads.Theseprioritiesreflecttheimportanceofeachthreadfortheprogressoftheapplication.TheOSsched-uler,onitsturn,assignsphysicalprocessorstoapplicationsbasedontheschedulingpolicyimposedbytheoperatingsystemdesigners.Theprioritiesprovidedbytheapplica-tionsaretakenintoaccountduringtheallocationofphysicalprocessorstospecificthreads.TheefficientcommunicationbetweenapplicationsandtheOSschedulerisfacilitatedbytheuseofmemorypages,sharedbetweenuser-andkernel-level[11],[12].Theuser-levelprovidedprioritieshaveap-

1Introduction

Atypicalclassofapplicationssufferingsevereperfor-mancedegradationinthepresenceofmultiprogrammingaremultithreadedapplicationswithfrequentlysynchro-nizingthreads[6].Ithasbeenshownthattheproblemofpoorperformanceundermultiprogrammingoriginates

ThisworkhasbeensupportedbytheHellenicGeneralSecretariatofResearchandTechnology(G.S.R.T.)researchprogram99-E566.

ThisworkhasbeencarriedoutwhilethesecondauthorwaswiththeHighPerformanceInformationSystemsLaboratory,UniversityofPatras,Greece.

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

plicationwidescope,thustheydonotinterferewithOSschedulingdecisionsconcerningotherapplications.ThisguaranteesthatthefairnessimposedbytheOSschedulerisnotaffected.Moreover,thisapproachmakeskernel-levelmechanismsindependentofandoblivioustothealgorithmusingthem.Themechanismscanthusbeusedbydiverseapplications,providedthatthelatterarecapableofquan-tifyingtheimportanceofeachthreadfortheprogressoftheapplication.Takingadvantageofthisproperty,wehaveimplementedsixdifferentinformingsynchronizationalgo-rithmsusingacommonsetofkernelmechanisms.Oural-gorithmscanbeusedforintra-applicationsynchronizationonly.Extendingthemtosupportinter-applicationsynchro-nizationwouldharmthenon-intrusivenessproperty,giventhefactthatapplicationswouldbeallowedtoaltertheuser-levelprioritiesofthreadsbelongingtootherapplications.Thecharacteroftheinformingsynchronizationalgo-rithmspresentedinthispaperisproactive.Theyguaranteethatnolock-waitersorbarrier-waiterswillbegrantedapro-cessorwhilethereexist,inthesameapplication,preemptedlock-holdersorthreadsthathavenotreachedthebarrieryet.ThetargetoperatingsysteminthisstudyisLinux.WehavemodifiedtheLinuxkernel(version2.2.15)inordertoprovidethegeneralmechanismsneededbyinformingal-gorithms.Bothinformingsynchronizationalgorithmsandthecorrespondingscheduler-obliviousversionsareimple-mentedandprovidedasarun-timesynchronizationlibrary.Therestofthispaperisorganizedasfollows.Section2discussesrelatedwork.Insection3wepresentdetailsonthekernel-leveldesignandimplementation.Insection4wedescribethenovelinformingalgorithmsandtheircharacter-istics.Insection5wepresentanexperimentalevaluationofboththenovelalgorithmsandthecorrespondingscheduler-obliviouscounterpartsonanIntelx86-basedSMPandsec-tion6concludesthepaper.

2RelatedWork

Theperformancedegradationsufferedbyparallelpro-gramswithfrequentsynchronizationamongtheirthreads,especiallyinthepresenceofmultiprogramming,hasbeenidentifiedbyseveralresearchersinthepast.Inthissection,weoutlinesomeoftheproposedsolutions.

Thedecisiononwhetherathreadthatwaitsatasynchro-nizationpointshouldactivelyspin,competitivelyspin(spinforacertaintime-windowandtheblock)orimmediatelyblockhasbeenahot-spotinresearch.In[5],A.Karlinetal.reachtheconclusionthatadaptivecompetitivespinningal-gorithmsgenerallyoutperformstaticcompetitivespinningones.Thelatterareconsidered,intheirturn,betterthanactivelyspinningandimmediateblockingalgorithms.TheevaluationhasbeencarriedoutontheFireflyMultiproces-sor.However,in[7]B.LimandA.Agarwalreachadiffer-

entconclusion.TheirexperimentsontheAlewifemultipro-cessorshowthatimmediatelyblockingalwayssustainsper-formanceclosetothebestamongallstrategiescompared.ThecontradictionofresultsisattributedtothefactthatthecontextswitchonAlewifeissignificantlyfasterthanonFirefly,whichbenefitsimmediateblockingalgorithms.Anotherapproachistheuseofnon-blockingsynchro-nizationalgorithms.Suchalgorithmshavebydefinitionnocriticalsections,duringwhichathreadpreemptionmightcauseperformanceproblems.Theyareeasilyapplicabletosimpledatastructuresandhaveproventobequiteef-ficientundermultiprogramming[10].Unfortunately,theyarebasedonuniversalatomicprimitiveswhichcanatomi-callychangethecontentsofshort(typicallyupto64bits),continuousmemoryareas.Ifthedatastructureisnotthatshort,morecomplexalgorithmsmustbeused.Thesealgo-rithmsinvolvecopyingofthecontentsofthedatastructureandaregenerallyinefficient.

Therehasbeenworkinthepasttoprovidesynchroniza-tionalgorithmsthatcooperatewiththeOSscheduler.In[6]L.Kontothanassisetal.presentscheduler-consciousver-sionsforavarietyofalgorithms.Theyassumethatathreadcaninfluencetheschedulerenoughtopreventitselffrombeingpreemptedinashortcriticalsection.Itistheapplica-tion’sresponsibilitytocopewiththedecisionsofthesched-uleroverlongerperiodsoftime.However,theapplicationisgiven-evenforshorttimeperiods-theopportunitytoin-fluenceschedulingdecisions,possiblyattheexpenseoftheotherapplications.Moreover,thescheduler-consciousal-gorithmsalterbasicpropertiesoftheirscheduler-obliviouscounterparts.Forexample,theydonotpreservetheFCFSpropertyofseveralmutualexclusionalgorithms.

In[3]D.Blackproposesapriorityschemewithuser-levelprovidedprioritiesforthreadsthatcontendformutualexclusiveaccesstoatomicregions.Theseprioritieshaveglobalscope,amongallthreadsinthesystemandtheirroleistoaposterioriconfrontwithinopportunepreemptionsoflock-holderprocesses.Theglobalscopeofuser-levelpro-videdprioritiesisunacceptableinareal,multiuseroperat-ingsystem,becauseitprovidesameansofoverridingthestandardschedulingpolicythusharmingfairness.AnothersimilaraposterioriapproachispresentedbyT.Andersonetal.in[1].Ifathreadexecutinginacriticalsectionisfoundpreempted,itisresumed-viaauser-levelcontextswitch-andexecuteduntilexitingthecriticalsection.Fi-nally,in[8]B.Marshetal.introducea”twominute”warn-ingmechanism,usedtoavoidinopportunepreemptionsofthreadsexecutingcodeinacriticalsection.Thethreadsareinformedthattheyaregoingtobepreemptedandaregranteda”grace”timeperiodtoexitthecriticalsection.Barriersynchronizationisnotdealtwithinanyofthethreepreviousapproaches.

TheSunSolarisoperatingsystemsupportsamechanism

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

whichallowsathreadtoprovidehintstotheOSsched-ulerthatitshould,ifpossible,notbepreempted[14].Thismechanismcanbeusedfortheimplementationofweakpreemption-safelocks.TheCellularIRIX[13]andLinuxoperatingsystems,ontheotherhand,providecompetitivespinningsynchronizationprimitives.

nismsareabsolutelyindependentofthealgorithmsusedatuser-level.TheOSkernelisnotprovidedwithanyinfor-mationanddoesnotmakeanyassumptionsonthenatureofthealgorithm.This,makesourkernelmechanismsusablebyamultitudeofinformingalgorithms.

3Kernel-LevelDesignandImplementation

4

TheImplementationofInformingSyn-chronizationAlgorithms

Thekernel-levelimplementationhasbeencarriedoutinthe2.2.15Linuxkernelandoriginatesfromourpreviousworkonthesameoperatingsystem[11].Theapplicationsdeclarethattheyaregoingtousethemechanismsweintro-duceviaaspecificsystemcallatthebeginningoftheirex-ecutionlife.Kernelmechanismsarebasedontheexistenceanduseofanefficientuser-/kernel-levelbidirectionalcom-municationinterface(SharedArena).TheSharedArenaisimplementedasasharedmemorypagebetweeneachappli-cationinthesystemthatusesourmechanismsandtheOSkernel.Thisimplementationhastheadvantageofminimalcommunicationoverhead,equaltothatofsimplememoryreadsandwrites.TheapplicationsusetheSharedArenainordertoassignprioritiestotheirthreads.Thekernel,onitsturn,usesittoinformtheapplicationsonthenumberofprocessorsallocatedtothematanygiventimeandontheexactstateoftheirthreads(running,preempted).

Attheendofeachtimequantumorwhenaprocessorisidling,theoperatingsystemschedulerselectsthemostappropriatethreadtobeexecutedonthatprocessor.ThisstepoftheselectionprocessenforcestheOSschedulingrules.Iftheselectedthreadbelongstoanapplicationwhichusesourmechanisms,asecondlevelschedulerisinvoked.Thatschedulerselectsthethreadwiththehighestpriority,asspecifiedfromuser-level,amongthethreadsofthesameapplicationandallocatestheprocessortoit.Thispolicyguaranteesthateachapplicationwillbegrantedexactlytheexecutiontimeitwouldbegrantedwithoutourkernelex-tensionsthuspreservingfairnessamongapplications.Atthesametime,itallowsapplicationstousetheircpu-timeinamoreeffectiveway.

Everydecreaseoftheuser-levelpriorityofathreadmayresultinasituationwherehigherprioritythreadsarepre-empted.Ifthisisthecase,thethreadhandsoffitsproces-sorinfavorofthehigherprioritythread.Thisisachievedbyexecutingasystemcallwhichpracticallyinitiatesanewsecondlevelschedulingforthespecificapplication.

Theresultsofallschedulingdecisionsthataffectappli-cationsusingourmechanismsarecommunicatedimmedi-atelytotheuser-levelviaupdatesofthevaluesoftheap-propriatefieldsintheSharedArena.

Theuser-levelprioritycontrolisclearlynon-intrusive,inthesensethatanapplicationcannotaffecttheschedulingofotherapplicationsinthesystem.Furthermore,themecha-Thenovelsynchronizationalgorithmswepresenttakeadvantageofthekernelmechanismsdescribedinsection3.TheyprovidehintstotheOSscheduler,viaprioritiesas-signedtothreadsfromuser-level,inordertoimprovetheschedulingofsynchronizingthreads.Thealgorithmsarerepresentativeoftwoofthemostfrequentlyusedsynchro-nizationschemesinsharedmemorysystems:mutualexclu-sionandbarriersynchronization.Wehaveimplementedasimpletestandsetlockwithlocalspinning(TTASLock),anarraybasedqueueinglock(QueueLock),avariantofthelistbasedqueueinglockproposedbyMellor-CrummeyandScott(MCSLock)[9],aticketlock(TicketLock),acentralizedincrementalbarrier(IncrementalBarrier)andatreebarrierbasedonthealgorithmproposedbyMellor-CrummeyandScott(MCSBarrier)[9].

Theorganizationofthepriorityclassesisbasedonthefollowingrules:a)Athreadthatholdsalockcannotbepreemptedinfavorofanyotherthread.b)Athreadthatwaitsforalockwhichiscurrentlyownedbyanotherthreadisnotexecutedattheexpenseofathreadnotparticipatinginsynchronizationoperations.Thethreadthatwaitsforalockwillnotexecuteanyusefulcomputation,whileathreadthatisnotsynchronizingwithothersprobablywill.Theonlyexceptionarelock-waiterswhichwillbegrantedthelocknext,inalgorithmsthatposeanorderingoflockacquires.Insuchcases,processorutilizationcanbesacrificedduringshortperiods,inordertoensurethatthethreadwillpro-ceedtotheexecutioninsidethecriticalregionassoonasitishandedthelock.c)Athreadthathasreachedabarrierandwaitsforitspeerthreadstoreachthebarrieraswell,isassignedlowerprioritythanthreadsthatdonotparticipateinsynchronizationoperations.Arun-timesystemcannotassumewhetherthreadsthatcurrentlyexecutecomputationwithoutsynchronizingwill,attheendofthecomputationphase,participateinabarrierornot.Thismeansthatthepriorityofthelattercannotbeincreased.Theonlypossi-blealternativeistodecreasethepriorityofthreadsthathavealreadyreachedthebarrier.

Inourimplementationwedefineandusefivepriorityclasses.Thebasicpriority(RUNNINGPRIORITY)isthepriorityofthreadsthatdonotcurrentlyparticipateinanysynchronizationoperation.Thehighestpriorityisassignedtothreadsthatholdalock(LOCKHOLDERPRIORITY).Threadsthatwaitforalockconstitutethepriorityclass

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

belowbasicpriority(LOCKWAITERPRIORITY).Ifamutualexclusionsynchronizationalgorithmhandsthelocktowaiterthreadsindeterministicorder,thepri-orityofthethreadtobegrantedthelocknextissettoLOCKIMMEDIATEWAITERPRIORITY.ThispriorityclassislowerthanLOCKHOLDERPRIORITYandequalto,orhigherthanRUNNINGPRIORITY.Finally,threadsthathavereachedabarrierandwaitfortheirpeerstoreachthebarriertoo,constitutetheweakestpriorityclass(BAR-RIERWAITERPRIORITY).

Rightbeforeparticipatinginasynchronizationoperationeachthreadsavesthevalueofitsuser-levelprovidedprior-ity.Thesavedvalueisusedtorestorethepriorityrightafterthesynchronizationoperationisover.Thistechniqueallowsthreadstoparticipateinnestedlevelsofsynchronization.Theassignmentofuser-levelprovidedprioritiestothreadsisachieved,inmostcases,withsimplewritein-structions.However,agoalofourdesignistoeliminatethepossibilityofallowing,atanytimesnapshot,theprior-ityassignmenttobeinconsistentwiththeactualstatusofthethreads,inthecontextofsynchronizationoperations.Thismighthappenduetoraces,iftwoormorethreadsattempt,atthesametime,tochangeauser-levelpriority.Suchanin-opportunepriorityassignmentcouldresultinperformanceinferiorthantheperformancesustainedbynon-informingsynchronizationalgorithms.Inordertoeliminatethispos-sibility,wehaveidentifiedthecasesinwhicharacemightoccurandusedatomicinstructionsforthepriorityupdate.Inallcases,iftherearecontendingthreadstheyareexactlytwoandtheatomicinstructionisexecutedonlyonce.Oneofthecontenderswillsucceedandtheotherwillfailandaborttheoperation.Thismeansthattheadditionalover-headonthememorysubsystemisnegligible.

Beyondinformingalgorithms,wehaveimplementedtwoversionsofscheduler-obliviousalgorithms:anactivespin-ningandanimmediateblockingone.Asmentionedinsec-tion2,thedecisionwhetherathreadwaitingatasynchro-nizationpointshouldactivelyspin,competitivelyspinorblockiscrucialforperformance.Ouralgorithmsmakeanoptimalspinvs.blockdecision:athreadwillblockonlyiftheapplicationisgrantedlessprocessorsthanitsthreadsandthereisapreemptedthreadofhigheruser-levelpriority.Competitivespinningstrategieshavenotbeenevaluated.Thecontextswitchoverheadforthesystemwehaveexper-imentedonrangesfrom3to22sec,soweexpecttheper-formancedifferentiationamongcompetitivespinningandimmediateblockingalgorithmstobemarginal[7].

Itmustbenotedthatinformingalgorithmspreserve-incontradictionwiththescheduler-consciousalgorithmsofL.Kontothanassisetal.[6]-themaincharacteristicsoftheirsimplecounterparts(timecomplexity,memoryandnetworkoverhead,FCFSserviceofrequestsetc.).Theirmemoryre-quirementsaregenerallyhigherthatthoseofsimplealgo-rithms.However,theyare-withtheexceptionsofTicketLockandIncrementalBarrier-ofthesamecomplexityinrespecttothenumberofsynchronizingthreads.

Inthefollowingsubsectionwedescribethecharacter-isticsandimplementationoftheinformingMCSLockal-gorithm.Thepresentationoftheremainingfiveinformingsynchronizationalgorithms,whichhasbeenomittedduetospacelimitations,canbefoundin[2].

4.1MCSLock

TheMCSlockisalistbasedqueueinglock,proposedbyJ.Mellor-CrummeyandM.Scott[9].ItguaranteesFIFOorderingoflockacquisitions.Allspinsareexecutedonlocally-accessibleflagvariablesonly.Itworksequallywell-intermsofnetworkoverhead-onmachineswithandwith-outcoherentcaches.Eachlockrequiresmemoryspacepro-portionaltothenumberofcontendingthreads.

Eachthreadthatattemptstoacquirethelockinsertsanode,whichrepresentsthethread,tothetailofaqueueofwaitingthreads.Thenextfieldofthenode,whichisapointertothenextnodeinthequeuemustbeinitial-izedtoNULLpriortotheinsertion.Thenodeinsertionisachievedwithanatomicfetchandstoreinstruction.Ifthequeuewaspreviouslyempty,thethreadisthelock-holderandproceedstothecriticalsection.Ifthiswasnotthecase,thethreadsetsthelockedfieldofitsnodetotrue,toindicatethatitisalock-waiterandmakesthenextfieldofitspredecessornodepointtoitsnode.Thepredecessoristhenodereturnedbytheatomicfetchandstorein-struction.Then,thethreadwaits(eitherspinning,orblock-ing)foritsnextfieldtobecomefalse.

Inordertoreleasethelock,thelock-holderinitiallychecksifitsnode’snextfieldpointstoasuccessornode.Ifthereinnosuccessorset,thelock-holdertriestoatom-ically(compareandswap)setthetailofthequeuetoNULL.Asuccesscompletesthelockrelease.Afailureim-pliestheexistenceofathreadwhichhasinserteditsnodeatthetailofthequeue,buthasnotupdatedthenextfieldofitspredecessoryet.Inthiscase,thelock-holderthreadwaitsfortheupdatetocompleteandthensetsthelockedfieldofthesuccessortofalse.Iftheinitialexaminationofthenextfieldindicatestheexistenceofasuccessor,itslockedfieldissimplysettofalse.

ThepseudocodeofaninformingMCSlockisdepictedinfigure1.Thenodeofeachthreadisaugmentedwithtwoadditionalfields:thethreadidentifier(id)andthefieldusedforsavingtheinitialuser-levelpriority(pre-viouspriority).Athreadthatattemptstoacquirethelockinitiallysavesitsidanduser-levelprioritytothecorrespondingfieldsofitsnode(line5).ItthensetsthepointertothenextnodeequaltoNULL(line6)andin-sertsthenodetothetailofthequeueexecutinganatomic

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

1234567

mcs_lock_init(lock_queue){

Initializetheheadoflock_queuetoNULL;}

mcs_lock(lock_queue,my_node,my_id){

Saveboththeprevioususer-levelpriorityofmy_idthreadandmy_idtomy_node.previous_priorityandmy_node.id;Setmy_node.next=NULL

Executeafetch_and_storeatomicinstructiontoinsertmy_nodetothetailofthelock_queueandgetthepreviousvalueofthetail(predecessor);

IfthepredecessorisnotNULL{

Setmy_node.locked=truetoindicatethatthisthreadisnotthelockowner;

Ifthepredecessoristhelockowner{

Settheuser-levelpriorityofthisthreadtoLOCK_IMMEDIATE_WAITER_PRIORITY;

Putmy_nodeinthequeueafterpredecessor(predecessor.next=my_node);}

else{

Putmy_nodeinthequeueafterpredecessor;

Ifthethreaddidnotinthemeantimebecomethelockowner,trytoatomically(compare_and_swap)setitsuser-levelprioritytoLOCK_WAITER_PRIORITY;}

Whilethisthreadisnotthelockowner(my_node.locked=true){

Iftheprocessorsgrantedbytheschedulerarelessthanthethreadsoftheapplicationandifthereisa

preemptedthreadwithhigheruser-levelprioritythanthisthread,handofftheprocessortothehigheruser-levelprioritythread;}}

Thisthreadisnowthelockowner:Setitsuser-levelprioritytoLOCK_HOLDER_PRIORITY;

Ifthereisasuccessorofthisthread’snodeinthelock_queue(my_node->next<>NULL){

Setitsuser-levelprioritytoLOCK_IMMEDIATE_WAITER_PRIORITY;}}

mcs_unlock(lock_queue,my_id){

Ifthereisnosuccessornodeinthelock_queue(my_node.next=NULL){

Ifwearesuccessfulatatomically(compare_and_swap)changingthelock_queuetailfrompointingtomy_nodetoNULL{

Restoretheuser-levelpriorityofthisthreadusingthevaluepreviouslysavedinmy_node;

Iftheprocessorsgrantedbytheschedulerarelessthanthethreadsoftheapplicationandifthereisapreemptedthreadwithhigheruser-levelprioritythanthisthread,handofftheprocessortothehigheruser-levelprioritythread;Return;}

else{

Somenodeistryingtoenterthequeueasoursuccessor.Waitforit.Inthemeantime,settheuser-levelpriorityofthisthreadtoLOCK_IMMEDIATE_WAITER_PRIORITY(equaltothatofthethreadtobeinserted).Ifthegranted

processorsarelessthanthethreadsoftheapplicationandtherearepreemptedthreadswithhigheruser-levelpriority,handofftheprocessortothem;}}

SetthepriorityofthisthreadtoLOCK_HOLDER_PRIORITYagain;Setthepriorityofthisthread’ssuccessor(succ_node)inthelock_queuetoLOCK_HOLDER_PRIORITY;

Setsucc_node.lockedtofalsetoindicatethatthatthreadisnowthelockowner;

Restoretheuser-levelpriorityofthisthreadusingthepreviouslysavedvalue;

Iftheprocessorsgrantedbytheschedulerarelessthanthethreadsoftheapplicationandifthereisapreemptedthreadwithhigheruser-levelprioritythanthisthread,handofftheprocessortothehigheruser-levelprioritythread;}

8910111213141516

171819

202122232425262728293031

32333435

36373839404142

43

Figure1.InformingMCSLockpseudocode

fetchandstoreinstruction.Iftherewasanodeinthewait-queuepriortotheinsertion,itisexaminedwhetherthatnodeisthelock-holderornot.Ifthepredecessoristhelock-holder,theuser-levelpriorityofthecurrentthreadissetequaltoLOCKIMMEDIATEWAITERPRIORITYandthenodeisinsertedinthequeueafterthepredeces-sor(predecessor.next=mynode,lines10-13).Ifthisisnotthecase,thethreadplacesitsnodeinthequeueafterthepredecessorandtriestoatomicallysetitsuser-levelprioritytoLOCKWAITERPRIORITY(lines14-17).Inanycase,afterthepriorityassignment,thethreadpollsthe

valueofthelockedfielduntilthelatterisfoundfalse.Inthemeantime,thethreadmayhandoffitsprocessortootherthreads,shouldthatbenecessary(lines18-20).Whenthethreadiseventuallygrantedthelock,itincreasesitsuser-levelprioritytoLOCKHOLDERPRIORITY.Ifthereisasuccessornodeinthequeue,theuser-levelpriorityofthecorrespondingthreadissetequaltoLOCKIMMEDIATEWAITERPRIORITY.

Duringthelockrelease,thethreadexamineswhetherthereisasuccessornodeinthewait-queue(line28).Ifthereisnot,thethreadtriestoatomically(com-pareandswap)setthetailofthequeue,whichcurrentlypointstoitsnode,toNULL(line29).Iftheatomicin-structionissuccessful,theuser-levelpriorityofthethreadisrestoredusingthevaluesavedinpreviouspriority.Thethreadthencheckswhethertherearepreemptedthreadsofhigheruser-levelpriorityandifthisisthecaseithandsofftheprocessortooneofthem(lines30-33).Afailureoftheatomicinstruction(line29)indicatestheexistenceofathreadwhichhasinserteditsnodeatthetailofthewait-queue,buthasnotyetupdatedthenextfieldofitspredecessor.Thelock-holderdecreasesitsuser-levelprior-itytoLOCKIMMEDIATEWAITERPRIORITYandhandsoffitsprocessor,inordertogivethethreadtryingtoenterthequeuetheopportunitytoexecuteandupdatethenextfieldofthelock-holdernode(lines34-36).If(orwhen)thelock-holderhasasuccessornode,itrestoresitsuser-levelprioritytoLOCKHOLDERPRIORITYandsetsthepriorityofthesuccessortothesamevalue(lines38-39).Consequently,thelockedfieldofthesuccessornodeissettofalse,inordertograntthelocktothecorrespond-ingthread(line40).Finally,thepreviouslock-holderre-storesitsuser-levelpriorityusingthevaluesavedinpre-viouspriorityandhandsoffitsprocessorinfavorofanotherthread,shouldthatbenecessary(lines41-42).

Theuseofanatomicinstructioninline16helpseliminateapotentialracehazardwhichwouldoccurifthepredecessorthreadwasgrantedthelockafterthecheck.Shouldthathappen,thepredecessorwouldattempttosettheuser-levelpriorityofthisthreadtoLOCKIMMEDIATEWAITERPRIORITY.Thenextfieldofthepredecessormustbeupdatedpriortodecreas-ingthepriority(line15).Doingtheoppositecouldre-sulttoimplicationsifthethreadispreemptedbeforeup-datingthenextfieldofthepredecessor.Iftheprede-cessorthreadisthelock-holder,itsnextfieldcanbeupdatedafterthepriorityassignment,asitsnewpriority(LOCKIMMEDIATEWAITERPRIORITY)willbehighenoughtoavoiddeadlock.Thisraisestheneedofanatomicinstructionforthepriorityassignment.

Failingtoincreasethepriorityofthelock-holderinthelock-releasephase(line38)couldresulttoundesirableim-plications(evendeadlock,ifonlyonephysicalprocessor

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

Execution Time (seconds)hasbeengrantedtotheapplication),ifthelock-holderispreemptedaftertheincreaseoftheuser-levelpriorityofitssuccessor.Insuchacase,thecurrentlock-holderwouldnotbegiventheopportunitytosetthelockedfieldofitssuc-cessortofalse,inordertoallowthelattertoenterthecriticalsection.

10090807060504030201001Cholesky5ExperimentalEvaluation

MCS KernelMCS SpinMCS BlockQueue KernelQueue SpinQueue BlockTTAS KernelTTAS SpinTTAS BlockTicket KernelTicket SpinTicket Block824Multiprogramming DegreeInthissection,wepresentanevaluationoftheperfor-manceofinformingsynchronizationalgorithms.Weareparticularlyconcernedwiththeirbehaviourundermultipro-gramming.Thecomparisoninvolvesinformingalgorithmsandtheirsimplecounterparts.

WehavecarriedoutourevaluationonaCompaqProliant5500system.Itisequippedwith4PentiumProproces-sorsclockedat200MHzwith512KBytesL2cacheeach.Themainmemoryis512MBytes.ForthepurposesofourevaluationwehaveusedbothsyntheticmicrobenchmarksandrealcomputationalkernelsandapplicationsfromtheSplash-2benchmarksuite[15].Theworkloadsconsistofoneormoreidenticalinstancesofamicrobenchmarkoranapplication.Eachinstancerequires4processorsandthenumberofconcurrentlyexecutinginstancesisequaltothedesireddegreeofmultiprogramming.Therangeofmul-tiprogrammingdegreeswehaveexperimentedwithspansfrom1to8.

Thelockmicrobenchmarksconsistofamainloopwithalternatingsynchronizingandworkingphases.Onlyonethreadcanbeinthesynchronizingphaseatanytimesnap-shot.Thesynchronizingandworkingphasesconsistof1000and150010%locallycachedmemoryupdatesre-spectively.Thelengthofthemainloopis4096iterations,whichareevenlydistributedamongtheprocessors.Themi-crobenchmarkisexecuted32timesandthereportedtimeisthemeanvalueofallexecutions.Thebarriermicrobench-marksareidenticaltothelockmicrobenchmarks,withthedifferencethatthemainloopconstitutesoftheworkingphase,describedearlier,followedbyabarrier.Thebarrierisimplementedusingthealgorithmunderevaluation.

Inordertodemonstratethattheperformancegainsachievedbyourinformingsynchronizationalgorithmscanbeconsiderableinrealapplicationsaswell,wehaveusedtwocomputationalkernels(LU,Cholesky)andoneappli-cation(Radiosity)fromtheSplash-2benchmarksuite.

TheLUkernelfactorsadensematrixintotheproductofalowerandanuppertriangularmatrix.Blockingtech-niqueshavebeenusedinordertoexploittemporallocalityandreducecommunication.Approximately20to50per-centoftheexecutiontimeisspentonbarriersynchroniza-tionoperations[15].Wehavedecomposeda1024x1024matrix,using16x16blocks.TheCholeskykernelimple-mentsblocked,sparseCholeskyfactorization.Itfactors

1200RadiosityExecution Time (seconds)10008006004002000124Multiprogramming Degree8MCS KernelMCS SpinMCS BlockQueue KernelQueue SpinQueue BlockTTAS KernelTTAS SpinTTAS BlockTicket KernelTicket SpinTicket Block400350LUExecution Time (seconds)300250200150100500124Multiprogramming Degree8Incr KernelIncr SpinIncr BlockMCS KernelMCS SpinMCS BlockFigure2.ExecutiontimesofSplash-2appli-cationsusinglock(Cholesky,Radiosity)andbarrier(LU)algorithms

asparsematrixtotheproductofalowertriangularma-trixanditstranspose.Itisnotgloballysynchronizedbe-tweensteps.Thesynchronizationisachievedusinglocksanditsoverheadrangesbetweenapproximately10and70percentofthetotalexecutiontime[15].Wehaveusedthestandardtk29.Oinputfile,distributedwithCholesky.TheRadiosityapplicationcomputestheequilibriumdistri-butionoflightinascene,usingtheiterativehierarchicaldiffuseradiositymethod.Thestructureofthecomputationishighlyirregular.Thesynchronizationoperations(locks)contributeapproximately15to30percenttotheexecutiontime[15].WehaveexecutedRadiosityinbatchmodeforthestandardroomscene.Theresultsattainedfromthemul-tiprogrammedexecutionofSplash-2computationalkernelsandapplicationsusinginforming(Kernel),activespinning(Spin)andimmediateblocking(Block)synchronizational-gorithmsaredepictedinfigure2.

Themicrobenchmarkresults(notshown)indicatethat,

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

forthetargetarchitecture,simplealgorithms(TTASlockandIncrementalBarrier)significantlyoutperformmorecomplexonesinthepresenceofmultiprogramming.Fornon-informingsynchronizationalgorithmsthiscanbemainlyattributedtothefactthatsimplealgorithmsdonotimposeastrictorderinginthesynchronizationoperationsexecutedbydifferentthreads,thustheyaremoreinsensitivetoinopportuneschedulingdecisions.Inthecaseofinform-ingalgorithmstheexecutiontimeofthelongerinstructionsequencesrequiredbymorecomplexalgorithmscontributesadominantfractionofthetotalexecutiontime,thusresult-ingtoinferiorperformanceincomparisontosimplealgo-rithms.Theperformanceofcomplexandsimplealgorithmsinadedicatedmachineisalmostindistinguishable.How-ever,giventhefactthatcomplexalgorithmsaredesignedforscalabilityonmanyprocessors,weexpectthemtoperformbetterthantheirsimplecounterpartsinalarger,dedicatedmachine.InallcasesbutTTASlockwithmultiprogram-mingdegree2,informingalgorithmsoutperformscheduler-obliviousonesundermultiprogramming.Ourinformingal-gorithmsforlocksexecuteupto348.4timesfaster(average177.6timesfaster)incomparisontospinningalgorithmsandupto8timesfaster(average5.2timesfaster)incom-parisontoimmediateblockingones.Thespeedupattainedforinformingbarrieralgorithmswhencomparedtospin-ningonesisupto1356.4(average512.4).Thecomparisonwithimmediateblockingalgorithmsleadstospeedupsupto10.5(average4.8).Theseperformancegainsareindica-tiveofthecorrectnessofourapproach.Webelievethat,inthepresenceofmultiprogrammingdegree2,thealwaysspinningversionoftheTTASlockperformsbetter(0.35sec/16.7%)thantheinformingTTASlockbecausethecon-tentionisnothighenoughforthebenefitsofourmecha-nismstooutweightheextracostofadditionalinstructions,contextswitchesandcachemissesduetothreadmigrations.MoreovertheinformingTTASlockcannotfullybenefitfromourmechanisms,duetoitsnondeterministicnature.Inadedicatedmachine,themicrobenchmarksthatuseinformingalgorithmsmayneedupto0.07seconds(6.89%)moretoexecutethanthoseimplementedwithsimplealwaysspinningalgorithms.Inanunloadedmachine,informingsynchronizationalgorithmsdegeneratetoalwaysspinningalgorithmswiththecostofsomeadditionalinstructionsinthecriticalpath.Theseadditionalinstructionsmightbecon-sideredresponsibleforthe-evenminimal-slowdown.Itisalsoworthpointingoutthatforlowornomultipro-gramming,theexecutiontimesofmicrobenchmarksusingimmediateblockingalgorithmsareworsethanthoseofmi-crobenchmarksusingalwaysspinningandinformingsyn-chronizationalgorithms.Inthepresenceoflow,ornocon-tentiontheunnecessaryreschedulingsandcontextswitchescausedbyimmediateblockingalgorithmshaveanegativeeffectonperformance.Theproblemispracticallyelim-

inatedwithinformingalgorithmsbecauseoftheoptimalspinvs.blockdecisionthesealgorithmsreach.However,asthemultiprogrammingdegreeraises,thehighestpossi-bleprocessorutilizationisofvitalimportanceandimme-diateblockingalgorithmsoutperformalwaysspinning(butnotinforming)ones.

TheresultsfromtheSplash-2experimentsindicatethattheperformancebenefitsattainedbyinformingsynchro-nizationalgorithmsarequitesoundinrealapplicationsaswell.Intheabsenceofmultiprogramming,informingsyn-chronizationalgorithmshavepracticallyindistinguishableperformancefromscheduler-obliviousones.However,inthepresenceofevenminimalmultiprogramming,inform-ingalgorithmsperformmuchbetter.TheperformanceofCholesky(lock-basedsynchronization)isupto36.8%(av-erage16.7%)higherwheninforminglockalgorithmsareusedinsteadofalwaysspinningones.Thecomparisonwithimmediateblockinglockalgorithmsyieldsmaximumandaverageimprovementof20%and11%respectively.Ra-diosity,whichalsouseslock-basedsynchronization,exe-cutesupto26.8%faster(average10.7%)ifinforminglockalgorithmsareusedinsteadofalwaysspinningonesandupto4.7timesfaster(average2.5timesfaster)incompari-sonwithimmediateblockingalgorithms.InLU,threadssynchronizeusingbarriers.Ifinformingbarriersareusedinsteadofalwaysspinningones,LUneedsupto6.5timeslessexecutiontime(average4.2timesless).Animmedi-ateblockingimplementationresultstoperformanceupto58.8%(average31.9%)worsethanthatofaninformingim-plementation.

Duetodifferencesinourplatformandimplementationframework,adirectone-to-onecomparisonofinformingsynchronizationalgorithmswiththescheduler-consciousalgorithmspresentedin[6]hasnotbeenpossible.However,anindirectqualitativecomparisonoftheresultsshowsthatinformingsynchronizationalgorithmsprovideimprove-mentsofsimilarorwidermarginthanthoseprovidedbyscheduler-conscioussynchronization.Thiscanbeattributedtothefactthatinformingalgorithmscontrolthreadschedul-ingmoreeffectively.Thescheduler-consciousalgorithmsofKontothanassisetal.trytoavoidpreemptionofthelock-holderattheexpenseofallthreadsinthesystem.Inordertoeliminatethechanceofmaliciousexploitationofthisfea-turethepreemptionisavoidedwithinshorttime-windowsonly.Thismakesscheduler-consciousalgorithmspronetoinefficienciesifthecriticalregionsarenotshort-enough.In-formingalgorithmsfavorthelockholderattheexpenseofotherthreadsofthesameapplicationonly.Thistechniqueavertsthedangerofmaliciousexploitation,sotheneedforatime-windowisraised.Moreover,informingalgorithmspreserve,asopposedtoscheduler-consciousones,thefair-nesscharacteristicsofsynchronizationalgorithms,reduc-ingthusthepossibilityofstarvation.Finally,theoptimality

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

ofthespinvs.blockdecisionprovidedbyinformingal-gorithmscanhavesignificantimpactontheperformance,especiallyforbarrieralgorithms.

[2]

6Conclusions

[3][4]

Inthispaperwehavepresentedagenerallyapplicablekernel-levelinfrastructure,whichcanbeusedtoimprovesignificantlytheperformanceofapplicationsinthepres-enceofmultiprogramming.Ourmechanismsallowappli-cationstousetheprocessortimetheyaregrantedbytheOSschedulermoreeffectively.Thisisachievedbygivingapplicationstheopportunitytoinformtheschedulerontherelativeimportanceoftheirthreads,byassigningthempri-oritiesfromuser-level.Theseprioritiesaretakenintoac-countbytheOSschedulerinordertodecidewhichspecificthreadofanapplicationwillexecuteonaprocessor.Theprioritycontrolmechanismsarenon-intrusive,inthesensethattheydonotallowapplicationstoimprovetheirperfor-manceattheexpenseofotherapplicationsinthesystem.Thekernelinfrastructureisgeneralanddoesnotdependonthenatureoftheapplication,thusitcanbeusedbymanydifferentalgorithms.Asaproofofconcept,wehaveimple-mentedsixdifferentinformingsynchronizationalgorithms,i.e.algorithmsthatusethecommonkernelmechanisms.Wehavepresentedindetailoneofthem.

Inordertoevaluatetheperformanceofinformingsyn-chronizationalgorithmsundermultiprogrammingwehaveexperimentedwithsyntheticmicrobenchmarksandcompu-tationalkernelsandapplicationsfromtheSplash-2suite.Informingsynchronizationalgorithmsperformsignificantlybetterthannon-informingonesinmultiprogrammedenvi-ronmentsandhavepracticallynegligibleoverheadintheabsenceofmultiprogramming.Moreover,thecomparativeevaluationofnon-informingalgorithmshasdrivenustotheconclusionthatsimple,immediateblockingsynchroniza-tionalgorithms,liketheTTASlockortheIncrementalBar-rier,performquitewellonsmall-scalemachines,eveninthepresenceofmultiprogramming.However,eventhesesimplealgorithmscanbenefitfromourkernelmechanisms.Webelievethattheusageofaninfrastructurewhichal-lowsnon-intrusiveuser-levelprioritycontrolisnotneces-sarilylimitedtosmall-scaleSMPs.Inanycase,theapplica-bilityofsuchmechanismsinlarger-scalesystemsrequiresthoroughinvestigation,giventhefactthatsuchsystemsgen-erallyhavedifferentarchitecturalcharacteristicsandmayposesignificantlyhigherthreadmigrationandremotemem-oryaccesscosts.

[5]

[6][7][8][9]

[10]

[11]

[12]

[13][14][15]

References

[1]T.E.Anderson,B.N.Bershad,E.D.Lazowska,andH.M.

Levy.Scheduleractivations:Effectivekernelsupportforthe

user-levelmanagementofparallelism.ACMTransactionsonComputerSystems,10(1):53–79,February1992.

C.D.Antonopoulos,D.S.Nikolopoulos,andT.S.Pap-atheodorou.Informingalgorithmsforefficientschedulingofsynchronizingthreadsonmultiprogrammedsmps.Tech-nicalReportHPCLAB-TR-220101,HighPerformanceIn-formationSystemsLaboratory,UniversityofPatras,Patras,Greece,January2001.

D.Black.Schedulingsupportforconcurrencyandparal-lelisminthemachoperatingsystem.IEEEComputer,23(5),1990.

A.Gupta,A.Tucker,andS.Urushibara.Theimpactofoper-atingsystemschedulingpoliciesandsynchronizationmeth-odsontheperformanceofparallelapplications.InProceed-ingsofthe1991ACMSIGMETRICSconferenceonMea-surementandmodelingofcomputersystems,pages120–132,1991.

A.Karlin,K.Li,M.Manasse,andS.Owicki.Empiricalstudiesofcompetitivespinningforashared-memorymulti-processor.InProceedingsofthe13thACMSymposiumonOperatingSystemPrinciples(SOSP’91),pages41–55,oct1991.

L.I.Kontothanassis,R.W.Wisniewski,andM.L.Scott.Scheduler-conscioussynchronization.ACMTransactionsonComputerSystems,15(1):3–40,February1997.

B.-H.LimandA.Agarwal.Waitingalgorithmsforsynchro-nizationinlarge-scalemultiprocessors.ACMTransactionsonComputerSystems,11(3):253–294,August1993.

B.D.Marsh,M.L.Scott,T.J.LeBlanc,andE.P.Markatos.First-classuser-levelthreads.In13thACMSymposiumonOperatingSystemsPrinciples,October1991.

J.M.Mellor-CrummeyandM.L.Scott.Algorithmsforscalablesynchronizationonshared-memorymultiproces-sors.ACMTransactionsonComputerSystems,9(1):21–65,1991.

M.M.MichaelandM.L.Scott.Relativeperformanceofpreemption-safelockingandnon-blockingsynchronizationonmultiprogrammedsharedmemorymultiprocessors.InProceedingsofthe11thInternationalParallelProcessingSymposium,April1997.

D.S.Nikolopoulos,C.D.Antonopoulos,I.E.Venetis,P.E.Hadjidoukas,E.D.Polychronopoulos,andT.S.Pa-patheodorou.Achievingmultiprogrammingscalabilityonparallelprogramsonintelsmpplatforms:Nanothreadinginthelinuxkernel.InProceedingsofthePARCO99Confer-ence,Delft,Netherlands,June1999.

C.Polychronopoulos,N.Bitar,andS.Kleiman.Nanoth-reads:Auser-levelthreadsarchitecture.TechnicalRe-portCSRD-1297,CSRD,UniversityofIllinoisatUrbana-Champaign,1993.SGI.TopicsInIrixProgramming,Chapter4.http://techpubs.sgi.com.

SunMicrosystems.Solaris8ReferenceManualCollection,Section3:ThreadsandRealtimeLibraryFunctions,March2000.http://docs.sun.com.

S.C.Woo,M.Ohara,E.Torrie,J.P.Singh,andA.Gupta.Thesplash-2programs:Characterizationandmethodologi-calconsiderations.InProceedingsofthe22ndAnnualInter-nationalSymposiumonComputerArchitecture(ISCA’95),pages24–36,June1995.

Proceedings of the 2001 International Conference on Parallel Processing (ICPP’01) 0190-3918/01 $10.00 © 2001 IEEE

因篇幅问题不能全部显示,请点此查看更多更全内容

Top