Informing Algorithms for Efficient Scheduling of Synchronizing Threads on Multiprogrammed SMPs
Christos D. Antonopoulos, Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou
High Performance Information Systems Laboratory, Computer Engineering and Informatics Department, University of Patras, 26500 Patras, GREECE ({cda, tsp}@hpclab.ceid.upatras.gr)
Computers and Systems Research Lab, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Str., Urbana, IL 61801, U.S.A. (dsn@hpclab.ceid.upatras.gr)
Abstract
We present novel algorithms for efficient scheduling of synchronizing threads on multiprogrammed SMPs. The algorithms are based on intra-application priority control of synchronizing threads. We refer to such algorithms with the term "informing algorithms". Prerequisites for informing algorithms are the use of an efficient communication medium between the user- and kernel-level and the existence of in-kernel mechanisms that allow the applications to cooperate with the OS scheduler. The applications are given the opportunity to influence, in a non-intrusive manner, the scheduling decisions concerning their threads. We compare the performance of our informing algorithms with the performance of corresponding scheduler-oblivious algorithms under multiprogramming. We experimented on a small-scale, Intel x86-based SMP running Linux, using both microbenchmarks and applications from the Splash-2 benchmark suite. The results substantiate the superiority of our approach and indicate that the philosophy of informing algorithms may be applicable in a wide range of algorithms and architectures.
1 Introduction
A typical class of applications suffering severe performance degradation in the presence of multiprogramming are multithreaded applications with frequently synchronizing threads [6]. It has been shown that the problem of poor performance under multiprogramming originates
This work has been supported by the Hellenic General Secretariat of Research and Technology (G.S.R.T.) research program 99-E566.
This work has been carried out while the second author was with the High Performance Information Systems Laboratory, University of Patras, Greece.
mainly from the poor scheduling of synchronizing threads on physical processors [4], [6]. Idling time at synchronization points may constitute a significant fraction of execution time if one or more of the following conditions hold: a) a thread is preempted while holding a lock, b) a preempted lock-waiter thread remains preempted after it has been granted the lock by the previous holder, or c) a thread actively waits for other threads to reach a specific point (i.e. a barrier), while some of these threads are preempted. The performance degradation is worse for synchronization algorithms designed to provide performance scalability on a large number of processors or fair arbitration between contending processors [6]. These algorithms introduce, either implicitly or explicitly, a strict ordering of synchronization operations performed by different threads. However, this ordering generally differs from the order threads are scheduled on physical processors by the OS scheduler.

The problem of scheduling synchronizing threads in a multiprogramming environment has not been adequately addressed, if at all, in contemporary commercial SMP schedulers for small- and medium-scale systems. In this paper, we introduce a kernel-level infrastructure of general applicability. This infrastructure arms multithreaded applications with mechanisms that allow them to control, in a non-intrusive manner, the execution of their own threads. In particular, each application assigns user-level priorities to its threads. These priorities reflect the importance of each thread for the progress of the application. The OS scheduler, in turn, assigns physical processors to applications based on the scheduling policy imposed by the operating system designers. The priorities provided by the applications are taken into account during the allocation of physical processors to specific threads. The efficient communication between applications and the OS scheduler is facilitated by the use of memory pages, shared between user- and kernel-level [11], [12]. The user-level provided priorities have application-wide scope, thus they do not interfere with OS scheduling decisions concerning other applications. This guarantees that the fairness imposed by the OS scheduler is not affected. Moreover, this approach makes the kernel-level mechanisms independent of and oblivious to the algorithm using them. The mechanisms can thus be used by diverse applications, provided that the latter are capable of quantifying the importance of each thread for the progress of the application. Taking advantage of this property, we have implemented six different informing synchronization algorithms using a common set of kernel mechanisms. Our algorithms can be used for intra-application synchronization only. Extending them to support inter-application synchronization would harm the non-intrusiveness property, given the fact that applications would be allowed to alter the user-level priorities of threads belonging to other applications.

The character of the informing synchronization algorithms presented in this paper is proactive. They guarantee that no lock-waiters or barrier-waiters will be granted a processor while there exist, in the same application, preempted lock-holders or threads that have not reached the barrier yet. The target operating system in this study is Linux. We have modified the Linux kernel (version 2.2.15) in order to provide the general mechanisms needed by informing algorithms. Both the informing synchronization algorithms and the corresponding scheduler-oblivious versions are implemented and provided as a run-time synchronization library.

The rest of this paper is organized as follows. Section 2 discusses related work. In section 3 we present details on the kernel-level design and implementation. In section 4 we describe the novel informing algorithms and their characteristics. In section 5 we present an experimental evaluation of both the novel algorithms and the corresponding scheduler-oblivious counterparts on an Intel x86-based SMP, and section 6 concludes the paper.
2 Related Work
The performance degradation suffered by parallel programs with frequent synchronization among their threads, especially in the presence of multiprogramming, has been identified by several researchers in the past. In this section, we outline some of the proposed solutions.
The decision on whether a thread that waits at a synchronization point should actively spin, competitively spin (spin for a certain time-window and then block) or immediately block has been a hot-spot in research. In [5], A. Karlin et al. reach the conclusion that adaptive competitive spinning algorithms generally outperform static competitive spinning ones. The latter are considered, in their turn, better than actively spinning and immediate blocking algorithms. The evaluation has been carried out on the Firefly multiprocessor. However, in [7] B. Lim and A. Agarwal reach a different conclusion. Their experiments on the Alewife multiprocessor show that immediate blocking always sustains performance close to the best among all strategies compared. The contradiction of results is attributed to the fact that the context switch on Alewife is significantly faster than on Firefly, which benefits immediate blocking algorithms.

Another approach is the use of non-blocking synchronization algorithms. Such algorithms have, by definition, no critical sections during which a thread preemption might cause performance problems. They are easily applicable to simple data structures and have proven to be quite efficient under multiprogramming [10]. Unfortunately, they are based on universal atomic primitives which can atomically change the contents of short (typically up to 64 bits), contiguous memory areas. If the data structure is not that short, more complex algorithms must be used. These algorithms involve copying of the contents of the data structure and are generally inefficient.
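As an illustration of this class of algorithms (not drawn from the paper itself), a shared counter whose entire state fits in a single machine word can be updated lock-free with a compare-and-swap loop; C11 atomics are assumed:

    #include <stdatomic.h>

    /* Hypothetical lock-free counter: the whole shared state is one word,
     * so a single compare-and-swap suffices and there is no critical section
     * whose owner could be preempted. */
    typedef struct {
        atomic_long value;
    } lockfree_counter_t;

    static long counter_add(lockfree_counter_t *c, long delta)
    {
        long old = atomic_load(&c->value);
        /* Retry until the CAS succeeds; a failed CAS reloads 'old' with the
         * current value, so no other thread can be blocked by this one. */
        while (!atomic_compare_exchange_weak(&c->value, &old, old + delta))
            ;
        return old + delta;
    }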
There has been work in the past to provide synchronization algorithms that cooperate with the OS scheduler. In [6] L. Kontothanassis et al. present scheduler-conscious versions of a variety of algorithms. They assume that a thread can influence the scheduler enough to prevent itself from being preempted in a short critical section. It is the application's responsibility to cope with the decisions of the scheduler over longer periods of time. However, the application is given, even for short time periods, the opportunity to influence scheduling decisions, possibly at the expense of the other applications. Moreover, the scheduler-conscious algorithms alter basic properties of their scheduler-oblivious counterparts. For example, they do not preserve the FCFS property of several mutual exclusion algorithms.
In [3] D. Black proposes a priority scheme with user-level provided priorities for threads that contend for mutually exclusive access to atomic regions. These priorities have global scope, among all threads in the system, and their role is to cope, a posteriori, with inopportune preemptions of lock-holder processes. The global scope of user-level provided priorities is unacceptable in a real, multiuser operating system, because it provides a means of overriding the standard scheduling policy, thus harming fairness. Another, similar a posteriori approach is presented by T. Anderson et al. in [1]. If a thread executing in a critical section is found preempted, it is resumed, via a user-level context switch, and executed until exiting the critical section. Finally, in [8] B. Marsh et al. introduce a "two-minute warning" mechanism, used to avoid inopportune preemptions of threads executing code in a critical section. The threads are informed that they are going to be preempted and are granted a "grace" time period to exit the critical section. Barrier synchronization is not dealt with in any of the three previous approaches.
The Sun Solaris operating system supports a mechanism
which allows a thread to provide hints to the OS scheduler that it should, if possible, not be preempted [14]. This mechanism can be used for the implementation of weak preemption-safe locks. The Cellular IRIX [13] and Linux operating systems, on the other hand, provide competitive spinning synchronization primitives.
3 Kernel-Level Design and Implementation
The kernel-level implementation has been carried out in the 2.2.15 Linux kernel and originates from our previous work on the same operating system [11]. The applications declare that they are going to use the mechanisms we introduce via a specific system call at the beginning of their execution life. The kernel mechanisms are based on the existence and use of an efficient user-/kernel-level bidirectional communication interface (the Shared Arena). The Shared Arena is implemented as a memory page shared between each application in the system that uses our mechanisms and the OS kernel. This implementation has the advantage of minimal communication overhead, equal to that of simple memory reads and writes. The applications use the Shared Arena in order to assign priorities to their threads. The kernel, in turn, uses it to inform the applications of the number of processors allocated to them at any given time and of the exact state of their threads (running, preempted).
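The exact layout of the Shared Arena is not given in the paper; the following C sketch only illustrates the kind of fields described above (per-thread priorities and states, plus the processor count), and all names and sizes are assumptions:

    #include <stdint.h>

    #define MAX_APP_THREADS 64          /* assumed per-application thread limit */

    /* Assumed thread states reported by the kernel. */
    enum thread_state { THREAD_RUNNING = 0, THREAD_PREEMPTED = 1 };

    /* Hypothetical layout of the memory page shared between one application and
     * the kernel: the application writes the priorities, the kernel writes the
     * thread states and the number of processors it has granted. */
    struct shared_arena {
        volatile uint32_t priority[MAX_APP_THREADS];   /* written by user level */
        volatile uint32_t state[MAX_APP_THREADS];      /* written by the kernel */
        volatile uint32_t granted_processors;          /* processors currently allocated */
        volatile uint32_t num_threads;                 /* threads of this application */
    };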
At the end of each time quantum, or when a processor is idling, the operating system scheduler selects the most appropriate thread to be executed on that processor. This step of the selection process enforces the OS scheduling rules. If the selected thread belongs to an application which uses our mechanisms, a second-level scheduler is invoked. That scheduler selects the thread with the highest priority, as specified from user-level, among the threads of the same application and allocates the processor to it. This policy guarantees that each application will be granted exactly the execution time it would be granted without our kernel extensions, thus preserving fairness among applications. At the same time, it allows applications to use their CPU time in a more effective way.
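A rough sketch of the second-level selection rule, under the hypothetical shared_arena layout above (the real kernel code operates on its internal run queues, so this only illustrates the decision, not the implementation):

    /* Hypothetical second-level selection: the OS scheduler has already picked
     * the application (inter-application fairness is untouched); this step only
     * decides which of that application's preempted threads gets the CPU. */
    static int pick_thread(const struct shared_arena *arena)
    {
        int best = -1;
        uint32_t best_prio = 0;

        for (uint32_t t = 0; t < arena->num_threads; t++) {
            if (arena->state[t] != THREAD_PREEMPTED)
                continue;                   /* already running on some processor */
            if (best < 0 || arena->priority[t] > best_prio) {
                best = (int)t;
                best_prio = arena->priority[t];
            }
        }
        return best;    /* index of the thread to dispatch, or -1 if none waits */
    }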
Every decrease of the user-level priority of a thread may result in a situation where higher-priority threads are preempted. If this is the case, the thread hands off its processor in favor of the higher-priority thread. This is achieved by executing a system call which effectively initiates a new second-level scheduling pass for the specific application.
The results of all scheduling decisions that affect applications using our mechanisms are communicated immediately to the user-level via updates of the values of the appropriate fields in the Shared Arena.
The user-level priority control is clearly non-intrusive, in the sense that an application cannot affect the scheduling of other applications in the system. Furthermore, the mechanisms are absolutely independent of the algorithms used at user-level. The OS kernel is not provided with any information and does not make any assumptions about the nature of the algorithm. This makes our kernel mechanisms usable by a multitude of informing algorithms.

4 The Implementation of Informing Synchronization Algorithms

The novel synchronization algorithms we present take advantage of the kernel mechanisms described in section 3. They provide hints to the OS scheduler, via priorities assigned to threads from user-level, in order to improve the scheduling of synchronizing threads. The algorithms are representative of two of the most frequently used synchronization schemes in shared-memory systems: mutual exclusion and barrier synchronization. We have implemented a simple test-and-set lock with local spinning (TTAS Lock), an array-based queueing lock (Queue Lock), a variant of the list-based queueing lock proposed by Mellor-Crummey and Scott (MCS Lock) [9], a ticket lock (Ticket Lock), a centralized incremental barrier (Incremental Barrier) and a tree barrier based on the algorithm proposed by Mellor-Crummey and Scott (MCS Barrier) [9].
The organization of the priority classes is based on the following rules: a) A thread that holds a lock cannot be preempted in favor of any other thread. b) A thread that waits for a lock which is currently owned by another thread is not executed at the expense of a thread not participating in synchronization operations. The thread that waits for a lock will not execute any useful computation, while a thread that is not synchronizing with others probably will. The only exception is lock-waiters which will be granted the lock next, in algorithms that impose an ordering of lock acquisitions. In such cases, processor utilization can be sacrificed during short periods, in order to ensure that the thread will proceed to the execution inside the critical region as soon as it is handed the lock. c) A thread that has reached a barrier and waits for its peer threads to reach the barrier as well is assigned lower priority than threads that do not participate in synchronization operations. A run-time system cannot know whether threads that currently execute computation without synchronizing will, at the end of the computation phase, participate in a barrier or not. This means that the priority of the latter cannot be increased. The only possible alternative is to decrease the priority of threads that have already reached the barrier.
In our implementation we define and use five priority classes. The basic priority (RUNNING_PRIORITY) is the priority of threads that do not currently participate in any synchronization operation. The highest priority is assigned to threads that hold a lock (LOCK_HOLDER_PRIORITY). Threads that wait for a lock constitute the priority class
below basic priority (LOCK_WAITER_PRIORITY). If a mutual exclusion synchronization algorithm hands the lock to waiter threads in deterministic order, the priority of the thread to be granted the lock next is set to LOCK_IMMEDIATE_WAITER_PRIORITY. This priority class is lower than LOCK_HOLDER_PRIORITY and equal to, or higher than, RUNNING_PRIORITY. Finally, threads that have reached a barrier and wait for their peers to reach the barrier too constitute the weakest priority class (BARRIER_WAITER_PRIORITY).
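The five classes can be summarized as an ordered enumeration; the numeric values below are illustrative and chosen only to respect the ordering described in the text:

    /* Illustrative numeric encoding of the five classes; only the ordering
     * matters. LOCK_IMMEDIATE_WAITER_PRIORITY must lie between RUNNING_PRIORITY
     * (inclusive) and LOCK_HOLDER_PRIORITY (exclusive). */
    enum user_priority {
        BARRIER_WAITER_PRIORITY        = 0,  /* reached the barrier, waiting for peers    */
        LOCK_WAITER_PRIORITY           = 1,  /* waiting for a lock held by another thread */
        RUNNING_PRIORITY               = 2,  /* basic: not synchronizing at the moment    */
        LOCK_IMMEDIATE_WAITER_PRIORITY = 3,  /* will be granted the lock next             */
        LOCK_HOLDER_PRIORITY           = 4   /* currently holds a lock                    */
    };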
Right before participating in a synchronization operation, each thread saves the value of its user-level provided priority. The saved value is used to restore the priority right after the synchronization operation is over. This technique allows threads to participate in nested levels of synchronization. The assignment of user-level provided priorities to threads is achieved, in most cases, with simple write instructions. However, a goal of our design is to eliminate the possibility of allowing, at any time snapshot, the priority assignment to be inconsistent with the actual status of the threads in the context of synchronization operations. This might happen due to races, if two or more threads attempt, at the same time, to change a user-level priority. Such an inopportune priority assignment could result in performance inferior to the performance sustained by non-informing synchronization algorithms. In order to eliminate this possibility, we have identified the cases in which a race might occur and used atomic instructions for the priority update. In all cases, if there are contending threads they are exactly two and the atomic instruction is executed only once. One of the contenders will succeed, and the other will fail and abort the operation. This means that the additional overhead on the memory subsystem is negligible.
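A sketch of the save/restore discipline and of the single-shot atomic update, assuming the thread's priority word in the Shared Arena is accessible as a C11 atomic; all names are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Save/restore discipline: the priority is saved before a synchronization
     * operation and restored afterwards, which allows nesting (e.g. a lock
     * acquired between two barrier episodes). 'slot' is assumed to be the
     * thread's priority word in the Shared Arena. */
    static inline uint32_t enter_sync_op(_Atomic uint32_t *slot, uint32_t new_prio)
    {
        uint32_t saved = atomic_load(slot);
        atomic_store(slot, new_prio);
        return saved;                 /* kept by the caller, e.g. in its queue node */
    }

    static inline void exit_sync_op(_Atomic uint32_t *slot, uint32_t saved)
    {
        atomic_store(slot, saved);    /* restore the pre-synchronization priority */
    }

    /* Racy transitions use a single compare-and-swap: at most two threads contend,
     * the CAS is attempted exactly once, and the loser abandons its update. */
    static inline int try_set_priority(_Atomic uint32_t *slot,
                                       uint32_t expected, uint32_t desired)
    {
        return atomic_compare_exchange_strong(slot, &expected, desired);
    }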
Beyond informing algorithms, we have implemented two versions of scheduler-oblivious algorithms: an active spinning and an immediate blocking one. As mentioned in section 2, the decision whether a thread waiting at a synchronization point should actively spin, competitively spin or block is crucial for performance. Our algorithms make an optimal spin vs. block decision: a thread will block only if the application is granted fewer processors than it has threads and there is a preempted thread of higher user-level priority. Competitive spinning strategies have not been evaluated. The context switch overhead for the system we have experimented on ranges from 3 to 22 μs, so we expect the performance differentiation between competitive spinning and immediate blocking algorithms to be marginal [7].
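The spin vs. block test itself reduces to two reads of the Shared Arena plus a scan of the thread states; a sketch under the same assumed layout follows (the actual hand-off is performed by the system call mentioned in section 3):

    /* Sketch of the spin vs. block test: block (hand off the processor) only
     * when the application has been granted fewer processors than it has
     * threads AND a preempted thread of higher user-level priority exists.
     * Otherwise keep spinning. */
    static int should_hand_off(const struct shared_arena *arena, uint32_t my_prio)
    {
        if (arena->granted_processors >= arena->num_threads)
            return 0;                  /* no processor shortage: spinning is optimal */

        for (uint32_t t = 0; t < arena->num_threads; t++)
            if (arena->state[t] == THREAD_PREEMPTED && arena->priority[t] > my_prio)
                return 1;              /* yield via the hand-off system call */

        return 0;
    }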
It must be noted that informing algorithms preserve, in contrast with the scheduler-conscious algorithms of L. Kontothanassis et al. [6], the main characteristics of their simple counterparts (time complexity, memory and network overhead, FCFS service of requests etc.). Their memory requirements are generally higher than those of the simple algorithms. However, they are, with the exceptions of the Ticket Lock and the Incremental Barrier, of the same complexity with respect to the number of synchronizing threads.
In the following subsection we describe the characteristics and implementation of the informing MCS Lock algorithm. The presentation of the remaining five informing synchronization algorithms, which has been omitted due to space limitations, can be found in [2].
4.1 MCS Lock
The MCS lock is a list-based queueing lock, proposed by J. Mellor-Crummey and M. Scott [9]. It guarantees FIFO ordering of lock acquisitions. All spins are executed on locally-accessible flag variables. It works equally well, in terms of network overhead, on machines with and without coherent caches. Each lock requires memory space proportional to the number of contending threads.
Each thread that attempts to acquire the lock inserts a node, which represents the thread, at the tail of a queue of waiting threads. The next field of the node, which is a pointer to the next node in the queue, must be initialized to NULL prior to the insertion. The node insertion is achieved with an atomic fetch_and_store instruction. If the queue was previously empty, the thread is the lock-holder and proceeds to the critical section. If this was not the case, the thread sets the locked field of its node to true, to indicate that it is a lock-waiter, and makes the next field of its predecessor node point to its node. The predecessor is the node returned by the atomic fetch_and_store instruction. Then, the thread waits (either spinning or blocking) for the locked field of its node to become false.
In order to release the lock, the lock-holder initially checks if its node's next field points to a successor node. If there is no successor set, the lock-holder tries to atomically (compare_and_swap) set the tail of the queue to NULL. A success completes the lock release. A failure implies the existence of a thread which has inserted its node at the tail of the queue, but has not updated the next field of its predecessor yet. In this case, the lock-holder thread waits for the update to complete and then sets the locked field of the successor to false. If the initial examination of the next field indicates the existence of a successor, its locked field is simply set to false.
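For reference, the scheduler-oblivious acquire and release just described can be sketched with C11 atomics as follows; the informing version in Figure 1 layers the priority manipulation on top of exactly this structure:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;
    } mcs_node_t;

    typedef struct {
        mcs_node_t *_Atomic tail;      /* NULL when the lock is free */
    } mcs_lock_t;

    static void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me)
    {
        atomic_store(&me->next, NULL);
        /* fetch_and_store: append our node and learn who preceded us */
        mcs_node_t *pred = atomic_exchange(&lock->tail, me);
        if (pred != NULL) {
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;                       /* local spin on our own flag */
        }
    }

    static void mcs_release(mcs_lock_t *lock, mcs_node_t *me)
    {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            /* compare_and_swap: if we are still the tail, the queue becomes empty */
            if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
                return;
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                       /* a late enqueuer will set our next field */
        }
        atomic_store(&succ->locked, false);   /* hand the lock to the successor */
    }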
The pseudocode of an informing MCS lock is depicted in figure 1. The node of each thread is augmented with two additional fields: the thread identifier (id) and the field used for saving the initial user-level priority (previous_priority). A thread that attempts to acquire the lock initially saves its id and user-level priority to the corresponding fields of its node (line 5). It then sets the pointer to the next node equal to NULL (line 6) and inserts the node at the tail of the queue, executing an atomic fetch_and_store instruction.
 1  mcs_lock_init(lock_queue) {
 2      Initialize the head of lock_queue to NULL;
 3  }
 4  mcs_lock(lock_queue, my_node, my_id) {
 5      Save both the previous user-level priority of the my_id thread and my_id to my_node.previous_priority and my_node.id;
 6      Set my_node.next = NULL;
 7      Execute a fetch_and_store atomic instruction to insert my_node at the tail of the lock_queue and get the previous value of the tail (predecessor);
 8      If the predecessor is not NULL {
 9          Set my_node.locked = true to indicate that this thread is not the lock owner;
10          If the predecessor is the lock owner {
11              Set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY;
12              Put my_node in the queue after the predecessor
13                  (predecessor.next = my_node); }
14          else {
15              Put my_node in the queue after the predecessor;
16              If the thread did not in the meantime become the lock owner, try to atomically (compare_and_swap)
17                  set its user-level priority to LOCK_WAITER_PRIORITY; }
18          While this thread is not the lock owner (my_node.locked = true) {
19              If the processors granted by the scheduler are less than the threads of the application and there is a preempted
20                  thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread; } }
21      This thread is now the lock owner:
22          Set its user-level priority to LOCK_HOLDER_PRIORITY;
23      If there is a successor of this thread's node in the lock_queue (my_node.next <> NULL) {
24          Set its user-level priority to LOCK_IMMEDIATE_WAITER_PRIORITY; }
25  }
26
27  mcs_unlock(lock_queue, my_id) {
28      If there is no successor node in the lock_queue (my_node.next = NULL) {
29          If we are successful at atomically (compare_and_swap) changing the lock_queue tail from pointing to my_node to NULL {
30              Restore the user-level priority of this thread using the value previously saved in my_node;
31              If the processors granted by the scheduler are less than the threads of the application and there is a preempted
32                  thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread;
33              Return; }
34          else {
35              Some node is trying to enter the queue as our successor; wait for it. In the meantime, set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY (equal to that of the thread to be inserted).
36              If the granted processors are less than the threads of the application and there are preempted threads with higher user-level priority, hand off the processor to them; } }
37
38      Set the priority of this thread to LOCK_HOLDER_PRIORITY again;
39      Set the priority of this thread's successor (succ_node) in the lock_queue to LOCK_HOLDER_PRIORITY;
40      Set succ_node.locked to false to indicate that that thread is now the lock owner;
41      Restore the user-level priority of this thread using the previously saved value;
42      If the processors granted by the scheduler are less than the threads of the application and there is a preempted thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread;
43  }

Figure 1. Informing MCS Lock pseudocode
If there was a node in the wait-queue prior to the insertion, it is examined whether that node is the lock-holder or not. If the predecessor is the lock-holder, the user-level priority of the current thread is set equal to LOCK_IMMEDIATE_WAITER_PRIORITY and the node is inserted in the queue after the predecessor (predecessor.next = my_node, lines 10-13). If this is not the case, the thread places its node in the queue after the predecessor and tries to atomically set its user-level priority to LOCK_WAITER_PRIORITY (lines 14-17). In any case, after the priority assignment, the thread polls the value of the locked field until the latter is found false. In the meantime, the thread may hand off its processor to other threads, should that be necessary (lines 18-20). When the thread is eventually granted the lock, it increases its user-level priority to LOCK_HOLDER_PRIORITY. If there is a successor node in the queue, the user-level priority of the corresponding thread is set equal to LOCK_IMMEDIATE_WAITER_PRIORITY.
During the lock release, the thread examines whether there is a successor node in the wait-queue (line 28). If there is not, the thread tries to atomically (compare_and_swap) set the tail of the queue, which currently points to its node, to NULL (line 29). If the atomic instruction is successful, the user-level priority of the thread is restored using the value saved in previous_priority. The thread then checks whether there are preempted threads of higher user-level priority and, if this is the case, it hands off the processor to one of them (lines 30-33). A failure of the atomic instruction (line 29) indicates the existence of a thread which has inserted its node at the tail of the wait-queue, but has not yet updated the next field of its predecessor. The lock-holder decreases its user-level priority to LOCK_IMMEDIATE_WAITER_PRIORITY and hands off its processor, in order to give the thread trying to enter the queue the opportunity to execute and update the next field of the lock-holder's node (lines 34-36). If (or when) the lock-holder has a successor node, it restores its user-level priority to LOCK_HOLDER_PRIORITY and sets the priority of the successor to the same value (lines 38-39). Consequently, the locked field of the successor node is set to false, in order to grant the lock to the corresponding thread (line 40). Finally, the previous lock-holder restores its user-level priority using the value saved in previous_priority and hands off its processor in favor of another thread, should that be necessary (lines 41-42).
The use of an atomic instruction in line 16 helps eliminate a potential race hazard which would occur if the predecessor thread was granted the lock after the check. Should that happen, the predecessor would attempt to set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY. The next field of the predecessor must be updated prior to decreasing the priority (line 15). Doing the opposite could result in complications if the thread is preempted before updating the next field of the predecessor. If the predecessor thread is the lock-holder, its next field can be updated after the priority assignment, as the new priority of this thread (LOCK_IMMEDIATE_WAITER_PRIORITY) will be high enough to avoid deadlock. This raises the need for an atomic instruction for the priority assignment.
Failing to increase the priority of the lock-holder in the lock-release phase (line 38) could result in undesirable complications (even deadlock, if only one physical processor
has been granted to the application), if the lock-holder is preempted after the increase of the user-level priority of its successor. In such a case, the current lock-holder would not be given the opportunity to set the locked field of its successor to false, in order to allow the latter to enter the critical section.

5 Experimental Evaluation

In this section, we present an evaluation of the performance of informing synchronization algorithms. We are particularly concerned with their behaviour under multiprogramming. The comparison involves informing algorithms and their simple counterparts.
We have carried out our evaluation on a Compaq Proliant 5500 system. It is equipped with 4 Pentium Pro processors clocked at 200 MHz, with 512 KBytes of L2 cache each. The main memory is 512 MBytes. For the purposes of our evaluation we have used both synthetic microbenchmarks and real computational kernels and applications from the Splash-2 benchmark suite [15]. The workloads consist of one or more identical instances of a microbenchmark or an application. Each instance requires 4 processors and the number of concurrently executing instances is equal to the desired degree of multiprogramming. The range of multiprogramming degrees we have experimented with spans from 1 to 8.
The lock microbenchmarks consist of a main loop with alternating synchronizing and working phases. Only one thread can be in the synchronizing phase at any time snapshot. The synchronizing and working phases consist of 1000 and 1500 memory updates (10% locally cached) respectively. The length of the main loop is 4096 iterations, which are evenly distributed among the processors. The microbenchmark is executed 32 times and the reported time is the mean value of all executions. The barrier microbenchmarks are identical to the lock microbenchmarks, with the difference that the main loop consists of the working phase, described earlier, followed by a barrier. The barrier is implemented using the algorithm under evaluation.
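The structure of one thread's share of the lock microbenchmark can be sketched as follows; the lock interface and all names are assumptions, and the cache-locality mix of the updates is not reproduced:

    #define ITERATIONS   4096          /* total loop iterations, split across threads */
    #define SYNC_UPDATES 1000          /* memory updates inside the critical section  */
    #define WORK_UPDATES 1500          /* memory updates outside the critical section */

    static volatile long shared_buf[SYNC_UPDATES];

    /* The lock/unlock function pointers stand in for whichever algorithm of the
     * run-time library is under test. */
    static void lock_microbenchmark(int my_id, int nthreads,
                                    void (*lock)(void), void (*unlock)(void))
    {
        volatile long local_buf[WORK_UPDATES] = {0};   /* thread-private working set */

        for (int i = my_id; i < ITERATIONS; i += nthreads) {
            lock();                                    /* synchronizing phase */
            for (int j = 0; j < SYNC_UPDATES; j++)
                shared_buf[j]++;
            unlock();

            for (int j = 0; j < WORK_UPDATES; j++)     /* working phase */
                local_buf[j]++;
        }
    }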
In order to demonstrate that the performance gains achieved by our informing synchronization algorithms can be considerable in real applications as well, we have used two computational kernels (LU, Cholesky) and one application (Radiosity) from the Splash-2 benchmark suite.
The LU kernel factors a dense matrix into the product of a lower and an upper triangular matrix. Blocking techniques have been used in order to exploit temporal locality and reduce communication. Approximately 20 to 50 percent of the execution time is spent on barrier synchronization operations [15]. We have decomposed a 1024x1024 matrix, using 16x16 blocks. The Cholesky kernel implements blocked, sparse Cholesky factorization. It factors
[Figure 2. Execution times of Splash-2 applications using lock (Cholesky, Radiosity) and barrier (LU) algorithms. Each panel plots execution time (seconds) against the multiprogramming degree for the Kernel, Spin and Block variants of the MCS, Queue, TTAS, Ticket and Incremental algorithms.]
a sparse matrix into the product of a lower triangular matrix and its transpose. It is not globally synchronized between steps. The synchronization is achieved using locks and its overhead ranges between approximately 10 and 70 percent of the total execution time [15]. We have used the standard tk29.O input file, distributed with Cholesky. The Radiosity application computes the equilibrium distribution of light in a scene, using the iterative hierarchical diffuse radiosity method. The structure of the computation is highly irregular. The synchronization operations (locks) contribute approximately 15 to 30 percent to the execution time [15]. We have executed Radiosity in batch mode for the standard room scene. The results attained from the multiprogrammed execution of the Splash-2 computational kernels and applications using informing (Kernel), active spinning (Spin) and immediate blocking (Block) synchronization algorithms are depicted in figure 2.
The microbenchmark results (not shown) indicate that,
for the target architecture, simple algorithms (TTAS Lock and Incremental Barrier) significantly outperform more complex ones in the presence of multiprogramming. For non-informing synchronization algorithms this can be mainly attributed to the fact that simple algorithms do not impose a strict ordering on the synchronization operations executed by different threads, thus they are less sensitive to inopportune scheduling decisions. In the case of informing algorithms, the execution time of the longer instruction sequences required by more complex algorithms contributes a dominant fraction of the total execution time, thus resulting in inferior performance in comparison to simple algorithms. The performance of complex and simple algorithms on a dedicated machine is almost indistinguishable. However, given the fact that complex algorithms are designed for scalability on many processors, we expect them to perform better than their simple counterparts on a larger, dedicated machine. In all cases but the TTAS Lock with multiprogramming degree 2, informing algorithms outperform scheduler-oblivious ones under multiprogramming. Our informing algorithms for locks execute up to 348.4 times faster (on average 177.6 times faster) in comparison to spinning algorithms and up to 8 times faster (on average 5.2 times faster) in comparison to immediate blocking ones. The speedup attained for informing barrier algorithms when compared to spinning ones is up to 1356.4 (average 512.4). The comparison with immediate blocking algorithms leads to speedups of up to 10.5 (average 4.8). These performance gains are indicative of the correctness of our approach. We believe that, in the presence of multiprogramming degree 2, the always-spinning version of the TTAS Lock performs better (0.35 sec / 16.7%) than the informing TTAS Lock because the contention is not high enough for the benefits of our mechanisms to outweigh the extra cost of additional instructions, context switches and cache misses due to thread migrations. Moreover, the informing TTAS Lock cannot fully benefit from our mechanisms, due to its nondeterministic nature.

On a dedicated machine, the microbenchmarks that use informing algorithms may need up to 0.07 seconds (6.89%) more to execute than those implemented with simple always-spinning algorithms. On an unloaded machine, informing synchronization algorithms degenerate to always-spinning algorithms with the cost of some additional instructions in the critical path. These additional instructions might be considered responsible for the, even minimal, slowdown. It is also worth pointing out that for low or no multiprogramming, the execution times of microbenchmarks using immediate blocking algorithms are worse than those of microbenchmarks using always-spinning and informing synchronization algorithms. In the presence of low or no contention, the unnecessary reschedulings and context switches caused by immediate blocking algorithms have a negative effect on performance. The problem is practically eliminated with informing algorithms because of the optimal spin vs. block decision these algorithms reach. However, as the multiprogramming degree rises, the highest possible processor utilization is of vital importance and immediate blocking algorithms outperform always-spinning (but not informing) ones.
The results from the Splash-2 experiments indicate that the performance benefits attained by informing synchronization algorithms are quite sound in real applications as well. In the absence of multiprogramming, informing synchronization algorithms have practically indistinguishable performance from scheduler-oblivious ones. However, in the presence of even minimal multiprogramming, informing algorithms perform much better. The performance of Cholesky (lock-based synchronization) is up to 36.8% (on average 16.7%) higher when informing lock algorithms are used instead of always-spinning ones. The comparison with immediate blocking lock algorithms yields maximum and average improvements of 20% and 11% respectively. Radiosity, which also uses lock-based synchronization, executes up to 26.8% faster (on average 10.7%) if informing lock algorithms are used instead of always-spinning ones, and up to 4.7 times faster (on average 2.5 times faster) in comparison with immediate blocking algorithms. In LU, threads synchronize using barriers. If informing barriers are used instead of always-spinning ones, LU needs up to 6.5 times less execution time (on average 4.2 times less). An immediate blocking implementation results in performance up to 58.8% (on average 31.9%) worse than that of an informing implementation.
Due to differences in our platform and implementation framework, a direct one-to-one comparison of informing synchronization algorithms with the scheduler-conscious algorithms presented in [6] has not been possible. However, an indirect, qualitative comparison of the results shows that informing synchronization algorithms provide improvements of similar or wider margin than those provided by scheduler-conscious synchronization. This can be attributed to the fact that informing algorithms control thread scheduling more effectively. The scheduler-conscious algorithms of Kontothanassis et al. try to avoid preemption of the lock-holder at the expense of all threads in the system. In order to eliminate the chance of malicious exploitation of this feature, the preemption is avoided within short time-windows only. This makes scheduler-conscious algorithms prone to inefficiencies if the critical regions are not short enough. Informing algorithms favor the lock-holder at the expense of other threads of the same application only. This technique averts the danger of malicious exploitation, so the need for a time-window is lifted. Moreover, informing algorithms preserve, as opposed to scheduler-conscious ones, the fairness characteristics of synchronization algorithms, thus reducing the possibility of starvation. Finally, the optimality
of the spin vs. block decision provided by informing algorithms can have a significant impact on performance, especially for barrier algorithms.
6 Conclusions
In this paper we have presented a generally applicable kernel-level infrastructure, which can be used to significantly improve the performance of applications in the presence of multiprogramming. Our mechanisms allow applications to use the processor time they are granted by the OS scheduler more effectively. This is achieved by giving applications the opportunity to inform the scheduler of the relative importance of their threads, by assigning them priorities from user-level. These priorities are taken into account by the OS scheduler in order to decide which specific thread of an application will execute on a processor. The priority control mechanisms are non-intrusive, in the sense that they do not allow applications to improve their performance at the expense of other applications in the system. The kernel infrastructure is general and does not depend on the nature of the application, thus it can be used by many different algorithms. As a proof of concept, we have implemented six different informing synchronization algorithms, i.e. algorithms that use the common kernel mechanisms. We have presented one of them in detail.
In order to evaluate the performance of informing synchronization algorithms under multiprogramming, we have experimented with synthetic microbenchmarks and with computational kernels and applications from the Splash-2 suite. Informing synchronization algorithms perform significantly better than non-informing ones in multiprogrammed environments and have practically negligible overhead in the absence of multiprogramming. Moreover, the comparative evaluation of non-informing algorithms has driven us to the conclusion that simple, immediate blocking synchronization algorithms, like the TTAS Lock or the Incremental Barrier, perform quite well on small-scale machines, even in the presence of multiprogramming. However, even these simple algorithms can benefit from our kernel mechanisms.

We believe that the usage of an infrastructure which allows non-intrusive user-level priority control is not necessarily limited to small-scale SMPs. In any case, the applicability of such mechanisms in larger-scale systems requires thorough investigation, given the fact that such systems generally have different architectural characteristics and may impose significantly higher thread migration and remote memory access costs.
References
[1] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.
[2] C. D. Antonopoulos, D. S. Nikolopoulos, and T. S. Papatheodorou. Informing algorithms for efficient scheduling of synchronizing threads on multiprogrammed SMPs. Technical Report HPCLAB-TR-220101, High Performance Information Systems Laboratory, University of Patras, Patras, Greece, January 2001.
[3] D. Black. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer, 23(5), 1990.
[4] A. Gupta, A. Tucker, and S. Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 120–132, 1991.
[5] A. Karlin, K. Li, M. Manasse, and S. Owicki. Empirical studies of competitive spinning for a shared-memory multiprocessor. In Proceedings of the 13th ACM Symposium on Operating System Principles (SOSP '91), pages 41–55, October 1991.
[6] L. I. Kontothanassis, R. W. Wisniewski, and M. L. Scott. Scheduler-conscious synchronization. ACM Transactions on Computer Systems, 15(1):3–40, February 1997.
[7] B.-H. Lim and A. Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems, 11(3):253–294, August 1993.
[8] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. First-class user-level threads. In 13th ACM Symposium on Operating Systems Principles, October 1991.
[9] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, 1991.
[10] M. M. Michael and M. L. Scott. Relative performance of preemption-safe locking and non-blocking synchronization on multiprogrammed shared memory multiprocessors. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
[11] D. S. Nikolopoulos, C. D. Antonopoulos, I. E. Venetis, P. E. Hadjidoukas, E. D. Polychronopoulos, and T. S. Papatheodorou. Achieving multiprogramming scalability of parallel programs on Intel SMP platforms: Nanothreading in the Linux kernel. In Proceedings of the PARCO 99 Conference, Delft, Netherlands, June 1999.
[12] C. Polychronopoulos, N. Bitar, and S. Kleiman. Nanothreads: A user-level threads architecture. Technical Report CSRD-1297, CSRD, University of Illinois at Urbana-Champaign, 1993.
[13] SGI. Topics In IRIX Programming, Chapter 4. http://techpubs.sgi.com.
[14] Sun Microsystems. Solaris 8 Reference Manual Collection, Section 3: Threads and Realtime Library Functions, March 2000. http://docs.sun.com.
[15] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 24–36, June 1995.