Informing Algorithms for Efficient Scheduling of Synchronizing Threads on Multiprogrammed SMPs
Christos D. Antonopoulos, Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou
High Performance Information Systems Laboratory, Computer Engineering and Informatics Department, University of Patras, 26500 Patras, GREECE ({cda, tsp}@hpclab.ceid.upatras.gr)
Computers and Systems Research Lab, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main Str., Urbana, IL 61801, U.S.A. (dsn@hpclab.ceid.upatras.gr)
Abstract
We present novel algorithms for efficient scheduling of synchronizing threads on multiprogrammed SMPs. The algorithms are based on intra-application priority control of synchronizing threads. We refer to such algorithms with the term "informing algorithms". Prerequisites for informing algorithms are the use of an efficient communication medium between the user- and kernel-level and the existence of in-kernel mechanisms that allow the applications to cooperate with the OS scheduler. The applications are given the opportunity to influence, in a non-intrusive manner, the scheduling decisions concerning their threads. We compare the performance of our informing algorithms with the performance of corresponding scheduler-oblivious algorithms under multiprogramming. We experimented on a small-scale, Intel x86-based SMP running Linux, using both microbenchmarks and applications from the Splash-2 benchmark suite. The results substantiate the superiority of our approach and indicate that the philosophy of informing algorithms may be applicable in a wide range of algorithms and architectures.
1 Introduction
A typical class of applications suffering severe performance degradation in the presence of multiprogramming are multithreaded applications with frequently synchronizing threads [6]. It has been shown that the problem of poor performance under multiprogramming originates
This work has been supported by the Hellenic General Secretariat of Research and Technology (G.S.R.T.) research program 99-E566.
This work has been carried out while the second author was with the High Performance Information Systems Laboratory, University of Patras, Greece.
mainly from the poor scheduling of synchronizing threads on physical processors [4], [6]. Idling time at synchronization points may constitute a significant fraction of execution time if one or more of the following conditions hold: a) a thread is preempted while holding a lock, b) a preempted lock-waiter thread remains preempted after it has been granted the lock by the previous holder, or c) a thread actively waits for other threads to reach a specific point (i.e. a barrier), while some of these threads are preempted. The performance degradation is worse for synchronization algorithms designed to provide performance scalability on a large number of processors or fair arbitration between contending processors [6]. These algorithms introduce, either implicitly or explicitly, a strict ordering of synchronization operations performed by different threads. However, this ordering generally differs from the order threads are scheduled on physical processors by the OS scheduler.

The problem of scheduling synchronizing threads in a multiprogramming environment has not been adequately addressed, if at all, in contemporary commercial SMP schedulers for small- and medium-scale systems. In this paper, we introduce a kernel-level infrastructure of general applicability. This infrastructure arms multithreaded applications with mechanisms that allow them to control, in a non-intrusive manner, the execution of their own threads. In particular, each application assigns user-level priorities to its threads. These priorities reflect the importance of each thread for the progress of the application. The OS scheduler, in turn, assigns physical processors to applications based on the scheduling policy imposed by the operating system designers. The priorities provided by the applications are taken into account during the allocation of physical processors to specific threads. The efficient communication between applications and the OS scheduler is facilitated by the use of memory pages, shared between user- and kernel-level [11], [12]. The user-level provided priorities have application-wide scope, thus they do not interfere with OS scheduling decisions concerning other applications. This guarantees that the fairness imposed by the OS scheduler is not affected. Moreover, this approach makes the kernel-level mechanisms independent of and oblivious to the algorithm using them. The mechanisms can thus be used by diverse applications, provided that the latter are capable of quantifying the importance of each thread for the progress of the application. Taking advantage of this property, we have implemented six different informing synchronization algorithms using a common set of kernel mechanisms. Our algorithms can be used for intra-application synchronization only. Extending them to support inter-application synchronization would harm the non-intrusiveness property, given the fact that applications would be allowed to alter the user-level priorities of threads belonging to other applications.

The character of the informing synchronization algorithms presented in this paper is proactive. They guarantee that no lock-waiters or barrier-waiters will be granted a processor while there exist, in the same application, preempted lock-holders or threads that have not reached the barrier yet. The target operating system in this study is Linux. We have modified the Linux kernel (version 2.2.15) in order to provide the general mechanisms needed by informing algorithms. Both the informing synchronization algorithms and the corresponding scheduler-oblivious versions are implemented and provided as a run-time synchronization library.

The rest of this paper is organized as follows. Section 2 discusses related work. In section 3 we present details on the kernel-level design and implementation. In section 4 we describe the novel informing algorithms and their characteristics. In section 5 we present an experimental evaluation of both the novel algorithms and the corresponding scheduler-oblivious counterparts on an Intel x86-based SMP, and section 6 concludes the paper.
2 Related Work
The performance degradation suffered by parallel programs with frequent synchronization among their threads, especially in the presence of multiprogramming, has been identified by several researchers in the past. In this section, we outline some of the proposed solutions.
The decision on whether a thread that waits at a synchronization point should actively spin, competitively spin (spin for a certain time-window and then block) or immediately block has been a hot-spot in research. In [5], A. Karlin et al. reach the conclusion that adaptive competitive spinning algorithms generally outperform static competitive spinning ones. The latter are considered, in their turn, better than actively spinning and immediate blocking algorithms. The evaluation has been carried out on the Firefly multiprocessor. However, in [7] B. Lim and A. Agarwal reach a different conclusion. Their experiments on the Alewife multiprocessor show that immediate blocking always sustains performance close to the best among all strategies compared. The contradiction of results is attributed to the fact that the context switch on Alewife is significantly faster than on Firefly, which benefits immediate blocking algorithms.

Another approach is the use of non-blocking synchronization algorithms. Such algorithms have, by definition, no critical sections during which a thread preemption might cause performance problems. They are easily applicable to simple data structures and have proven to be quite efficient under multiprogramming [10]. Unfortunately, they are based on universal atomic primitives which can atomically change the contents of short (typically up to 64 bits), contiguous memory areas. If the data structure is not that short, more complex algorithms must be used. These algorithms involve copying of the contents of the data structure and are generally inefficient.
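As an illustration of this class of algorithms (not drawn from the paper itself), a shared counter whose entire state fits in a single machine word can be updated lock-free with a compare-and-swap loop; C11 atomics are assumed:

    #include <stdatomic.h>

    /* Hypothetical lock-free counter: the whole shared state is one word,
     * so a single compare-and-swap suffices and there is no critical section
     * whose owner could be preempted. */
    typedef struct {
        atomic_long value;
    } lockfree_counter_t;

    static long counter_add(lockfree_counter_t *c, long delta)
    {
        long old = atomic_load(&c->value);
        /* Retry until the CAS succeeds; a failed CAS reloads 'old' with the
         * current value, so no other thread can be blocked by this one. */
        while (!atomic_compare_exchange_weak(&c->value, &old, old + delta))
            ;
        return old + delta;
    }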
There has been work in the past to provide synchronization algorithms that cooperate with the OS scheduler. In [6] L. Kontothanassis et al. present scheduler-conscious versions of a variety of algorithms. They assume that a thread can influence the scheduler enough to prevent itself from being preempted in a short critical section. It is the application's responsibility to cope with the decisions of the scheduler over longer periods of time. However, the application is given, even for short time periods, the opportunity to influence scheduling decisions, possibly at the expense of the other applications. Moreover, the scheduler-conscious algorithms alter basic properties of their scheduler-oblivious counterparts. For example, they do not preserve the FCFS property of several mutual exclusion algorithms.
In [3] D. Black proposes a priority scheme with user-level provided priorities for threads that contend for mutually exclusive access to atomic regions. These priorities have global scope, among all threads in the system, and their role is to cope, a posteriori, with inopportune preemptions of lock-holder processes. The global scope of user-level provided priorities is unacceptable in a real, multiuser operating system, because it provides a means of overriding the standard scheduling policy, thus harming fairness. Another, similar a posteriori approach is presented by T. Anderson et al. in [1]. If a thread executing in a critical section is found preempted, it is resumed, via a user-level context switch, and executed until exiting the critical section. Finally, in [8] B. Marsh et al. introduce a "two-minute warning" mechanism, used to avoid inopportune preemptions of threads executing code in a critical section. The threads are informed that they are going to be preempted and are granted a "grace" time period to exit the critical section. Barrier synchronization is not dealt with in any of the three previous approaches.
The Sun Solaris operating system supports a mechanism
which allows a thread to provide hints to the OS scheduler that it should, if possible, not be preempted [14]. This mechanism can be used for the implementation of weak preemption-safe locks. The Cellular IRIX [13] and Linux operating systems, on the other hand, provide competitive spinning synchronization primitives.
3 Kernel-Level Design and Implementation
The kernel-level implementation has been carried out in the 2.2.15 Linux kernel and originates from our previous work on the same operating system [11]. The applications declare that they are going to use the mechanisms we introduce via a specific system call at the beginning of their execution life. The kernel mechanisms are based on the existence and use of an efficient user-/kernel-level bidirectional communication interface (the Shared Arena). The Shared Arena is implemented as a memory page shared between each application in the system that uses our mechanisms and the OS kernel. This implementation has the advantage of minimal communication overhead, equal to that of simple memory reads and writes. The applications use the Shared Arena in order to assign priorities to their threads. The kernel, in turn, uses it to inform the applications of the number of processors allocated to them at any given time and of the exact state of their threads (running, preempted).
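The exact layout of the Shared Arena is not given in the paper; the following C sketch only illustrates the kind of fields described above (per-thread priorities and states, plus the processor count), and all names and sizes are assumptions:

    #include <stdint.h>

    #define MAX_APP_THREADS 64          /* assumed per-application thread limit */

    /* Assumed thread states reported by the kernel. */
    enum thread_state { THREAD_RUNNING = 0, THREAD_PREEMPTED = 1 };

    /* Hypothetical layout of the memory page shared between one application and
     * the kernel: the application writes the priorities, the kernel writes the
     * thread states and the number of processors it has granted. */
    struct shared_arena {
        volatile uint32_t priority[MAX_APP_THREADS];   /* written by user level */
        volatile uint32_t state[MAX_APP_THREADS];      /* written by the kernel */
        volatile uint32_t granted_processors;          /* processors currently allocated */
        volatile uint32_t num_threads;                 /* threads of this application */
    };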
At the end of each time quantum, or when a processor is idling, the operating system scheduler selects the most appropriate thread to be executed on that processor. This step of the selection process enforces the OS scheduling rules. If the selected thread belongs to an application which uses our mechanisms, a second-level scheduler is invoked. That scheduler selects the thread with the highest priority, as specified from user-level, among the threads of the same application and allocates the processor to it. This policy guarantees that each application will be granted exactly the execution time it would be granted without our kernel extensions, thus preserving fairness among applications. At the same time, it allows applications to use their CPU time in a more effective way.
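A rough sketch of the second-level selection rule, under the hypothetical shared_arena layout above (the real kernel code operates on its internal run queues, so this only illustrates the decision, not the implementation):

    /* Hypothetical second-level selection: the OS scheduler has already picked
     * the application (inter-application fairness is untouched); this step only
     * decides which of that application's preempted threads gets the CPU. */
    static int pick_thread(const struct shared_arena *arena)
    {
        int best = -1;
        uint32_t best_prio = 0;

        for (uint32_t t = 0; t < arena->num_threads; t++) {
            if (arena->state[t] != THREAD_PREEMPTED)
                continue;                   /* already running on some processor */
            if (best < 0 || arena->priority[t] > best_prio) {
                best = (int)t;
                best_prio = arena->priority[t];
            }
        }
        return best;    /* index of the thread to dispatch, or -1 if none waits */
    }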
Every decrease of the user-level priority of a thread may result in a situation where higher-priority threads are preempted. If this is the case, the thread hands off its processor in favor of the higher-priority thread. This is achieved by executing a system call which effectively initiates a new second-level scheduling pass for the specific application.
The results of all scheduling decisions that affect applications using our mechanisms are communicated immediately to the user-level via updates of the values of the appropriate fields in the Shared Arena.
The user-level priority control is clearly non-intrusive, in the sense that an application cannot affect the scheduling of other applications in the system. Furthermore, the mechanisms are absolutely independent of the algorithms used at user-level. The OS kernel is not provided with any information and does not make any assumptions about the nature of the algorithm. This makes our kernel mechanisms usable by a multitude of informing algorithms.

4 The Implementation of Informing Synchronization Algorithms

The novel synchronization algorithms we present take advantage of the kernel mechanisms described in section 3. They provide hints to the OS scheduler, via priorities assigned to threads from user-level, in order to improve the scheduling of synchronizing threads. The algorithms are representative of two of the most frequently used synchronization schemes in shared-memory systems: mutual exclusion and barrier synchronization. We have implemented a simple test-and-set lock with local spinning (TTAS Lock), an array-based queueing lock (Queue Lock), a variant of the list-based queueing lock proposed by Mellor-Crummey and Scott (MCS Lock) [9], a ticket lock (Ticket Lock), a centralized incremental barrier (Incremental Barrier) and a tree barrier based on the algorithm proposed by Mellor-Crummey and Scott (MCS Barrier) [9].
The organization of the priority classes is based on the following rules: a) A thread that holds a lock cannot be preempted in favor of any other thread. b) A thread that waits for a lock which is currently owned by another thread is not executed at the expense of a thread not participating in synchronization operations. The thread that waits for a lock will not execute any useful computation, while a thread that is not synchronizing with others probably will. The only exception is lock-waiters which will be granted the lock next, in algorithms that impose an ordering of lock acquisitions. In such cases, processor utilization can be sacrificed during short periods, in order to ensure that the thread will proceed to the execution inside the critical region as soon as it is handed the lock. c) A thread that has reached a barrier and waits for its peer threads to reach the barrier as well is assigned lower priority than threads that do not participate in synchronization operations. A run-time system cannot know whether threads that currently execute computation without synchronizing will, at the end of the computation phase, participate in a barrier or not. This means that the priority of the latter cannot be increased. The only possible alternative is to decrease the priority of threads that have already reached the barrier.
In our implementation we define and use five priority classes. The basic priority (RUNNING_PRIORITY) is the priority of threads that do not currently participate in any synchronization operation. The highest priority is assigned to threads that hold a lock (LOCK_HOLDER_PRIORITY). Threads that wait for a lock constitute the priority class
below basic priority (LOCK_WAITER_PRIORITY). If a mutual exclusion synchronization algorithm hands the lock to waiter threads in deterministic order, the priority of the thread to be granted the lock next is set to LOCK_IMMEDIATE_WAITER_PRIORITY. This priority class is lower than LOCK_HOLDER_PRIORITY and equal to, or higher than, RUNNING_PRIORITY. Finally, threads that have reached a barrier and wait for their peers to reach the barrier too constitute the weakest priority class (BARRIER_WAITER_PRIORITY).
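The five classes can be summarized as an ordered enumeration; the numeric values below are illustrative and chosen only to respect the ordering described in the text:

    /* Illustrative numeric encoding of the five classes; only the ordering
     * matters. LOCK_IMMEDIATE_WAITER_PRIORITY must lie between RUNNING_PRIORITY
     * (inclusive) and LOCK_HOLDER_PRIORITY (exclusive). */
    enum user_priority {
        BARRIER_WAITER_PRIORITY        = 0,  /* reached the barrier, waiting for peers    */
        LOCK_WAITER_PRIORITY           = 1,  /* waiting for a lock held by another thread */
        RUNNING_PRIORITY               = 2,  /* basic: not synchronizing at the moment    */
        LOCK_IMMEDIATE_WAITER_PRIORITY = 3,  /* will be granted the lock next             */
        LOCK_HOLDER_PRIORITY           = 4   /* currently holds a lock                    */
    };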
Right before participating in a synchronization operation, each thread saves the value of its user-level provided priority. The saved value is used to restore the priority right after the synchronization operation is over. This technique allows threads to participate in nested levels of synchronization. The assignment of user-level provided priorities to threads is achieved, in most cases, with simple write instructions. However, a goal of our design is to eliminate the possibility of allowing, at any time snapshot, the priority assignment to be inconsistent with the actual status of the threads in the context of synchronization operations. This might happen due to races, if two or more threads attempt, at the same time, to change a user-level priority. Such an inopportune priority assignment could result in performance inferior to the performance sustained by non-informing synchronization algorithms. In order to eliminate this possibility, we have identified the cases in which a race might occur and used atomic instructions for the priority update. In all cases, if there are contending threads they are exactly two and the atomic instruction is executed only once. One of the contenders will succeed, and the other will fail and abort the operation. This means that the additional overhead on the memory subsystem is negligible.
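A sketch of the save/restore discipline and of the single-shot atomic update, assuming the thread's priority word in the Shared Arena is accessible as a C11 atomic; all names are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Save/restore discipline: the priority is saved before a synchronization
     * operation and restored afterwards, which allows nesting (e.g. a lock
     * acquired between two barrier episodes). 'slot' is assumed to be the
     * thread's priority word in the Shared Arena. */
    static inline uint32_t enter_sync_op(_Atomic uint32_t *slot, uint32_t new_prio)
    {
        uint32_t saved = atomic_load(slot);
        atomic_store(slot, new_prio);
        return saved;                 /* kept by the caller, e.g. in its queue node */
    }

    static inline void exit_sync_op(_Atomic uint32_t *slot, uint32_t saved)
    {
        atomic_store(slot, saved);    /* restore the pre-synchronization priority */
    }

    /* Racy transitions use a single compare-and-swap: at most two threads contend,
     * the CAS is attempted exactly once, and the loser abandons its update. */
    static inline int try_set_priority(_Atomic uint32_t *slot,
                                       uint32_t expected, uint32_t desired)
    {
        return atomic_compare_exchange_strong(slot, &expected, desired);
    }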
Beyond informing algorithms, we have implemented two versions of scheduler-oblivious algorithms: an active spinning and an immediate blocking one. As mentioned in section 2, the decision whether a thread waiting at a synchronization point should actively spin, competitively spin or block is crucial for performance. Our algorithms make an optimal spin vs. block decision: a thread will block only if the application is granted fewer processors than it has threads and there is a preempted thread of higher user-level priority. Competitive spinning strategies have not been evaluated. The context switch overhead for the system we have experimented on ranges from 3 to 22 μs, so we expect the performance differentiation between competitive spinning and immediate blocking algorithms to be marginal [7].
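The spin vs. block test itself reduces to two reads of the Shared Arena plus a scan of the thread states; a sketch under the same assumed layout follows (the actual hand-off is performed by the system call mentioned in section 3):

    /* Sketch of the spin vs. block test: block (hand off the processor) only
     * when the application has been granted fewer processors than it has
     * threads AND a preempted thread of higher user-level priority exists.
     * Otherwise keep spinning. */
    static int should_hand_off(const struct shared_arena *arena, uint32_t my_prio)
    {
        if (arena->granted_processors >= arena->num_threads)
            return 0;                  /* no processor shortage: spinning is optimal */

        for (uint32_t t = 0; t < arena->num_threads; t++)
            if (arena->state[t] == THREAD_PREEMPTED && arena->priority[t] > my_prio)
                return 1;              /* yield via the hand-off system call */

        return 0;
    }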
It must be noted that informing algorithms preserve, in contrast with the scheduler-conscious algorithms of L. Kontothanassis et al. [6], the main characteristics of their simple counterparts (time complexity, memory and network overhead, FCFS service of requests etc.). Their memory requirements are generally higher than those of the simple algorithms. However, they are, with the exceptions of the Ticket Lock and the Incremental Barrier, of the same complexity with respect to the number of synchronizing threads.
In the following subsection we describe the characteristics and implementation of the informing MCS Lock algorithm. The presentation of the remaining five informing synchronization algorithms, which has been omitted due to space limitations, can be found in [2].
4.1 MCS Lock
The MCS lock is a list-based queueing lock, proposed by J. Mellor-Crummey and M. Scott [9]. It guarantees FIFO ordering of lock acquisitions. All spins are executed on locally-accessible flag variables. It works equally well, in terms of network overhead, on machines with and without coherent caches. Each lock requires memory space proportional to the number of contending threads.
Each thread that attempts to acquire the lock inserts a node, which represents the thread, at the tail of a queue of waiting threads. The next field of the node, which is a pointer to the next node in the queue, must be initialized to NULL prior to the insertion. The node insertion is achieved with an atomic fetch_and_store instruction. If the queue was previously empty, the thread is the lock-holder and proceeds to the critical section. If this was not the case, the thread sets the locked field of its node to true, to indicate that it is a lock-waiter, and makes the next field of its predecessor node point to its node. The predecessor is the node returned by the atomic fetch_and_store instruction. Then, the thread waits (either spinning or blocking) for the locked field of its node to become false.
In order to release the lock, the lock-holder initially checks if its node's next field points to a successor node. If there is no successor set, the lock-holder tries to atomically (compare_and_swap) set the tail of the queue to NULL. A success completes the lock release. A failure implies the existence of a thread which has inserted its node at the tail of the queue, but has not updated the next field of its predecessor yet. In this case, the lock-holder thread waits for the update to complete and then sets the locked field of the successor to false. If the initial examination of the next field indicates the existence of a successor, its locked field is simply set to false.
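For reference, the scheduler-oblivious acquire and release just described can be sketched with C11 atomics as follows; the informing version in Figure 1 layers the priority manipulation on top of exactly this structure:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;
    } mcs_node_t;

    typedef struct {
        mcs_node_t *_Atomic tail;      /* NULL when the lock is free */
    } mcs_lock_t;

    static void mcs_acquire(mcs_lock_t *lock, mcs_node_t *me)
    {
        atomic_store(&me->next, NULL);
        /* fetch_and_store: append our node and learn who preceded us */
        mcs_node_t *pred = atomic_exchange(&lock->tail, me);
        if (pred != NULL) {
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);
            while (atomic_load(&me->locked))
                ;                       /* local spin on our own flag */
        }
    }

    static void mcs_release(mcs_lock_t *lock, mcs_node_t *me)
    {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            /* compare_and_swap: if we are still the tail, the queue becomes empty */
            if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
                return;
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                       /* a late enqueuer will set our next field */
        }
        atomic_store(&succ->locked, false);   /* hand the lock to the successor */
    }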
The pseudocode of an informing MCS lock is depicted in figure 1. The node of each thread is augmented with two additional fields: the thread identifier (id) and the field used for saving the initial user-level priority (previous_priority). A thread that attempts to acquire the lock initially saves its id and user-level priority to the corresponding fields of its node (line 5). It then sets the pointer to the next node equal to NULL (line 6) and inserts the node at the tail of the queue, executing an atomic fetch_and_store instruction.
 1  mcs_lock_init(lock_queue) {
 2      Initialize the head of lock_queue to NULL;
 3  }
 4  mcs_lock(lock_queue, my_node, my_id) {
 5      Save both the previous user-level priority of the my_id thread and my_id to my_node.previous_priority and my_node.id;
 6      Set my_node.next = NULL;
 7      Execute a fetch_and_store atomic instruction to insert my_node at the tail of the lock_queue and get the previous value of the tail (predecessor);
 8      If the predecessor is not NULL {
 9          Set my_node.locked = true to indicate that this thread is not the lock owner;
10          If the predecessor is the lock owner {
11              Set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY;
12              Put my_node in the queue after the predecessor
13                  (predecessor.next = my_node); }
14          else {
15              Put my_node in the queue after the predecessor;
16              If the thread did not in the meantime become the lock owner, try to atomically (compare_and_swap)
17                  set its user-level priority to LOCK_WAITER_PRIORITY; }
18          While this thread is not the lock owner (my_node.locked = true) {
19              If the processors granted by the scheduler are less than the threads of the application and there is a preempted
20                  thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread; } }
21      This thread is now the lock owner:
22          Set its user-level priority to LOCK_HOLDER_PRIORITY;
23      If there is a successor of this thread's node in the lock_queue (my_node.next <> NULL) {
24          Set its user-level priority to LOCK_IMMEDIATE_WAITER_PRIORITY; }
25  }
26
27  mcs_unlock(lock_queue, my_id) {
28      If there is no successor node in the lock_queue (my_node.next = NULL) {
29          If we are successful at atomically (compare_and_swap) changing the lock_queue tail from pointing to my_node to NULL {
30              Restore the user-level priority of this thread using the value previously saved in my_node;
31              If the processors granted by the scheduler are less than the threads of the application and there is a preempted
32                  thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread;
33              Return; }
34          else {
35              Some node is trying to enter the queue as our successor; wait for it. In the meantime, set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY (equal to that of the thread to be inserted).
36              If the granted processors are less than the threads of the application and there are preempted threads with higher user-level priority, hand off the processor to them; } }
37
38      Set the priority of this thread to LOCK_HOLDER_PRIORITY again;
39      Set the priority of this thread's successor (succ_node) in the lock_queue to LOCK_HOLDER_PRIORITY;
40      Set succ_node.locked to false to indicate that that thread is now the lock owner;
41      Restore the user-level priority of this thread using the previously saved value;
42      If the processors granted by the scheduler are less than the threads of the application and there is a preempted thread with higher user-level priority than this thread, hand off the processor to the higher user-level priority thread;
43  }

Figure 1. Informing MCS Lock pseudocode
If there was a node in the wait-queue prior to the insertion, it is examined whether that node is the lock-holder or not. If the predecessor is the lock-holder, the user-level priority of the current thread is set equal to LOCK_IMMEDIATE_WAITER_PRIORITY and the node is inserted in the queue after the predecessor (predecessor.next = my_node, lines 10-13). If this is not the case, the thread places its node in the queue after the predecessor and tries to atomically set its user-level priority to LOCK_WAITER_PRIORITY (lines 14-17). In any case, after the priority assignment, the thread polls the value of the locked field until the latter is found false. In the meantime, the thread may hand off its processor to other threads, should that be necessary (lines 18-20). When the thread is eventually granted the lock, it increases its user-level priority to LOCK_HOLDER_PRIORITY. If there is a successor node in the queue, the user-level priority of the corresponding thread is set equal to LOCK_IMMEDIATE_WAITER_PRIORITY.
During the lock release, the thread examines whether there is a successor node in the wait-queue (line 28). If there is not, the thread tries to atomically (compare_and_swap) set the tail of the queue, which currently points to its node, to NULL (line 29). If the atomic instruction is successful, the user-level priority of the thread is restored using the value saved in previous_priority. The thread then checks whether there are preempted threads of higher user-level priority and, if this is the case, it hands off the processor to one of them (lines 30-33). A failure of the atomic instruction (line 29) indicates the existence of a thread which has inserted its node at the tail of the wait-queue, but has not yet updated the next field of its predecessor. The lock-holder decreases its user-level priority to LOCK_IMMEDIATE_WAITER_PRIORITY and hands off its processor, in order to give the thread trying to enter the queue the opportunity to execute and update the next field of the lock-holder's node (lines 34-36). If (or when) the lock-holder has a successor node, it restores its user-level priority to LOCK_HOLDER_PRIORITY and sets the priority of the successor to the same value (lines 38-39). Consequently, the locked field of the successor node is set to false, in order to grant the lock to the corresponding thread (line 40). Finally, the previous lock-holder restores its user-level priority using the value saved in previous_priority and hands off its processor in favor of another thread, should that be necessary (lines 41-42).
The use of an atomic instruction in line 16 helps eliminate a potential race hazard which would occur if the predecessor thread was granted the lock after the check. Should that happen, the predecessor would attempt to set the user-level priority of this thread to LOCK_IMMEDIATE_WAITER_PRIORITY. The next field of the predecessor must be updated prior to decreasing the priority (line 15). Doing the opposite could result in complications if the thread is preempted before updating the next field of the predecessor. If the predecessor thread is the lock-holder, its next field can be updated after the priority assignment, as the new priority of this thread (LOCK_IMMEDIATE_WAITER_PRIORITY) will be high enough to avoid deadlock. This raises the need for an atomic instruction for the priority assignment.
Failing to increase the priority of the lock-holder in the lock-release phase (line 38) could result in undesirable complications (even deadlock, if only one physical processor
has been granted to the application), if the lock-holder is preempted after the increase of the user-level priority of its successor. In such a case, the current lock-holder would not be given the opportunity to set the locked field of its successor to false, in order to allow the latter to enter the critical section.

5 Experimental Evaluation

In this section, we present an evaluation of the performance of informing synchronization algorithms. We are particularly concerned with their behaviour under multiprogramming. The comparison involves informing algorithms and their simple counterparts.
We have carried out our evaluation on a Compaq Proliant 5500 system. It is equipped with 4 Pentium Pro processors clocked at 200 MHz, with 512 KBytes of L2 cache each. The main memory is 512 MBytes. For the purposes of our evaluation we have used both synthetic microbenchmarks and real computational kernels and applications from the Splash-2 benchmark suite [15]. The workloads consist of one or more identical instances of a microbenchmark or an application. Each instance requires 4 processors and the number of concurrently executing instances is equal to the desired degree of multiprogramming. The range of multiprogramming degrees we have experimented with spans from 1 to 8.
The lock microbenchmarks consist of a main loop with alternating synchronizing and working phases. Only one thread can be in the synchronizing phase at any time snapshot. The synchronizing and working phases consist of 1000 and 1500 memory updates (10% locally cached) respectively. The length of the main loop is 4096 iterations, which are evenly distributed among the processors. The microbenchmark is executed 32 times and the reported time is the mean value of all executions. The barrier microbenchmarks are identical to the lock microbenchmarks, with the difference that the main loop consists of the working phase, described earlier, followed by a barrier. The barrier is implemented using the algorithm under evaluation.
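The structure of one thread's share of the lock microbenchmark can be sketched as follows; the lock interface and all names are assumptions, and the cache-locality mix of the updates is not reproduced:

    #define ITERATIONS   4096          /* total loop iterations, split across threads */
    #define SYNC_UPDATES 1000          /* memory updates inside the critical section  */
    #define WORK_UPDATES 1500          /* memory updates outside the critical section */

    static volatile long shared_buf[SYNC_UPDATES];

    /* The lock/unlock function pointers stand in for whichever algorithm of the
     * run-time library is under test. */
    static void lock_microbenchmark(int my_id, int nthreads,
                                    void (*lock)(void), void (*unlock)(void))
    {
        volatile long local_buf[WORK_UPDATES] = {0};   /* thread-private working set */

        for (int i = my_id; i < ITERATIONS; i += nthreads) {
            lock();                                    /* synchronizing phase */
            for (int j = 0; j < SYNC_UPDATES; j++)
                shared_buf[j]++;
            unlock();

            for (int j = 0; j < WORK_UPDATES; j++)     /* working phase */
                local_buf[j]++;
        }
    }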
In order to demonstrate that the performance gains achieved by our informing synchronization algorithms can be considerable in real applications as well, we have used two computational kernels (LU, Cholesky) and one application (Radiosity) from the Splash-2 benchmark suite.
The LU kernel factors a dense matrix into the product of a lower and an upper triangular matrix. Blocking techniques have been used in order to exploit temporal locality and reduce communication. Approximately 20 to 50 percent of the execution time is spent on barrier synchronization operations [15]. We have decomposed a 1024x1024 matrix, using 16x16 blocks. The Cholesky kernel implements blocked, sparse Cholesky factorization. It factors
[Figure 2. Execution times of Splash-2 applications using lock (Cholesky, Radiosity) and barrier (LU) algorithms. Each panel plots execution time (seconds) against the multiprogramming degree for the Kernel, Spin and Block variants of the MCS, Queue, TTAS, Ticket and Incremental algorithms.]
a sparse matrix into the product of a lower triangular matrix and its transpose. It is not globally synchronized between steps. The synchronization is achieved using locks and its overhead ranges between approximately 10 and 70 percent of the total execution time [15]. We have used the standard tk29.O input file, distributed with Cholesky. The Radiosity application computes the equilibrium distribution of light in a scene, using the iterative hierarchical diffuse radiosity method. The structure of the computation is highly irregular. The synchronization operations (locks) contribute approximately 15 to 30 percent to the execution time [15]. We have executed Radiosity in batch mode for the standard room scene. The results attained from the multiprogrammed execution of the Splash-2 computational kernels and applications using informing (Kernel), active spinning (Spin) and immediate blocking (Block) synchronization algorithms are depicted in figure 2.
The microbenchmark results (not shown) indicate that,
for the target architecture, simple algorithms (TTAS Lock and Incremental Barrier) significantly outperform more complex ones in the presence of multiprogramming. For non-informing synchronization algorithms this can be mainly attributed to the fact that simple algorithms do not impose a strict ordering on the synchronization operations executed by different threads, thus they are less sensitive to inopportune scheduling decisions. In the case of informing algorithms, the execution time of the longer instruction sequences required by more complex algorithms contributes a dominant fraction of the total execution time, thus resulting in inferior performance in comparison to simple algorithms. The performance of complex and simple algorithms on a dedicated machine is almost indistinguishable. However, given the fact that complex algorithms are designed for scalability on many processors, we expect them to perform better than their simple counterparts on a larger, dedicated machine. In all cases but the TTAS Lock with multiprogramming degree 2, informing algorithms outperform scheduler-oblivious ones under multiprogramming. Our informing algorithms for locks execute up to 348.4 times faster (on average 177.6 times faster) in comparison to spinning algorithms and up to 8 times faster (on average 5.2 times faster) in comparison to immediate blocking ones. The speedup attained for informing barrier algorithms when compared to spinning ones is up to 1356.4 (average 512.4). The comparison with immediate blocking algorithms leads to speedups of up to 10.5 (average 4.8). These performance gains are indicative of the correctness of our approach. We believe that, in the presence of multiprogramming degree 2, the always-spinning version of the TTAS Lock performs better (0.35 sec / 16.7%) than the informing TTAS Lock because the contention is not high enough for the benefits of our mechanisms to outweigh the extra cost of additional instructions, context switches and cache misses due to thread migrations. Moreover, the informing TTAS Lock cannot fully benefit from our mechanisms, due to its nondeterministic nature.

On a dedicated machine, the microbenchmarks that use informing algorithms may need up to 0.07 seconds (6.89%) more to execute than those implemented with simple always-spinning algorithms. On an unloaded machine, informing synchronization algorithms degenerate to always-spinning algorithms with the cost of some additional instructions in the critical path. These additional instructions might be considered responsible for the, even minimal, slowdown. It is also worth pointing out that for low or no multiprogramming, the execution times of microbenchmarks using immediate blocking algorithms are worse than those of microbenchmarks using always-spinning and informing synchronization algorithms. In the presence of low or no contention, the unnecessary reschedulings and context switches caused by immediate blocking algorithms have a negative effect on performance. The problem is practically eliminated with informing algorithms because of the optimal spin vs. block decision these algorithms reach. However, as the multiprogramming degree rises, the highest possible processor utilization is of vital importance and immediate blocking algorithms outperform always-spinning (but not informing) ones.
The results from the Splash-2 experiments indicate that the performance benefits attained by informing synchronization algorithms are quite sound in real applications as well. In the absence of multiprogramming, informing synchronization algorithms have practically indistinguishable performance from scheduler-oblivious ones. However, in the presence of even minimal multiprogramming, informing algorithms perform much better. The performance of Cholesky (lock-based synchronization) is up to 36.8% (on average 16.7%) higher when informing lock algorithms are used instead of always-spinning ones. The comparison with immediate blocking lock algorithms yields maximum and average improvements of 20% and 11% respectively. Radiosity, which also uses lock-based synchronization, executes up to 26.8% faster (on average 10.7%) if informing lock algorithms are used instead of always-spinning ones, and up to 4.7 times faster (on average 2.5 times faster) in comparison with immediate blocking algorithms. In LU, threads synchronize using barriers. If informing barriers are used instead of always-spinning ones, LU needs up to 6.5 times less execution time (on average 4.2 times less). An immediate blocking implementation results in performance up to 58.8% (on average 31.9%) worse than that of an informing implementation.
Due to differences in our platform and implementation framework, a direct one-to-one comparison of informing synchronization algorithms with the scheduler-conscious algorithms presented in [6] has not been possible. However, an indirect, qualitative comparison of the results shows that informing synchronization algorithms provide improvements of similar or wider margin than those provided by scheduler-conscious synchronization. This can be attributed to the fact that informing algorithms control thread scheduling more effectively. The scheduler-conscious algorithms of Kontothanassis et al. try to avoid preemption of the lock-holder at the expense of all threads in the system. In order to eliminate the chance of malicious exploitation of this feature, the preemption is avoided within short time-windows only. This makes scheduler-conscious algorithms prone to inefficiencies if the critical regions are not short enough. Informing algorithms favor the lock-holder at the expense of other threads of the same application only. This technique averts the danger of malicious exploitation, so the need for a time-window is lifted. Moreover, informing algorithms preserve, as opposed to scheduler-conscious ones, the fairness characteristics of synchronization algorithms, thus reducing the possibility of starvation. Finally, the optimality
of the spin vs. block decision provided by informing algorithms can have a significant impact on performance, especially for barrier algorithms.
6 Conclusions
In this paper we have presented a generally applicable kernel-level infrastructure, which can be used to significantly improve the performance of applications in the presence of multiprogramming. Our mechanisms allow applications to use the processor time they are granted by the OS scheduler more effectively. This is achieved by giving applications the opportunity to inform the scheduler of the relative importance of their threads, by assigning them priorities from user-level. These priorities are taken into account by the OS scheduler in order to decide which specific thread of an application will execute on a processor. The priority control mechanisms are non-intrusive, in the sense that they do not allow applications to improve their performance at the expense of other applications in the system. The kernel infrastructure is general and does not depend on the nature of the application, thus it can be used by many different algorithms. As a proof of concept, we have implemented six different informing synchronization algorithms, i.e. algorithms that use the common kernel mechanisms. We have presented one of them in detail.
In order to evaluate the performance of informing synchronization algorithms under multiprogramming, we have experimented with synthetic microbenchmarks and with computational kernels and applications from the Splash-2 suite. Informing synchronization algorithms perform significantly better than non-informing ones in multiprogrammed environments and have practically negligible overhead in the absence of multiprogramming. Moreover, the comparative evaluation of non-informing algorithms has driven us to the conclusion that simple, immediate blocking synchronization algorithms, like the TTAS Lock or the Incremental Barrier, perform quite well on small-scale machines, even in the presence of multiprogramming. However, even these simple algorithms can benefit from our kernel mechanisms.

We believe that the usage of an infrastructure which allows non-intrusive user-level priority control is not necessarily limited to small-scale SMPs. In any case, the applicability of such mechanisms in larger-scale systems requires thorough investigation, given the fact that such systems generally have different architectural characteristics and may impose significantly higher thread migration and remote memory access costs.
References
[1] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. Levy. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems, 10(1):53–79, February 1992.
[2] C. D. Antonopoulos, D. S. Nikolopoulos, and T. S. Papatheodorou. Informing algorithms for efficient scheduling of synchronizing threads on multiprogrammed SMPs. Technical Report HPCLAB-TR-220101, High Performance Information Systems Laboratory, University of Patras, Patras, Greece, January 2001.
[3] D. Black. Scheduling support for concurrency and parallelism in the Mach operating system. IEEE Computer, 23(5), 1990.
[4] A. Gupta, A. Tucker, and S. Urushibara. The impact of operating system scheduling policies and synchronization methods on the performance of parallel applications. In Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 120–132, 1991.
[5] A. Karlin, K. Li, M. Manasse, and S. Owicki. Empirical studies of competitive spinning for a shared-memory multiprocessor. In Proceedings of the 13th ACM Symposium on Operating System Principles (SOSP '91), pages 41–55, October 1991.
[6] L. I. Kontothanassis, R. W. Wisniewski, and M. L. Scott. Scheduler-conscious synchronization. ACM Transactions on Computer Systems, 15(1):3–40, February 1997.
[7] B.-H. Lim and A. Agarwal. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Transactions on Computer Systems, 11(3):253–294, August 1993.
[8] B. D. Marsh, M. L. Scott, T. J. LeBlanc, and E. P. Markatos. First-class user-level threads. In 13th ACM Symposium on Operating Systems Principles, October 1991.
[9] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems, 9(1):21–65, 1991.
[10] M. M. Michael and M. L. Scott. Relative performance of preemption-safe locking and non-blocking synchronization on multiprogrammed shared memory multiprocessors. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
[11] D. S. Nikolopoulos, C. D. Antonopoulos, I. E. Venetis, P. E. Hadjidoukas, E. D. Polychronopoulos, and T. S. Papatheodorou. Achieving multiprogramming scalability of parallel programs on Intel SMP platforms: Nanothreading in the Linux kernel. In Proceedings of the PARCO 99 Conference, Delft, Netherlands, June 1999.
[12] C. Polychronopoulos, N. Bitar, and S. Kleiman. Nanothreads: A user-level threads architecture. Technical Report CSRD-1297, CSRD, University of Illinois at Urbana-Champaign, 1993.
[13] SGI. Topics In IRIX Programming, Chapter 4. http://techpubs.sgi.com.
[14] Sun Microsystems. Solaris 8 Reference Manual Collection, Section 3: Threads and Realtime Library Functions, March 2000. http://docs.sun.com.
[15] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA '95), pages 24–36, June 1995.