GPUs - Graphics Processing Units
Minh Tri Do Dinh
-Dinh@
Vertiefungsseminar Architektur von Prozessoren, SS 2008
Institute of Computer Science, University of Innsbruck
July 7, 2008
This paper explores GPUs: their architecture and underlying design principles, using chips from
Nvidia's "Geforce" series as examples.
1 Introduction
Before we dive into the architectural details of some example GPUs, we'll have a look at some basic
concepts of graphics processing and 3D graphics, which will make it easier for us to understand the
functionality of GPUs.
1.1 What is a GPU?
A GPU (Graphics Processing Unit) is essentially a dedicated hardware device that is responsible for
translating data into a 2D image. In this paper, we will focus on 3D graphics, since that is what
modern GPUs are mainly designed for.
1.2 The anatomy of a 3D scene
Figure 1: A 3D scene
3D scene: A collection of 3D objects and lights.
Figure 2: Object, triangle and vertices
3D objects: Arbitrary objects whose geometry consists of polygons (typically triangles). Polygons are
composed of vertices.
Vertex: A point with spatial coordinates and other information such as color and texture coordinates.
Figure 3: A cube with a checkerboard texture
Texture: An image that is mapped onto the surface of a 3D object, which creates the illusion of an
object consisting of a certain material. The vertices of an object store the so-called texture
coordinates (2-dimensional vectors) that specify how a texture is mapped onto any given surface.
Figure 4: Texture coordinates of a triangle with a brick texture
In order to translate such a 3D scene to a 2D image, the data has to go through several stages of a
"Graphics Pipeline".
1.3 The Graphics Pipeline
Figure 5: The 3D Graphics Pipeline
First, among some other operations, we have to translate the data that is provided by the application
from 3D to 2D.
1.3.1 Geometry Stage
This stage is also referred to as the "Transform and Lighting" stage. In order to translate the scene
from 3D to 2D, all the objects of a scene need to be transformed to various spaces, each with its own
coordinate system. These transformations are applied on a vertex-by-vertex basis.
Mathematical Principles
A point in 3D space usually has 3 coordinates. If we keep using 3-dimensional vectors for the
transformation calculations, we run into the problem that different transformations require different
operations (e.g. translating a vertex requires addition with a vector while rotating a vertex requires
multiplication with a 3x3 matrix). We circumvent this problem simply by extending the 3-dimensional
vector by another coordinate (the w-coordinate), which gives us so-called homogeneous coordinates.
This way, every transformation can be applied by multiplying the vector with a specific 4x4 matrix,
making calculations much easier.
Figure 6: Transformation matrices for translation, rotation and scaling
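The benefit of homogeneous coordinates can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not code from any graphics API: the helper names (`translation`, `scaling`, `rotation_z`, `mat_mul`, `mat_vec`) are made up for this example, matrices are stored as lists of rows, and vertices are treated as column vectors.

```python
import math


def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) with a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def mat_mul(a, b):
    """Multiply two 4x4 matrices; the result applies b first, then a."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def scaling(sx, sy, sz):
    return [[sx, 0, 0, 0],
            [0, sy, 0, 0],
            [0, 0, sz, 0],
            [0, 0, 0, 1]]


def rotation_z(angle):
    """Rotation about the z-axis, angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0],
            [s, c, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1]]


# A vertex at (1, 0, 0); the w-coordinate is 1 for points.
vertex = [1.0, 0.0, 0.0, 1.0]

# Scale by 2, rotate by 90 degrees about z, then translate by (5, 0, 0).
# All three steps collapse into one 4x4 matrix.
model = mat_mul(translation(5, 0, 0),
                mat_mul(rotation_z(math.pi / 2), scaling(2, 2, 2)))
transformed = mat_vec(model, vertex)
```

Because translation is now a matrix as well, a whole chain of transformations can be multiplied together once and then applied to every vertex with a single matrix-vector product.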
Lighting, the other major part of this pipeline stage, is calculated using the normal vectors of the
surfaces of an object. In combination with the position of the camera and the position of the light
source, one can compute the lighting properties of a given vertex.
Figure 7: Calculating lighting
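To make the role of the three inputs (surface normal, light position, camera position) more concrete, here is a small Python sketch. The text does not name a particular lighting model, so the choice of a Lambert diffuse term plus a Phong-style specular term is our assumption, as are the function name `vertex_lighting` and the `shininess` parameter.

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [c / length for c in v]


def vertex_lighting(position, normal, light_pos, camera_pos, shininess=16.0):
    """Return (diffuse, specular) intensities in [0, 1] for one vertex.

    Assumed model: Lambert diffuse + Phong specular (not taken from the paper).
    """
    n = normalize(normal)
    to_light = normalize(sub(light_pos, position))
    to_camera = normalize(sub(camera_pos, position))

    n_dot_l = dot(n, to_light)
    diffuse = max(n_dot_l, 0.0)

    # Mirror the light direction at the normal: r = 2 (n . l) n - l
    reflected = [2.0 * n_dot_l * n[i] - to_light[i] for i in range(3)]
    if diffuse > 0.0:
        specular = max(dot(reflected, to_camera), 0.0) ** shininess
    else:
        specular = 0.0  # the surface faces away from the light
    return diffuse, specular


# Light and camera directly above a surface that faces up the z-axis.
head_on = vertex_lighting([0, 0, 0], [0, 0, 1], [0, 0, 5], [0, 0, 5])
# Light exactly in the surface plane: no light reaches the vertex.
grazing = vertex_lighting([0, 0, 0], [0, 0, 1], [5, 0, 0], [0, 0, 5])
```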
For transformation, we start out in the model space where each object (model) has its own coordinate
system, which facilitates geometric transformations such as translation, rotation and scaling.
After that, we move on to the world space, where all objects within the scene have a unified
coordinate system.
Figure 8: World space coordinates
The next step is the transformation into view space, which locates a camera in the world space and
then transforms the scene, such that the camera is at the origin of the view space, looking straight
along the z-axis. Now we can define a view volume, the so-called view frustum, which will be used to
decide what actually is visible and needs to be rendered.
Figure 9: The camera/eye, the view frustum and its clipping planes
After that, the vertices are transformed into clip space and assembled into primitives (triangles or
lines). Objects that are outside of the frustum don't need to be rendered and can be discarded,
objects that are partially inside the frustum need to be clipped (hence the name), and new vertices
with proper texture and color coordinates need to be created.
A perspective divide is then performed, which transforms the frustum into a cube with normalized
coordinates (x and y between -1 and 1, z between 0 and 1) while the objects inside the frustum are
scaled accordingly. Having this normalized cube facilitates clipping operations and sets up the
projection into 2D space (the cube simply needs to be "flattened").
Figure 10: Transforming into clip space
Finally, we can move into screen space where x and y coordinates are transformed for proper 2D
display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth
operations.)
Figure 11: From view space to screen space
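The last two steps can be sketched as follows. This is a simplified illustration, not the code of any particular API: the function names are ours, and we assume a window whose origin is in the top-left corner with y pointing down, which is why the y-coordinate is flipped.

```python
def perspective_divide(clip):
    """Divide x, y and z by the w-coordinate to get normalized coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def inside_view_volume(ndc):
    """Check against the normalized cube: x, y in [-1, 1], z in [0, 1]."""
    x, y, z = ndc
    return -1.0 <= x <= 1.0 and -1.0 <= y <= 1.0 and 0.0 <= z <= 1.0


def to_screen(ndc, width, height):
    """Map normalized x, y to window coordinates; z is kept for depth tests."""
    x, y, z = ndc
    screen_x = (x + 1.0) * 0.5 * width
    screen_y = (1.0 - y) * 0.5 * height  # assumption: window origin is top-left
    return (screen_x, screen_y, z)


clip_vertex = (2.0, 2.0, 1.0, 4.0)
ndc = perspective_divide(clip_vertex)
screen = to_screen(ndc, 800, 600)
```

"Flattening" the cube then simply means ignoring z for the position on screen, while keeping it around for the later depth test.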
Note that the texture coordinates need to be transformed as well. Besides clipping, surfaces that
aren't visible (e.g. the back side of a cube) are removed too (so-called back face culling).
The result is a 2D image of the 3D scene, and we can move on to the next stage.
1.3.2 Rasterization Stage
The rasterizer needs to traverse the 2D image and convert the data into a number of
"pixel-candidates", so-called fragments, which may later become pixels of the final image. A fragment
is a data structure that contains attributes such as position, color, depth, texture coordinates,
etc. Fragments are generated by checking which part of any given primitive intersects with which
pixel of the screen. If a fragment intersects with a primitive, but not any of its vertices, the
attributes of that fragment have to be additionally calculated by interpolating the attributes
between the vertices.
Figure 12: Rasterizing a triangle and interpolating its color values
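A naive software version of this step might look like the sketch below. It is only meant to illustrate the principle: real rasterizers are far more elaborate, and the use of barycentric weights from edge functions, the sampling at pixel centers and all names are our assumptions.

```python
def edge(a, b, p):
    """Twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Generate fragments ((x, y), color) for one triangle.

    v0..v2 are 2D screen positions, c0..c2 the RGB colors at those vertices.
    """
    area = edge(v0, v1, v2)
    fragments = []
    if area == 0:
        return fragments  # degenerate triangle covers no pixel
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            # Barycentric weights: how close p is to each vertex.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:  # p lies inside the triangle
                color = tuple(w0 * c0[i] + w1 * c1[i] + w2 * c2[i]
                              for i in range(3))
                fragments.append(((x, y), color))
    return fragments


red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
fragments = rasterize((0, 0), (4, 0), (0, 4), red, green, blue, 4, 4)
```

The same weights can be used to interpolate any other vertex attribute, such as depth or texture coordinates.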
After that, further steps can be made to obtain the final pixels. Colors are calculated by combining
textures with other attributes such as color and lighting, or by combining a fragment with either
another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).
Visibility checks are performed, such as:
• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (checking visibility against translucent fragments)
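Two of these checks, the scissor test and the depth test, are easy to sketch in Python. The data layout (fragments as `(x, y, z, color)` tuples, a depth buffer cleared to 1.0 for the far plane) and the function names are assumptions made for this example.

```python
def scissor_test(x, y, rect):
    """Pass if the fragment lies inside the rectangle (x0, y0, x1, y1).

    x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1


def resolve_fragments(fragments, width, height, scissor):
    """Keep, per pixel, the color of the nearest fragment that passes the scissor test."""
    depth_buffer = [[1.0] * width for _ in range(height)]  # cleared to the far plane
    color_buffer = [[None] * width for _ in range(height)]
    for x, y, z, color in fragments:
        if not scissor_test(x, y, scissor):
            continue  # outside the rectangular mask
        if z < depth_buffer[y][x]:  # depth test: a smaller z is closer
            depth_buffer[y][x] = z
            color_buffer[y][x] = color
    return color_buffer


fragments = [
    (1, 1, 0.8, "red"),    # far fragment
    (1, 1, 0.3, "green"),  # nearer fragment at the same pixel: wins
    (1, 1, 0.5, "blue"),   # arrives later but is further away: discarded
    (3, 3, 0.1, "white"),  # outside the scissor rectangle: discarded
]
image = resolve_fragments(fragments, 4, 4, scissor=(0, 0, 3, 3))
```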
Additional procedures like anti-aliasing can be applied before we achieve the final result: a number
of pixels that can be written into memory for later display.
This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea
of what kind of functionality will be required of a GPU.
2 Evolution of the GPU
Some historical key points in the development of the GPU:
• Efforts for real time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community
  ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphic chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for
  implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985 The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991 S3 introduced the first single chip 2D-accelerator, the S3 86C911
• 1995 Nvidia releases one of the first 3D accelerators, the NV1
• 1999 Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001 Nvidia implements the first programmable shader units with the Geforce 3
• 2005 ATI develops the first GPU with unified shader architecture with the ATI Xenos for the XBox 360
• 2006 Nvidia launches the first unified shader GPU for the PC with the Geforce 8800
3 From Theory to Practice - the Geforce 6800
3.1 Overview
Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using
Nvidia's Geforce 6800 as an example, we will have a closer look at the architecture of modern day
GPUs.
Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs
(besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.
Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of
graphics chipsets and the fourth generation that featured programmability (more on that later).
The following image shows a schematic view of the Geforce 6800 and its functional units.
Figure 13: Schematic view of the Geforce 6800
You can already see how each of the functional units corresponds to a stage of the graphics pipeline.
We start with six parallel vertex processors that receive data from the host (the CPU) and perform
operations such as transformation and lighting.
Next, the output goes into the triangle setup stage which takes care of primitive assembly, culling
and clipping. The Geforce 6800 has an additional Z-cull unit which performs an early fragment
visibility check based on depth, further improving the efficiency.
We then move on to the sixteen fragment processors which operate in 4 parallel units and compute the
output colors of each fragment.
The fragment crossbar is a linking element that is basically responsible for directing output pixels
to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline
stalls.
The 16 pixel engines are the final stage of processing, and perform operations such as alpha
blending, depth tests, etc., before delivering the final pixel to the frame buffer.
3.2 In Detail
Figure 14: A more detailed view of the Geforce 6800
While most parts of the GPU are fixed function units, the vertex and fragment processors of the
Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the
Geforce 3 (2001).
We'll have a more detailed look at the units in the following sections.
3.2.1 Vertex Processor
Figure 15: A vertex processor
The vertex processors are the programmable units responsible for all the vertex transformations and
attribute calculations. They operate with 4-dimensional data vectors corresponding to the
aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits
of a register). Instructions are 123 bits long and are stored in the Instruction RAM.
The data path of a vertex processor consists of:
• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit
Instruction set:
Some notable instructions for the vertex processor include:
dp4 dst, src0, src1     Computes the four-component dot product of the source registers
exp dst, src            Provides the exponential $2^x$
dst dest, src0, src1    Calculates a distance vector
nrm dst, src            Normalizes a 3D vector
rsq dst, src            Computes the reciprocal square root (positive only) of the source scalar
Registers in the vertex processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components
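To get a feeling for what these instructions and register modifiers do, we can emulate a few of them in Python on 4-component lists. This is a rough sketch and not a description of the actual hardware behavior: the Python function names merely mirror the mnemonics, `swizzle` and `write_mask` are names we made up, and leaving the w-component untouched in `nrm` is a simplification of ours.

```python
import math

COMPONENTS = "xyzw"


def dp4(a, b):
    """Four-component dot product."""
    return sum(x * y for x, y in zip(a, b))


def rsq(x):
    """Reciprocal square root of a positive scalar."""
    return 1.0 / math.sqrt(x)


def nrm(v):
    """Normalize the xyz part of a register (w is passed through in this sketch)."""
    scale = rsq(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return [v[0] * scale, v[1] * scale, v[2] * scale, v[3]]


def negate(v):
    """Source modifier: negate every component."""
    return [-c for c in v]


def swizzle(v, pattern):
    """Source modifier: rearrange or replicate components, e.g. 'wzyx' or 'xxxx'."""
    return [v[COMPONENTS.index(name)] for name in pattern]


def write_mask(dst, src, mask):
    """Destination modifier: only the components named in mask are overwritten."""
    return [src[i] if COMPONENTS[i] in mask else dst[i] for i in range(4)]


r0 = [1.0, 2.0, 3.0, 4.0]
reversed_r0 = swizzle(r0, "wzyx")
partial = write_mask([0.0, 0.0, 0.0, 0.0], r0, "xz")
unit = nrm([3.0, 0.0, 4.0, 1.0])
```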
Other technical details:
• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one
  cycle
3.2.2 Fragment Processor
Figure 16: A fragment processor
The 16 fragment processors are grouped into 4 bigger units which operate simultaneously on 4
fragments each (a so-called quad). They can take position, color, depth, fog as well as other
arbitrary 4-dimensional attributes as input.
The data path consists of:
• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit
Superscalarity:
A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes
components of the vector need to be treated separately (e.g. color, alpha). Thus, the fragment
processor supports co-issuing of the data, which means splitting the vector into 2 parts and
executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2
co-issue wasn't possible earlier).
Additionally, it also features dual issue, which means executing different operations on the 2
vector math units in the same clock.
Texture Unit:
The texture unit is a floating-point texture processor which fetches and filters the texture data.
It is connected to a level 1 texture cache (which stores parts of the textures that are used).
Shader units 1 and 2:
Each shader unit on its own is limited in its abilities; the two offer complete functionality when
used together.
Figure 17: Block diagram of Shader Unit 1 and 2
Shader Unit 1:
Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
Red: Interpolators
Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
Cyan: MUL channels
Orange: A unit for texture operations (not the fragment texture unit)
The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special
function, a special function and a texture operation, or 2 MULs.
The output of the special function unit can go into the MUL channels.
The texture operation unit gets input from the MUL unit and does LOD (Level Of Detail) calculations,
before passing the data on to the fragment texture unit. The fragment texture unit then performs the
actual sampling and writes the data into registers for the second shader unit.
The shader unit can simply pass data as well.
Shader Unit 2:
Red: A crossbar
Cyan: 4 MUL channels
Gray: 4 ADD channels
Yellow: 1 special function unit
The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).
The ADD units are additionally connected, allowing advanced operations such as a dot product in one
clock.
Again, the shader unit [...] special function is used, the MAD unit can perform up to 2 operations
from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.
Instruction set:
Some notable instructions for the fragment processor include:
cmp dst, src0, src1, src2                 Chooses src1 if src0 >= 0. Otherwise, chooses src2. The
                                          comparison is done per channel
dsx dst, src                              Computes the rate of change in the render target's
                                          x-direction
dsy dst, src                              Computes the rate of change in the render target's
                                          y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}       Computes sine and cosine, in radians
texld dst, src0, src1                     Samples a texture at a particular sampler, using provided
                                          texture coordinates
Registers in the fragment processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Mask destination register components
Other technical details:
• The fragment processors can perform operations with 16- or 32-bit floating point precision ([...]
  unit uses only 16-bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle / 4 math operations if there's a
  texture fetch in shader 1
Figure 18: Possible operations per cycle
• The fragment processors have a 2 level texture cache
• The fog unit can perform fog blending on the final pass. It is implemented with fixed point
  precision since that's sufficient for fog and saves performance.
  The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets, the pixel processor can output to up to four separate
  buffers (4x4 values, color + depth)
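The fog equation from the list above translates directly into code. The sketch below uses ordinary Python floats per color channel rather than the fixed point arithmetic of the hardware; the function name `fog_blend` is our own.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(fog * fog_fraction + src * (1.0 - fog_fraction)
                 for src, fog in zip(src_color, fog_color))


red = (1.0, 0.0, 0.0)
grey_fog = (0.5, 0.5, 0.5)

no_fog = fog_blend(red, grey_fog, 0.0)      # fragment color unchanged
light_fog = fog_blend(red, grey_fog, 0.25)  # a quarter of the way towards the fog color
full_fog = fog_blend(red, grey_fog, 1.0)    # fragment completely hidden by fog
```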
3.2.3 Pixel Engine
Figure 19: A pixel engine
Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a
specific memory partition. After the lossless color and depth compression, the depth and color units
perform depth, color and stencil operations before writing the final pixel. When activated, the pixel
engines also perform multisample antialiasing.
3.2.4 Memory
From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":
"The memory system is partitioned into up to four independent memory partitions, each
with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules
rather than custom RAM technologies to take advantage of market economies and thereby reduce
cost. Having smaller, independent memory partitions allows the memory subsystem to operate
efficiently regardless of whether large or small blocks of data are transferred. All rendered
surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in
system memory. The four independent memory partitions give the GPU a wide (256 bits),
flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses
at near the 35 GB/sec physical limit."
3.3 Performance
• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction
  operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32
  multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors, the power sources
  are recommended to be no less than 480 W)
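The memory bandwidth figure can be checked with a little arithmetic. The 256-bit bus width and the 550 MHz memory clock are taken from the text; that the memory transfers data twice per clock (double data rate) is an assumption on our side, although it is the only factor that makes the numbers agree.

```python
bus_width_bits = 256        # width of the memory interface (from the memory section)
memory_clock_hz = 550e6     # 550 MHz memory clock
transfers_per_clock = 2     # assumption: double data rate memory

bytes_per_transfer = bus_width_bits // 8  # 32 bytes, the access size quoted above
bandwidth_bytes = bytes_per_transfer * memory_clock_hz * transfers_per_clock
bandwidth_gbyte = bandwidth_bytes / 1e9   # comes out at 35.2 GByte/second
```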
4 Computational Principles
Stream Processing:
Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks. GPUs are very sensitive
to such bottlenecks and therefore need a different architecture; they are essentially special
purpose stream processors.
A stream processor works with streams and kernels: a stream is a set of data, and a kernel is an
operation that is applied to it. In stream processors, every kernel takes one or more streams as
input and outputs one or more streams, while it executes its operations on every single element of
the input streams.
In stream processors you can achieve several levels of parallelism:
• Instruction level parallelism: kernels perform hundreds of instructions on every stream element,
  you achieve parallelism by performing independent instructions in parallel
• Data level parallelism: kernels perform the same instructions on each stream element, you achieve
  parallelism by performing one instruction on many stream elements at a time
• Task level parallelism: have multiple stream processors divide the work from one kernel
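The stream and kernel model can be illustrated with a toy example in Python. The names `run_kernel`, `scale_and_bias` and `add` are invented for this sketch, and the loop runs sequentially; the point is only that a kernel sees one element of each input stream at a time and never depends on its neighbours, which is what makes data level parallelism possible.

```python
def run_kernel(kernel, *streams):
    """Apply a kernel to every element of the input streams, yielding an output stream."""
    return [kernel(*elements) for elements in zip(*streams)]


def scale_and_bias(x):
    """A kernel with one input stream."""
    return 2.0 * x + 1.0


def add(a, b):
    """A kernel with two input streams."""
    return a + b


positions = [0.0, 1.0, 2.0, 3.0]
offsets = [10.0, 20.0, 30.0, 40.0]

# Kernels can be chained: the output stream of one is the input stream of the next.
scaled = run_kernel(scale_and_bias, positions)
result = run_kernel(add, scaled, offsets)
```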
Stream processors do not use caching the same way traditional processors do since the input data
sets are usually much larger than most caches and the data is barely reused - with GPUs for example
the data is usually rendered and then discarded.
We know GPUs have to work with large amounts of data; the computations are simpler but they need to
be fast and parallel, so it becomes clear that the stream processor architecture is very well suited
for GPUs.
Continuing these ideas, GPUs employ the following strategies to increase output:
Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each
perform a single task. GPUs are pipelined, which means that instead of performing complete
processing of a pixel before moving on to the next, you fill the pipeline like an assembly line
where each component performs a task on the data before passing it on to the next stage. So while
processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per
clock since you fill up the whole pipe.
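The effect of pipelining on throughput can be estimated with a simple model. We assume, purely for illustration, that every stage takes exactly one clock cycle and that the pipeline never stalls; the function names are ours.

```python
def cycles_unpipelined(items, stages):
    """Each item is processed completely before the next one starts."""
    return items * stages


def cycles_pipelined(items, stages):
    """After `stages` cycles the pipe is full and one item completes per cycle."""
    if items == 0:
        return 0
    return stages + items - 1


pixels = 1000
stages = 5
slow = cycles_unpipelined(pixels, stages)  # 5000 cycles
fast = cycles_pipelined(pixels, stages)    # 1004 cycles
```

For long streams of pixels the fill-up time of the pipe hardly matters, so the throughput approaches one pixel per clock.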
Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel
basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism. You
could have an unlimited amount of pipelines next to each other, as long as the CPU is able to keep
them busy.
Other GPU characteristics:
• GPUs can afford large amounts of floating point computational power since they have lower control
  overhead
• They use dedicated functional units for specialized tasks to increase speeds
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage,
  employing strategies like data compression, multiple threads to cope with latency, scheduling of
  DRAM cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data. Rather than
  implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume
  a 90% hit rate with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be
  designed accordingly
4.1 The Geforce 6800 as a general processor
You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high
memory bandwidth that can be used for other applications as well.
Figure 20: A general view of the Geforce 6800 architecture
Looking at the GPU that way, we get:
• 2 serially running programmable blocks with fp32 capability.
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one
  data-"point" to multiple "fragments").
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar
  floating-point values at a time.
• Several possibilities to control the data flow by using the visibility checks of the pixel engines
  or the Z-cull unit
5 The next step: the Geforce 8800
After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia
introduced the Geforce 8800 in 2006. Driven by the desire to increase performance, improve image
quality and facilitate programming, the Geforce 8800 presented a significant evolution of past
designs: a unified shader architecture (note that ATI already used this architecture in 2005 with
the XBOX 360 GPU).
Figure 21: From dedicated to unified architecture
Figure 22: A schematic view of the Geforce 8800
The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the
different shader stages become one single stage that can handle all the different shaders.
As you can see in Figure 22, instead of different dedicated units we now have a single streaming
processor array. We still have familiar units such as the raster operators (blue, at the bottom)
and the triangle setup [...]. Besides these units we now have several managing units that prepare
and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input
assembler and thread processor).
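Figure 23: The streaming processor array
The streaming processor array consists of 8 texture processor clusters. Each texture processor
cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.
A streaming multiprocessor contains 8 streaming processors. The streaming processors work with
32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making
a vector architecture more inefficient. They are driven by a high-speed clock that is separate from
the core clock. Each multiprocessor can have 768 hardware scheduled threads, grouped together into
24 SIMD "warps" (a warp is a group of 32 threads).
The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture
prefetching and filtering without consuming ALU resources, further increasing efficiency.
The unified design offers a number of advantages. For example, the old problem of constantly
changing workload and one shader stage becoming a processing bottleneck is solved since the units
can adapt dynamically, now that they are unified.
Figure 24: Workload balancing with both architectures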
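A toy model shows why this matters. We assume, as a gross simplification of our own, that work can be measured in abstract units, that every shader unit processes one unit of work per time step, and that a unified unit is just as fast as a dedicated one. The unit counts and workloads below are made-up numbers.

```python
def frame_time_dedicated(vertex_work, pixel_work, vertex_units, pixel_units):
    """With dedicated units the slower stage determines the frame time."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified units can be assigned to whichever kind of work is pending."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical chip: 4 vertex units + 8 pixel units versus 12 unified units.
vertex_heavy_dedicated = frame_time_dedicated(120, 24, 4, 8)  # vertex stage is the bottleneck
vertex_heavy_unified = frame_time_unified(120, 24, 12)

pixel_heavy_dedicated = frame_time_dedicated(8, 160, 4, 8)    # pixel stage is the bottleneck
pixel_heavy_unified = frame_time_unified(8, 160, 12)
```

In both scenarios the dedicated design leaves one kind of unit mostly idle, while the unified design keeps all twelve units busy.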
With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the
support of new data types (integer calculations), programming the GPU now becomes easier as well.
6 General Purpose Programming on the GPU - an example
We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a
GPU.
Bitonic merge sort:
Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting
them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.
E.g.:
List A = (3, 4, 7, 8)
List B = (6, 5, 2, 1)
List AB = (3, 4, 7, 8, 6, 5, 2, 1) is a bitonic list
List BA = (6, 5, 2, 1, 3, 4, 7, 8) is also a bitonic list
Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic
list of length 2n. You can rearrange the list so that you get two halves with n elements where each
element (i) of the first half is less than or equal to each corresponding element (i+n) in the
second half (or greater than or equal, if the list descends first and then ascends). This happens by
comparing the corresponding elements and swapping them if necessary. This procedure is called a
bitonic merge.
Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their
size until the complete list is sorted. Figure 25 illustrates the process:
Figure 25: The different stages of the algorithm
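The merge and the sort described above can be written down as a short sequential Python sketch. It is a plain CPU version for illustration only, not the GPU implementation discussed later; the function names are ours, and the list length is assumed to be a power of two.

```python
def bitonic_merge(items, ascending=True):
    """Sort a bitonic list by repeated compare-and-swap of elements i and i+n."""
    count = len(items)
    if count <= 1:
        return list(items)
    half = count // 2
    low, high = list(items[:half]), list(items[half:])
    for i in range(half):
        # Swap so that low[i] <= high[i] (ascending) or low[i] >= high[i] (descending).
        if (low[i] > high[i]) == ascending:
            low[i], high[i] = high[i], low[i]
    # Both halves are bitonic again and can be merged independently.
    return bitonic_merge(low, ascending) + bitonic_merge(high, ascending)


def bitonic_sort(items, ascending=True):
    """Sort a list whose length is a power of two (assumption of this sketch)."""
    count = len(items)
    if count <= 1:
        return list(items)
    half = count // 2
    # Build a bitonic list: first half ascending, second half descending.
    first = bitonic_sort(items[:half], True)
    second = bitonic_sort(items[half:], False)
    return bitonic_merge(first + second, ascending)


list_ab = [3, 4, 7, 8, 6, 5, 2, 1]             # the bitonic list AB from above
merged = bitonic_merge(list_ab)
unsorted_keys = [5, 1, 4, 2, 8, 7, 3, 6]
sorted_keys = bitonic_sort(unsorted_keys)
```

Note that the sequence of comparisons depends only on the length of the list and never on the data, which is what makes the algorithm attractive for parallel hardware.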
The sorting process [...] results in bitonic merge sort having a complexity of
$O(n \log^2(n) + \log(n))$, which is worse than quicksort, but the algorithm has no worst-case
scenario (where quicksort reaches $O(n^2)$).
[...] the operations can be performed in parallel and the length stays constant [...]. When
implementing this algorithm on the GPU, we want to make use of as many resources as possible (both
in parallel as well as vertically along the pipeline), especially considering that the GPU has
shortcomings as well, such as the lack of support [...]. For example, simply letting the fragment
processor stage handle all the calculations might work, [...]. A possible solution looks like this:
In this algorithm, we have groups of elements (fragments) that have the same sorting conditions,
while adjacent groups operate in the opposite way. If we draw a vertex quad over two adjacent groups
and set appropriate flags at each corner, the rasterizer tells every fragment which group it belongs
to. For example, if we set the left corners to -1 and the right corners to +1, we can check where a
fragment belongs to by simply looking at its sign (the interpolation process takes care of that).
Figure 26: Using vertex flags and the interpolator to determine compare operations
Next, we need to determine which compare operation to use and we need to locate the partner item to
compare with. Both can again be accomplished by using the flag value. Setting the compare operation
to less-than and multiplying with the flag value implicitly flips the operation to greater-equal
halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with
an absolute value that specifies the distance between the items.
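The flag trick can be simulated on the CPU. In the following sketch the rasterizer's interpolation is replaced by a linear interpolation evaluated at each fragment's center; the function names are ours. Note one deliberate assumption: we put +1 on the left corners and -1 on the right corners (the opposite of the example in the text), because then the very same sign both points to the partner and turns the less-than comparison into a "keep the smaller / keep the larger" decision.

```python
def interpolate(left, right, index, count):
    """Linearly interpolate a corner value at the center of fragment `index`."""
    t = (index + 0.5) / count
    return left + (right - left) * t


def quad_pass(data, start, distance):
    """One compare pass over two adjacent groups of `distance` elements each."""
    count = 2 * distance
    result = list(data)
    for i in range(count):
        # Flag is +1 at the left corners and -1 at the right corners of the quad.
        flag = interpolate(+1.0, -1.0, i, count)
        sign = 1 if flag > 0.0 else -1

        own = data[start + i]
        # The sign of the flag times the distance locates the partner item.
        partner = data[start + i + sign * distance]

        # Multiplying both keys with the sign flips less-than into greater-than
        # halfway across the quad: the left half keeps the smaller element,
        # the right half keeps the larger one.
        if sign * own < sign * partner:
            result[start + i] = own
        else:
            result[start + i] = partner
    return result


bitonic = [3, 4, 7, 8, 6, 5, 2, 1]
after_pass = quad_pass(bitonic, 0, 4)
```

Every fragment only reads its own item and its partner and writes a single value, so all fragments of a pass can be processed independently, which suits the fragment processors well.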
In order to sort elements of an array, we store them in a 2D texture.
Each row is a set of elements that is sorted on its own. If we extend the quads over the rows of the
2D texture and use the interpolation, we can modulate the comparison so that the rows are sorted
alternately ascending and descending. This way, pairs of rows become bitonic sequences again which
can be sorted in the same way we sorted the columns of the single rows, simply by transposing the
quads.
As a final optimization we reduce texture fetching by packing two neighbouring key pairs into one
fragment, since the shader operates on 4-vectors.
Performance comparison:

std::sort: 16-bit data, Pentium 4 3.0 GHz
N          Full Sorts/Sec    Sorted Keys/Sec
256^2      82.5              5.4M
512^2      20.6              5.4M
1024^2     4.7               5.0M

Bitonic merge sort: 16-bit float data, NVIDIA Geforce 6800 Ultra
N          Passes    Full Sorts/Sec    Sorted Keys/Sec
256^2      120       90.07             6.1M
512^2      153       18.3              4.8M
1024^2     190       3.6               3.8M
GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise
sorting of the bitonic merge sort:

uniform sampler2D PackedData;
// contents of the texcoord data
#define OwnPos gl_TexCoord[0].xy
#define SearchDir gl_TexCoord[0].z
#define CompOp gl_TexCoord[0].w
#define Distance gl_TexCoord[1].x
#define Stride gl_TexCoord[1].y
#define Height gl_TexCoord[1].z
#define HalfStrideMHalf gl_TexCoord[1].w
void main(void)
{
    // get self
    vec4 self = texture2D(PackedData, OwnPos);
    // restore sign of search direction and assemble vector to partner
    vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);
    // get the partner
    vec4 partner = texture2D(PackedData, OwnPos + adr);
    // switch ascending/descending sort for every other row
    // by modifying comparison flag
    float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);
    // x and y are the keys of the two items
    // --> multiply with comparison flag
    vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);
    // compare the keys and store accordingly
    // z and w are the indices
    // --> just copy them accordingly
    // (the right-hand sides of the next three assignments are cut off in this copy;
    // they are reconstructed here from the comments and the packed key/index layout)
    vec4 result;
    result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
    result.yw = (keys.y < keys.w) ? self.yw : partner.yw;
    // do pass 0
    compare *= adr.x;
    gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
}

7 Current and future developments
Nvidia's current top of the line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an
evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader
count (from 128 to 240) [...]. Its launch date was the 17th of June 2008.
ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon HD
2900 XT) [...]. Its current top of the line model is the Radeon HD 3870 X2 (which actually sports 2
GPUs on one card), which was released in January 2008.
ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for
somewhere around July 2008).
With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU
has [...]. GPUs have made their way into non-graphics fields as varied as audio signal processing
and weather forecasting.
Both ATI/AMD and Nvidia have made efforts [...]. ATI/AMD released CTM (Close To Metal) [...]. CTM's
commercial successor, AMD Stream SDK, was released in December 2007, now providing additional high
level tools for general-purpose access to AMD graphics hardware.
Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, a C
language [...]. In February 2008, Nvidia bought Ageia and their PhysX engine (a proprietary realtime
physics engine middleware SDK) and integrated it into their CUDA framework.
[...] scale well beyond Moore's law [...]. With such a rapid development we can certainly expect to
see quite some interesting things to come in this field of processing.
References
[1] Wikipedia entry on GPUs, /wiki/GPU
[2] Kees Huizing, Han-Wei Shen: "The Graphics Rendering Pipeline", /~keesh/ow/2IV40/
[3] Cyril Zeller: "Introduction to the Hardware Graphics Pipeline", Presentation at ACM SIGGRAPH 2005, /developer/presentations/2005/I3D/I3D_05_
[4] ExtremeTech 3D Pipeline Tutorial, /article2/0,1697,9722,
[5] Ashu Rege: "Introduction to 3D Graphics for Games", /docs/IO/11278/
[6] DirectX Developer Center: "The Direct3D Transformation Pipeline", /en-us/library/bb206260(VS.85).aspx
[7] Mark Colbert: "GPU Architecture & CG", /gpuseminar/
[8] GPU Gems 2, Chapter 30: "The GeForce 6 Series GPU Architecture", /developer/GPU_Gems_2/GPU_Gems2_
[9] IEEE Micro, Volume 25, Issue 2 (March 2005): "The GeForce 6800", /?id=1069760
[10] "NV40-Technik im Detail", /artikel/nv40_pipeline/
[11] "NVIDIA GeForce 6800 Ultra (NV40)", /articles2/gffx/
[12] Austin Robison, Abe Winter: "An Overview of Graphics Processing Hardware", /~robison/src/gpu_
[13] John Montrym, Henry Moreton: "NVIDIA GeForce 6800", Hot Chips 16, /archives/hc16/2_Mon/13_HC16_Sess3_Pres1_
[14] Ajit Datar, Apurva Padhye: "Graphics Processing Unit Architecture", /~data0003/Talks/
[15] Sven Schenk: "Eine Einfuehrung in die Architektur moderner Graphikprozessoren", /Lehre/Seminar0506/
[16] Thomas Scott Crow: "Evolution of the Graphical Processing Unit", /~fredh/papers/thesis/023-crow/
[17] DirectX Developer Center: "Asm Shader Reference", /en-us/library/bb219840(VS.85).aspx
[18] Erik Lindholm, Stuart Oberman: "NVIDIA GeForce 8800 GPU", /archives/hc19/2_Mon/HC19.02/
[19] "Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)", /articles2/video/
[20] Richard Hough, Richard Yu: "GPU Architecture", /courses/ece685/slides/
[21] Technical Brief: "NVIDIA GeForce 8800 GPU Architecture Overview", /object/IO_
[22] GPU Gems 2, Chapter 46: "Improved GPU Sorting"
[23] Tim Purcell: "Sorting and Searching", SIGGRAPH 2005 GPGPU COURSE, /s2005/slides/
[24] Peter Kipfer, Mark Segal, Ruediger Westermann: "UberFlow: A GPU-Based Particle Engine", /previous/www_2004/Presentations/
[25] Wikipedia entry on Nvidia, /wiki/Nvidia_Corporation
[26] Wikipedia entry on ATI, /wiki/ATI_Technologies_Inc.
[27] Wikipedia entry on CUDA, /wiki/CUDA
[28] Wikipedia entry on CTM, /wiki/Close_to_Metal
[29] William Mark, Henry Moreton: "3D Graphics Architecture Tutorial", /users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_
GPUs-GraphicsProcessingUnits
MinhTriDoDinh
-Dinh@
VertiefungsseminarArchitekturvonProzessoren,SS2008
InstituteofComputerScience,UniversityofInnsbruck
July7,2008
Thisores
theirarchitectureandunderlyingdesignprinciples,usingchipsfromNvidia’s”Geforce”seriesas
examples.
1Introduction
BeforewediveintothearchitecturaldetailsofsomeexampleGPUs,we’llhavealookatsomebasicconcepts
ofgraphicsprocessingand3Dgraphics,whichwillmakeiteasierforustounderstandthefunctionalityof
GPUs
1.1WhatisaGPU?
AGPU(GraphicsProcessingUnit)isessentiallyadedicatedhardwaredevicethatisresponsiblefortrans-
paper,wewillfocusonthe3Dgraphics,sincethatis
whatmodernGPUsaremainlydesignedfor.
1.2Theanatomyofa3Dscene
Figure1:A3Dscene
3Dscene:Acollectionof3Dobjectsandlights.
1
Figure2:Object,triangleandvertices
3Dobjects:Arbitraryobjects,nsarecomposedof
vertices.
Vertex:APointwithspatialcoordinatesandotherinformationsuchascolorandtexturecoordinates.
Figure3:Acubewithacheckerboardtexture
Texture:Animagethatismappedontothesurfaceofa3Dobject,whichcreatestheillusionofanobject
ticesofanobjectstoretheso-calledtexturecoordinates
(2-dimensionalvectors)thatspecifyhowatextureismappedontoanygivensurface.
Figure4:Texturecoordinatesofatrianglewithabricktexture
2
Inordertotranslatesucha3Dscenetoa2Dimage,thedatahastogothroughseveralstagesofa”Graphics
Pipeline”
1.3TheGraphicsPipeline
Figure5:The3DGraphicsPipeline
First,amongsomeotheroperations,wehavetotranslatethedatathatisprovidedbytheapplicationfrom
3Dto2D.
1.3.1GeometryStage
Thisstageisalsoreferredtoasthe”TransformandLighting”rtotranslatethescenefrom
3Dto2D,alltheobjectsofasceneneedtobetransformedtovariousspaces-eachwithitsowncoordinate
ransformationsareappliedona
vertex-to-vertexbasis.
MathematicalPrinciples
Apointin3Dspaceusuallyhas3coordinates,epusing3-dimensional
vectorsforthetransformationcalculations,werunintotheproblemthatdifferenttransformationsrequire
diff:translatingavertexrequiresadditionwithavectorwhilerotatingavertexrequires
multiplicationwitha3x3matrix).Wecircumventthisproblemsimplybyextendingthe3-dimensionalvector
byanothercoordinate(thew-coordinate),y,
everytransformationcanbeappliedbymultiplyingthevectorwithaspecific4x4matrix,makingcalculations
mucheasier.
Figure6:Transformationmatricesfortranslation,rotationandscaling
3
Lighting,theothermajorpartofthispipelinestageiscalculatedusingthenormalvectorsofthesurfaces
inationwiththepositionofthecameraandthepositionofthelightsource,onecan
computethelightingpropertiesofagivenvertex.
Figure7:Calculatinglighting
Fortransformation,westartoutinthemodelspacewhereeachobject(model)hasitsowncoordinate
system,whichfacilitatesgeometrictransformationssuchastranslation,rotationandscaling.
Afterthat,wemoveontotheworldspace,whereallobjectswithinthescenehaveaunifiedcoordinate
system.
Figure8:Worldspacecoordinates
Thenextstepisthetransformationintoviewspace,whichlocatesacameraintheworldspaceandthen
transformsthescene,suchthatthecameraisattheoriginoftheviewspace,lookingstraightintothe
andefineaviewvolume,theso-calledviewfrustrum,whichwillbeusedto
decidewhatactuallyisvisibleandneedstoberendered.
4
Figure9:Thecamera/eye,theviewfrustrumanditsclippingplanes
Afterthat,theverticesaretransformedintoclipspaceandassembledintoprimitives(trianglesorlines),
bjectsthatareoutsideofthefrustrumdon’tneedto
berenderedandcanbediscarded,objectsthatarepartiallyinsidethefrustrumneedtobeclipped(hence
thename),andnewverticeswithpropertextureandcolorcoordinatesneedtobecreated.
Aperspectivedivideisthenperformed,whichtransformsthefrustrumintoacubewithnormalized
coordinates(xandybetween-1and1,zbetween0and1)whiletheobjectsinsidethefrustrumarescaled
thisnormalizedcubefacilitatesclippingoperationsandsetsuptheprojectioninto2D
space(thecubesimplyneedstobe”flattened”).
Figure10:Transformingintoclipspace
Finally,wecanmoveintoscreenspacewherexandycoordinatesaretransformedforproper2Ddisplay
(inagivenwindow).(Notethatthez-coordinateofavertexisretainedforlaterdepthoperations)
Figure11:Fromviewspacetoscreenspace
5
Note,thatthetexturecoordinatesneedtobetransformedaswellandadditionallybesidesclipping,sur-
facesthataren’tvisible(ksideofacube)areremovedaswell(so-calledbackfaceculling).
Theresultisa2Dimageofthe3Dscene,andwecanmoveontothenextstage.
1.3.2 Rasterization Stage
The rasterizer needs to traverse the 2D image and convert the data into a number of "pixel candidates",
so-called fragments, which may later become pixels of the final image. A fragment is a data structure that
contains attributes such as position, color, depth, texture coordinates, etc. Fragments are generated by
checking which part of any given primitive intersects with which pixel. If a fragment intersects with a
primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated
by interpolating the attributes between the vertices.
Figure 12: Rasterizing a triangle and interpolating its color values
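Fragment generation and attribute interpolation can be sketched as follows. This brute-force version tests every pixel centre against the triangle and interpolates the vertex colors with barycentric weights. Real rasterizers are far more elaborate, and the function names here are hypothetical.

```python
def edge(a, b, p):
    """Twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return {(x, y): color} for every pixel centre covered by the triangle."""
    area = edge(v0, v1, v2)
    fragments = {}
    if area == 0:
        return fragments              # degenerate triangle, nothing to draw
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)    # sample at the pixel centre
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # interpolate each color channel between the three vertices
                fragments[(x, y)] = tuple(
                    w0 * a + w1 * b + w2 * c for a, b, c in zip(c0, c1, c2))
    return fragments

red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
frags = rasterize((0, 0), (4, 0), (0, 4), red, green, blue, 4, 4)
```

A fragment close to the first vertex ends up mostly red, which mirrors the color gradient shown in Figure 12.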
After that, further steps can be made to obtain the final pixels. Colors are calculated by combining
textures with other attributes such as color and lighting, or by combining a fragment with either another
translucent fragment (so-called alpha blending) or optional fog (another graphical effect).
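Alpha blending is commonly expressed as a weighted average of the incoming (source) color and the color already stored (destination). The sketch below uses this standard "source over destination" formula as an assumption. The text above does not specify which blend equation is meant.

```python
def alpha_blend(src, dst, alpha):
    """Blend a translucent source color over a destination color."""
    return tuple(s * alpha + d * (1.0 - alpha) for s, d in zip(src, dst))

# A red fragment with 25% opacity over a blue background.
blended = alpha_blend((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.25)
```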
Visibility checks are performed, such as:
• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to the scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (checking visibility against translucent fragments)
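Of the checks listed above, the depth test is the easiest to sketch. The following assumes a depth range of 0 (near) to 1 (far) and a "less than" comparison; the tuple layout of a fragment is made up for this example.

```python
def depth_test(fragments, width, height):
    """Keep, per pixel, only the fragment closest to the camera.

    fragments: iterable of (x, y, z, color) tuples, with z in [0, 1].
    """
    far = 1.0
    depth = [[far] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    for x, y, z, c in fragments:
        if z < depth[y][x]:           # closer than what is stored so far
            depth[y][x] = z
            color[y][x] = c
    return color

image = depth_test([(0, 0, 0.8, "far"), (0, 0, 0.3, "near"), (0, 0, 0.5, "mid")], 2, 1)
```

The order in which the fragments arrive does not matter for the result, which is what makes this test attractive for hardware.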
Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of
pixels that can be written into memory for later display.
This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what
kind of functionality will be required of a GPU.
2 Evolution of the GPU
Some historical key points in the development of the GPU:
• Efforts for real-time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community (e.g.
"Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphics chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for
implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985 The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991 S3 introduced the first single-chip 2D accelerator, the S3 86C911
• 1995 Nvidia releases one of the first 3D accelerators, the NV1
• 1999 Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001 Nvidia implements the first programmable shader units with the Geforce 3
• 2005 ATI develops the first GPU with unified shader architecture with the ATI Xenos for the XBox
360
• 2006 Nvidia launches the first unified shader GPU for the PC with the Geforce 8800
3 From Theory to Practice - the Geforce 6800
3.1 Overview
Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using Nvidia's
Geforce 6800 as an example, we will have a closer look at the architecture of modern-day GPUs.
Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs
(besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.
Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics
chipsets and the fourth generation that featured programmability (more on that later).
The following image shows a schematic view of the Geforce 6800 and its functional units.
Figure 13: Schematic view of the Geforce 6800
You can already see how each of the functional units corresponds to a stage of the graphics pipeline.
We start with six parallel vertex processors that receive data from the host (the CPU) and perform
operations such as transformation and lighting.
Next, the output goes into the triangle setup stage, which takes care of primitive assembly, culling and
clipping. The Geforce 6800 has an additional Z-cull unit which allows an early fragment visibility check
based on depth, further improving efficiency.
We then move on to the sixteen fragment processors, which operate in 4 parallel units and compute the
output colors of each fragment.
The fragment crossbar is a linking element that is basically responsible for directing output pixels to any
available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.
The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending,
depth tests, etc., before delivering the final pixel to the frame buffer.
3.2 In Detail
Figure 14: A more detailed view of the Geforce 6800
While most parts of the GPU are fixed function units, the vertex and fragment processors of the Geforce
6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001).
We'll have a more detailed look at the units in the following sections.
3.2.1 Vertex Processor
Figure 15: A vertex processor
The vertex processors are the programmable units responsible for all the vertex transformations and
attribute calculations. They operate with 4-dimensional data vectors corresponding to the aforementioned
homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register).
Instructions are 123 bits long and are stored in the Instruction RAM.
The data path of a vertex processor consists of:
• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit
Instruction set:
Some notable instructions for the vertex processor include:
dp4 dst, src0, src1     Computes the four-component dot product of the source registers
exp dst, src            Provides exponential 2^x
dst dest, src0, src1    Calculates a distance vector
nrm dst, src            Normalize a 3D vector
rsq dst, src            Computes the reciprocal square root (positive only) of the source scalar
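To make the semantics of some of these instructions concrete, here is a small emulation in Python. It only mirrors the descriptions in the table above. Details such as precision, write masks and the handling of invalid inputs are ignored, and the function names are chosen for this sketch.

```python
import math

def dp4(a, b):
    """Four-component dot product."""
    return sum(x * y for x, y in zip(a, b))

def exp2(x):
    """Exponential 2^x."""
    return 2.0 ** x

def nrm(v):
    """Normalize the first three components of a vector."""
    length = math.sqrt(sum(c * c for c in v[:3]))
    return tuple(c / length for c in v[:3])

def rsq(x):
    """Reciprocal square root of a positive scalar."""
    return 1.0 / math.sqrt(x)

# Transforming a vertex with a 4x4 matrix takes four dp4 instructions,
# one per matrix row (here: a translation by (5, 0, -3)).
rows = [(1, 0, 0, 5), (0, 1, 0, 0), (0, 0, 1, -3), (0, 0, 0, 1)]
transformed = tuple(dp4(row, (2.0, 0.0, 0.0, 1.0)) for row in rows)
```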
Registers in the vertex processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components
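The register modifiers can be illustrated with a few helper functions operating on 4-tuples. The component names x, y, z, w follow the usual shader convention. The helpers themselves are hypothetical and only emulate the behaviour described above.

```python
INDEX = {"x": 0, "y": 1, "z": 2, "w": 3}

def swizzle(reg, pattern):
    """Rearrange or replicate components, e.g. 'wzyx' or 'xxxx'."""
    return tuple(reg[INDEX[ch]] for ch in pattern)

def negate(reg):
    """Negate every component of the register."""
    return tuple(-c for c in reg)

def write_masked(dst, value, mask):
    """Write only the components named in the mask, keep the others."""
    out = list(dst)
    for ch in mask:
        out[INDEX[ch]] = value[INDEX[ch]]
    return tuple(out)

r0 = (1, 2, 3, 4)
reversed_r0 = swizzle(r0, "wzyx")
partial = write_masked((0, 0, 0, 0), r0, "xz")
```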
Other technical details:
• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one
cycle
3.2.2 Fragment Processor
Figure 16: A fragment processor
The 16 fragment processors are grouped into 4 bigger units, each of which operates simultaneously
on 4 fragments (a so-called quad). They can take position, color, depth, fog as well as other
arbitrary 4-dimensional attributes as input.
The data path consists of:
• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit
Superscalarity:
A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of
the vector need to be treated separately (for example, alpha). Thus, the fragment processor supports co-issuing
of the data, which means splitting the vector into 2 parts and executing different operations on them in the
same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier).
Additionally, it also features dual issue, which means executing different operations on the 2 vector math
units in the same clock.
Texture Unit:
The texture unit is a floating-point texture processor which fetches and filters the texture data. It is
connected to a level 1 texture cache (which stores parts of the textures that are used).
Shader units 1 and 2:
Each shader unit is limited in its abilities, offering complete functionality only when used together.
Figure 17: Block diagram of Shader Unit 1 and 2
Shader Unit 1:
Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
Red: Interpolators
Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
Cyan: MUL channels
Orange: A unit for texture operations (not the fragment texture unit)
The shader unit can perform 2 operations per clock:
A MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2
MULs.
The output of the special function unit can go into the MUL channels.
The texture unit gets input from the MUL unit and does LOD (Level Of Detail) calculations, before passing
the data to the fragment texture unit. The fragment texture unit then performs the actual sampling
and writes the data into registers for the second shader unit.
The shader unit can simply pass data as well.
Shader Unit 2:
Red: A crossbar
Cyan: 4 MUL channels
Gray: 4 ADD channels
Yellow: 1 special function unit
The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).
The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.
Again, the shader unit [...] special function is used, the MAD unit can perform up to 2 operations from this
list: MUL, ADD, MAD, DP, or any other instruction based on these operations.
Instruction set:
Some notable instructions for the fragment processor include:
cmp dst, src0, src1, src2              Choose src1 if src0 >= 0. Otherwise, choose src2. The comparison
                                       is done per channel
dsx dst, src                           Compute the rate of change in the render target's x-direction
dsy dst, src                           Compute the rate of change in the render target's y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}    Computes sine and cosine, in radians
texld dst, src0, src1                  Sample a texture at a particular sampler, using provided texture
                                       coordinates
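The per-channel behaviour of `cmp` is what allows shaders to select values without branching. A minimal emulation, assuming the semantics given in the table above:

```python
def cmp(src0, src1, src2):
    """Per channel: take src1 where src0 >= 0, otherwise src2."""
    return tuple(b if a >= 0 else c for a, b, c in zip(src0, src1, src2))

selected = cmp((1, -1, 0, -5), (10, 20, 30, 40), (-1, -2, -3, -4))
```

Each of the four channels is decided independently, so a single instruction can pick from `src1` in some channels and from `src2` in others.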
Registers in the fragment processor instructions can be modified (with few exceptions):
• Negate the register value
• Take the absolute value of the register
• Mask destination register components
Other technical details:
• The fragment processors can perform operations with 16- or 32-bit floating point precision ([...]
unit uses only 16-bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle / 4 math operations if there's a texture
fetch in shader 1
Figure 18: Possible operations per cycle
• The fragment processors have a 2-level texture cache
• The fog unit can perform fog blending on the final color. It is implemented
with fixed point precision since that's sufficient for fog and saves performance.
The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets: the pixel processor can output to up to four separate
buffers (4x4 values, color + depth)
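The fog equation from the list above translates directly into code. The sketch uses floating point for readability, whereas the text states that the hardware unit works with fixed point precision; the example colors are made up.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(f * fog_fraction + s * (1.0 - fog_fraction)
                 for s, f in zip(src_color, fog_color))

# A red fragment that is half covered by grey fog.
fogged = fog_blend((1.0, 0.0, 0.0), (0.5, 0.5, 0.5), 0.5)
```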
3.2.3 Pixel Engine
Figure 19: A pixel engine
Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific
memory partition. Besides the lossless color and depth compression, the depth and color units
perform depth, color and stencil operations before writing the final pixel. When activated, the pixel engines
also perform multisample antialiasing.
3.2.4 Memory
From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":
"The memory system is partitioned into up to four independent memory partitions, each
with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules
rather than custom RAM technologies to take advantage of market economies and thereby reduce
cost. Having smaller, independent memory partitions allows the memory subsystem to operate
efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces
are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in
system memory. The four independent memory partitions give the GPU a wide (256 bits),
flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses
at near the 35 GB/sec physical limit."
3.3 Performance
• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction
operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32
multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors, the power sources
are recommended to be no less than 480 W)
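The memory bandwidth figure can be reproduced from the other numbers in this list and the 256-bit memory interface mentioned in section 3.2.4. The calculation assumes double data rate memory, i.e. two transfers per memory clock; the list itself does not state this.

```python
bus_width_bits = 256          # width of the memory interface (section 3.2.4)
memory_clock_hz = 550e6       # 550 MHz memory clock
transfers_per_clock = 2       # assumption: double data rate memory

bytes_per_transfer = bus_width_bits / 8
bandwidth = bytes_per_transfer * memory_clock_hz * transfers_per_clock
bandwidth_gb = bandwidth / 1e9   # in GByte/second
```

Under this assumption the result matches the 35.2 GByte/second listed above.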
4 Computational Principles
Stream Processing:
Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks when processing. GPUs are
very sensitive to such bottlenecks and therefore need a different architecture; they are essentially special
purpose stream processors.
A stream is a set of data, and a kernel is the function that is applied to it. In stream processors, every
kernel takes one or more streams as input and outputs one or more streams, while it executes its operations
on every single element of the input streams.
In stream processors you can achieve several levels of parallelism:
• Instruction level parallelism: kernels perform hundreds of instructions on every stream element, you
achieve parallelism by performing independent instructions in parallel
• Data level parallelism: kernels perform the same instructions on each stream element, you achieve
parallelism by performing one instruction on many stream elements at a time
• Task level parallelism: have multiple stream processors divide the work from one kernel
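The stream/kernel model can be illustrated in a few lines. The sketch below runs sequentially, but since the kernel touches each element independently, the loop iterations could be distributed over many processing units (data level parallelism). `run_kernel` and `scale_and_add` are names invented for this example.

```python
def run_kernel(kernel, *streams):
    """Apply a kernel to corresponding elements of one or more input streams."""
    return [kernel(*elements) for elements in zip(*streams)]

def scale_and_add(a, b):
    """A tiny example kernel: no state, no dependence on other elements."""
    return 2.0 * a + b

output_stream = run_kernel(scale_and_add, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```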
Stream processors do not use caching the same way traditional processors do, since the input datasets are
usually much larger than most caches and the data is barely reused - with GPUs, for example, the data is
usually rendered and then discarded.
We know GPUs have to work with large amounts of data, and the computations are simpler but they need
to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.
Continuing these ideas, GPUs employ the following strategies to increase output:
Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform
a single task. GPUs are pipelined, which means that instead of performing complete processing of a pixel
before moving on to the next, you fill the pipeline like an assembly line where each component performs a
task on the data before passing it on. While processing a pixel may take multiple clock
cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.
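The effect of pipelining on throughput can be estimated with a simple model. It assumes that every stage takes exactly one clock and that the pipeline never stalls, which is an idealization.

```python
def cycles_unpipelined(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages

def cycles_pipelined(items, stages):
    """After the pipeline is filled, one item completes per clock."""
    return stages + items - 1

slow = cycles_unpipelined(1000, 5)
fast = cycles_pipelined(1000, 5)
```

For 1000 pixels and 5 stages the pipelined version needs 1004 clocks instead of 5000, which approaches one pixel per clock as the number of pixels grows.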
Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis)
and the type of processing (highly repetitive), GPUs are very suitable for parallelism. You could have an
unlimited number of pipelines next to each other, as long as the CPU is able to keep them busy.
Other GPU characteristics:
• GPUs can afford large amounts of floating point computational power since they have lower control
overhead
• They use dedicated functional units for specialized tasks to increase speeds
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage,
employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM
cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data. Rather than implementing
a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate
with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be
designed accordingly
4.1 The Geforce 6800 as a general processor
You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory
bandwidth that can be used for other applications as well.
Figure 20: A general view of the Geforce 6800 architecture
Looking at the GPU that way, we get:
• 2 serially running programmable blocks with fp32 capability.
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data
"point" to multiple "fragments").
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point
values at a time.
• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the
Z-cull unit
5 The next step: the Geforce 8800
After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the
Geforce 8800. Driven by the desire to increase performance, improve image quality and facilitate
programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader
architecture (note that ATI already used this architecture in 2005 with the XBOX 360 GPU).
Figure 21: From dedicated to unified architecture
Figure 22: A schematic view of the Geforce 8800
The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different
shader stages become one single stage that can handle all the different shaders.
As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor
array. We still have familiar units such as the raster operators (blue, at the bottom) and the triangle setup.
Besides these units we now have several managing units that prepare and
manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread
processor).
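Figure 23: The streaming processor array
The streaming processor array consists of 8 texture processor clusters. Each texture processor cluster in
turn consists of 2 streaming multiprocessors and 1 texture pipe.
A streaming multiprocessor has 8 streaming processors. The streaming processors
work with 32-bit scalar data, based on the idea that shader programs are becoming more and more
scalar, making a vector architecture more inefficient. They are driven by a high-speed clock that is separate
from the core clock. Each multiprocessor can
have 768 hardware scheduled threads, grouped together into 24 SIMD "warps" (a warp is a group of threads).
The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture
prefetching and filtering without consuming ALU resources, further increasing efficiency.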
For example, the old
problem of constantly changing workload and one shader stage becoming a processing bottleneck is solved,
since the units can adapt dynamically now that they are unified.
Figure 24: Workload balancing with both architectures
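A toy model may help to see why unified units balance the workload better. It assumes that work can be measured in abstract units, that the stages of a dedicated design overlap perfectly and that scheduling costs nothing. None of these assumptions come from the text above, and the numbers are purely illustrative.

```python
def time_dedicated(vertex_work, pixel_work, vertex_units, pixel_units):
    """With fixed units, the slower stage determines the frame time."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def time_unified(vertex_work, pixel_work, units):
    """Unified units can be assigned to whichever kind of work is pending."""
    return (vertex_work + pixel_work) / units

# Hypothetical vertex-heavy frame on 6 + 16 dedicated units vs. 22 unified ones.
dedicated = time_dedicated(1200, 800, 6, 16)
unified = time_unified(1200, 800, 22)
```

In the dedicated case the pixel units sit idle for most of the frame while the six vertex units are the bottleneck. The unified design spreads the same work over all 22 units.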
With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support
of new data types (integer calculations), programming the GPU now becomes easier as well.
6 General Purpose Programming on the GPU - an example
We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.
Bitonic merge sort:
Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A
bitonic list is a concatenation of two monotonic lists, one ascending and one descending.
E.g.:
List A = (3,4,7,8)
List B = (6,5,2,1)
List AB = (3,4,7,8,6,5,2,1) is a bitonic list
List BA = (6,5,2,1,3,4,7,8) is also a bitonic list
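The definition can be checked mechanically: a list is bitonic in the sense used here if it changes direction at most once. The following helper is only an illustration of the definition and is not part of the sorting algorithm itself.

```python
def is_bitonic(values):
    """True if the list rises then falls, falls then rises, or is monotonic."""
    directions = []
    for a, b in zip(values, values[1:]):
        if a != b:
            d = 1 if b > a else -1
            # record a new entry only when the direction changes
            if not directions or directions[-1] != d:
                directions.append(d)
    return len(directions) <= 2

ab = is_bitonic((3, 4, 7, 8, 6, 5, 2, 1))
ba = is_bitonic((6, 5, 2, 1, 3, 4, 7, 8))
```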
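Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic list of
length 2n. You can rearrange the list so that you get two halves with n elements where each element (i) of
the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or
equal, if the list descends first and then ascends). This happens by comparing each element (i) with its
partner (i+n) and swapping the two where necessary. This procedure is called a bitonic merge.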
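A single step of this procedure can be sketched as follows, using List AB from the example above. The function name is made up. A complete merge would apply the same step again to each half until the pieces have length 1.

```python
def bitonic_merge_step(values):
    """Compare element i with element i + n and move the smaller one to the front."""
    n = len(values) // 2
    out = list(values)
    for i in range(n):
        if out[i] > out[i + n]:
            out[i], out[i + n] = out[i + n], out[i]
    return out

halves = bitonic_merge_step([3, 4, 7, 8, 6, 5, 2, 1])
```

All n comparisons are independent of each other, which is why this step maps well onto parallel hardware.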
Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until
the whole list is sorted. Figure 25 illustrates the process:
Figure 25: The different stages of the algorithm
The sorting process takes log(n) stages, where stage k consists of k passes. This results in bitonic merge
sort having a complexity of O(n log^2(n) + log(n)), which is worse than quicksort, but the algorithm has no
worst-case scenario (where quicksort reaches O(n^2)).
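For reference, here is a compact sequential version of bitonic merge sort that also counts the passes. It uses the common iterative formulation with index arithmetic, which is an assumption about how one might write it on a CPU. It is not the GPU implementation discussed in this section, and it requires the input length to be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two; return (sorted list, passes)."""
    n = len(values)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    a = list(values)
    passes = 0
    k = 2                              # size of the bitonic lists being merged
    while k <= n:
        j = k // 2                     # distance between compared partners
        while j >= 1:
            for i in range(n):         # every iteration is independent
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            passes += 1
            j //= 2
        k *= 2
    return a, passes

result, passes = bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5])
```

The number of passes depends only on the length of the input, never on its contents, which is the reason the algorithm has no worst case.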
In every pass, the operations can be performed in parallel and the length stays constant, which suits the
GPU well. When implementing this algorithm on the GPU, we want to make use of as many resources as possible
(both in parallel as well as vertically along the pipeline), especially considering that the GPU has
shortcomings as well, such as the lack of support for [...]. For example, simply letting the fragment
processor stage handle all the calculations might work, but it would leave the other units unused. A possible
solution looks like this:
In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while
adjacent groups operate in opposite directions. If we draw a vertex quad over two adjacent groups and
set appropriate flags at each corner, the rasterizer interpolates these flags for every fragment. For
example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs
by simply looking at its sign (the interpolation process takes care of that).
Figure 26: Using vertex flags and the interpolator to determine compare operations
Next, we need to determine which compare operation to use and we need to locate the partner item to
compare with. Both can again be accomplished by using the flag. Setting the compare operation to less-than
and multiplying with the flag value implicitly flips the operation to greater-equal halfway across the quad.
Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies
the distance between the items.
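The flag trick can be emulated on the CPU. In the sketch below, the interpolated flag runs from +1 at the left edge of a quad to -1 at the right edge, so that its sign serves both as the search direction and as the multiplier for the comparison. This orientation and the function name are choices made for this example; the fragment loop stands in for what the rasterizer and the fragment processors would do in parallel.

```python
def merge_pass_with_flags(keys, distance):
    """One compare pass; every quad spans two adjacent groups of `distance` items."""
    width = 2 * distance
    out = []
    for i, own in enumerate(keys):
        position = i % width
        # linearly interpolated flag, sampled at the fragment centre
        flag = 1.0 - 2.0 * (position + 0.5) / width
        sign = 1 if flag > 0 else -1
        partner = keys[i + sign * distance]
        # a single less-than; the sign flips it halfway across the quad
        out.append(own if own * sign < partner * sign else partner)
    return out

merged = merge_pass_with_flags([3, 4, 7, 8, 6, 5, 2, 1], 4)
```

Every fragment runs exactly the same code. Only the interpolated flag differs, which is what makes the approach suitable for SIMD hardware.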
In order to sort elements of an array, we store them in a 2D texture.
Each row is first sorted on its own as described above. If we extend
the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so that
the sort direction alternates from row to row. This way, pairs of rows become bitonic
sequences again, which can be sorted in the same way we sorted the columns of the single rows, simply by
transposing the quads.
As a final optimization we reduce texture fetching by packing two neighbouring key pairs into one
fragment, since the shader operates on 4-vectors.
Performance comparison:

std::sort: 16-Bit Data, Pentium 4 3.0 GHz
N         Full Sorts/Sec   Sorted Keys/Sec
256^2     82.5             5.4 M
512^2     20.6             5.4 M
1024^2    4.7              5.0 M

Bitonic Merge Sort: 16-Bit Float Data, NVIDIA Geforce 6800 Ultra
N         Passes   Full Sorts/Sec   Sorted Keys/Sec
256^2     120      90.07            6.1 M
512^2     153      18.3             4.8 M
1024^2    190      3.6              3.8 M
GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise
sorting of the bitonic merge sort:

    uniform sampler2D PackedData;
    // contents of the texcoord data
    #define OwnPos gl_TexCoord[0].xy
    #define SearchDir gl_TexCoord[0].z
    #define CompOp gl_TexCoord[0].w
    #define Distance gl_TexCoord[1].x
    #define Stride gl_TexCoord[1].y
    #define Height gl_TexCoord[1].z
    #define HalfStrideMHalf gl_TexCoord[1].w
    void main(void)
    {
        // get self
        vec4 self = texture2D(PackedData, OwnPos);
        // restore sign of search direction and assemble vector to partner
        vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);
        // get the partner
        vec4 partner = texture2D(PackedData, OwnPos + adr);
        // switch ascending/descending sort for every other row
        // by modifying comparison flag
        float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);
        // x and y are the keys of the two items
        // --> multiply with comparison flag
        vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);
        // compare the keys and store accordingly
        // z and w are the indices
        // --> just copy them accordingly
        vec4 result;
        result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
        result.yw = (keys.y < keys.w) ? self.yw : partner.yw;
        // do pass 0
        compare *= adr.x;
        gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
    }

7 Current and future developments
Nvidia's current top of the line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an
evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader count
(from 128 to 240). Its launch date was the 17th of June 2008.
ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon
HD 2900 XT). Their current top of the line model is the Radeon HD 3870 X2 (which actually sports 2 GPUs
on one card), which was released in January 2008.
ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for
somewhere around July 2008).
With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU has
received growing attention, and GPUs have made their way into non-graphics fields as varied as
audio signal processing and weather forecasting.
Both ATI/AMD and Nvidia have made efforts to open their hardware to general-purpose programming. ATI/AMD
released CTM (Close To Metal), a low-level programming interface. CTM's
commercial successor, the AMD Stream SDK, was released in December 2007, now providing additional high
level tools for general-purpose access to AMD graphics hardware.
Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, which
allows programming its GPUs in a C-like language. In February 2008, Nvidia
bought Ageia and their PhysX engine (a proprietary realtime physics engine middleware SDK) and
integrated it into their CUDA framework.
GPU performance may eventually scale well beyond Moore's law. With such rapid development we can
certainly expect to see quite some interesting things to come in this field of processing.
References
[1] Wikipedia entry on GPUs /wiki/GPU
[2] Kees Huizing, Han-Wei Shen: "The Graphics Rendering Pipeline" /~keesh/ow/2IV40/
[3] Cyril Zeller: "Introduction to the Hardware Graphics Pipeline", Presentation at ACM SIGGRAPH 2005 /developer/presentations/2005/I3D/I3D_05_
[4] ExtremeTech 3D Pipeline Tutorial /article2/0,1697,9722,
[5] Ashu Rege: "Introduction to 3D Graphics for Games" /docs/IO/11278/
[6] DirectX Developer Center: "The Direct3D Transformation Pipeline" /en-us/library/bb206260(VS.85).aspx
[7] Mark Colbert: "GPU Architecture & CG" /gpuseminar/
[8] GPU Gems 2, Chapter 30: "The GeForce 6 Series GPU Architecture" /developer/GPU_Gems_2/GPU_Gems2_
[9] IEEE Micro, Volume 25, Issue 2 (March 2005): "The GeForce 6800" /?id=1069760
[10] "NV40-Technik im Detail" /artikel/nv40_pipeline/
[11] "NVIDIA GeForce 6800 Ultra (NV40)" /articles2/gffx/
[12] Austin Robison, Abe Winter: "An Overview of Graphics Processing Hardware" /~robison/src/gpu_
[13] John Montrym, Henry Moreton: "NVIDIA GeForce 6800", Hot Chips 16 /archives/hc16/2_Mon/13_HC16_Sess3_Pres1_
[14] Ajit Datar, Apurva Padhye: "Graphics Processing Unit Architecture" /~data0003/Talks/
[15] Sven Schenk: "Eine Einfuehrung in die Architektur moderner Graphikprozessoren" /Lehre/Seminar0506/
[16] Thomas Scott Crow: "Evolution of the Graphical Processing Unit" /~fredh/papers/thesis/023-crow/
[17] DirectX Developer Center: "Asm Shader Reference" /en-us/library/bb219840(VS.85).aspx
[18] Erik Lindholm, Stuart Oberman: "NVIDIA GeForce 8800 GPU" /archives/hc19/2_Mon/HC19.02/
[19] "Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)" /articles2/video/
[20] Richard Hough, Richard Yu: "GPU Architecture" /courses/ece685/slides/
[21] Technical Brief: "NVIDIA GeForce 8800 GPU Architecture Overview" /object/IO_
[22] GPU Gems 2, Chapter 46: "Improved GPU Sorting"
[23] Tim Purcell: "Sorting and Searching", SIGGRAPH 2005 GPGPU COURSE /s2005/slides/
[24] Peter Kipfer, Mark Segal, Ruediger Westermann: "UberFlow: A GPU-Based Particle Engine" /previous/www_2004/Presentations/
[25] Wikipedia entry on Nvidia /wiki/Nvidia_Corporation
[26] Wikipedia entry on ATI /wiki/ATI_Technologies_Inc.
[27] Wikipedia entry on CUDA /wiki/CUDA
[28] Wikipedia entry on CTM /wiki/Close_to_Metal
[29] William Mark, Henry Moreton: "3D Graphics Architecture Tutorial" /users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_