GPU Architecture and the Graphics Pipeline

Posted July 22, 2024 (author: 宜方仪)

GPUs - Graphics Processing Units

Minh Tri Do Dinh

-Dinh@

Vertiefungsseminar Architektur von Prozessoren, SS 2008

Institute of Computer Science, University of Innsbruck

July 7, 2008

This paper explores GPUs, their architecture and underlying design principles, using chips from Nvidia's "Geforce" series as examples.

1 Introduction

Before we dive into the architectural details of some example GPUs, we'll have a look at some basic concepts of graphics processing and 3D graphics, which will make it easier for us to understand the functionality of GPUs.

1.1 What is a GPU?

A GPU (Graphics Processing Unit) is essentially a dedicated hardware device that is responsible for translating data into a 2D image formed by pixels. In this paper, we will focus on 3D graphics, since that is what modern GPUs are mainly designed for.

1.2 The anatomy of a 3D scene

Figure 1: A 3D scene

3D scene: A collection of 3D objects and lights.

Figure 2: Object, triangle and vertices

3D objects: Arbitrary objects, whose geometry consists of triangular polygons. Polygons are composed of vertices.

Vertex: A point with spatial coordinates and other information such as color and texture coordinates.

Figure 3: A cube with a checkerboard texture

Texture: An image that is mapped onto the surface of a 3D object, which creates the illusion of an object consisting of a certain material. The vertices of an object store the so-called texture coordinates (2-dimensional vectors) that specify how a texture is mapped onto any given surface.

Figure 4: Texture coordinates of a triangle with a brick texture

In order to translate such a 3D scene to a 2D image, the data has to go through several stages of a "Graphics Pipeline".

1.3 The Graphics Pipeline

Figure 5: The 3D Graphics Pipeline

First, among some other operations, we have to translate the data that is provided by the application from 3D to 2D.

1.3.1 Geometry Stage

This stage is also referred to as the "Transform and Lighting" stage. In order to translate the scene from 3D to 2D, all the objects of a scene need to be transformed to various spaces - each with its own coordinate system. These transformations are applied on a per-vertex basis.

Mathematical Principles

A point in 3D space usually has 3 coordinates, specifying its position. If we keep using 3-dimensional vectors for the transformation calculations, we run into the problem that different transformations require different operations (e.g.: translating a vertex requires addition with a vector while rotating a vertex requires multiplication with a 3x3 matrix). We circumvent this problem simply by extending the 3-dimensional vector by another coordinate (the w-coordinate), thus getting so-called homogeneous coordinates. This way, every transformation can be applied by multiplying the vector with a specific 4x4 matrix, making calculations much easier.

Figure 6: Transformation matrices for translation, rotation and scaling
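The homogeneous-coordinate idea can be illustrated with a small sketch. It is not taken from the paper: the helper names (`mat_vec`, `translation`, `scaling`, `rotation_z`) are hypothetical, matrices are plain lists of rows, and column vectors are assumed. The point is only that translation, scaling and rotation all reduce to the same 4x4 matrix-times-vector operation once w = 1 is appended.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) with a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    # The offsets live in the fourth column and are picked up by w = 1.
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def scaling(sx, sy, sz):
    return [[sx, 0, 0, 0],
            [0, sy, 0, 0],
            [0, 0, sz, 0],
            [0, 0, 0, 1]]

def rotation_z(angle):
    # Rotation about the z-axis by `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

vertex = [1.0, 0.0, 0.0, 1.0]            # (x, y, z, w) with w = 1 for a point
moved = mat_vec(translation(2, 3, 4), vertex)
rotated = mat_vec(rotation_z(math.pi / 2), vertex)
```

Because every transformation has the same shape, a sequence of them can be pre-multiplied into one matrix and applied to each vertex with a single multiplication.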

Lighting, the other major part of this pipeline stage, is calculated using the normal vectors of the surfaces of an object. In combination with the position of the camera and the position of the light source, one can compute the lighting properties of a given vertex.

Figure 7: Calculating lighting
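As a rough illustration of per-vertex lighting, the sketch below computes only a diffuse term from the surface normal and the direction to the light. The paper does not name a lighting model, so the Lambertian cosine rule used here is an assumption, as are the function names; a specular term would additionally need the camera position mentioned above.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def diffuse_intensity(vertex_pos, normal, light_pos):
    """Lambertian diffuse factor in [0, 1] (assumed model, not from the paper)."""
    to_light = normalize([l - p for l, p in zip(light_pos, vertex_pos)])
    # Surfaces facing away from the light receive no diffuse light.
    return max(0.0, dot(normalize(normal), to_light))
```

A vertex whose normal points straight at the light gets the full intensity, and the value falls off with the cosine of the angle between the two directions.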

For transformation, we start out in the model space where each object (model) has its own coordinate system, which facilitates geometric transformations such as translation, rotation and scaling.

After that, we move on to the world space, where all objects within the scene have a unified coordinate system.

Figure 8: World space coordinates

The next step is the transformation into view space, which locates a camera in the world space and then transforms the scene, such that the camera is at the origin of the view space, looking straight into the scene. We can then define a view volume, the so-called view frustum, which will be used to decide what actually is visible and needs to be rendered.

Figure 9: The camera/eye, the view frustum and its clipping planes

After that, the vertices are transformed into clip space and assembled into primitives (triangles or lines), which sets up the clipping process. Objects that are outside of the frustum don't need to be rendered and can be discarded, objects that are partially inside the frustum need to be clipped (hence the name), and new vertices with proper texture and color coordinates need to be created.

A perspective divide is then performed, which transforms the frustum into a cube with normalized coordinates (x and y between -1 and 1, z between 0 and 1) while the objects inside the frustum are scaled accordingly. Having this normalized cube facilitates clipping operations and sets up the projection into 2D space (the cube simply needs to be "flattened").

Figure 10: Transforming into clip space

Finally, we can move into screen space where x and y coordinates are transformed for proper 2D display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth operations.)

Figure 11: From view space to screen space
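The last two steps can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, and the y-flip assumes a window whose origin is in the top-left corner.

```python
def perspective_divide(clip):
    """Divide clip-space x, y, z by w to obtain normalized coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def to_screen(ndc, width, height):
    """Map normalized x, y in [-1, 1] to pixel coordinates of a window.

    Assumption: the window origin is the top-left corner, so y is flipped.
    The z value is passed through unchanged for the later depth test.
    """
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height
    return (sx, sy, z)

ndc = perspective_divide((2.0, -2.0, 1.0, 4.0))
pixel = to_screen(ndc, 800, 600)
```

"Flattening" the normalized cube then amounts to using only the first two components for the pixel position.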

Note that the texture coordinates need to be transformed as well and that, besides clipping, surfaces that aren't visible (e.g. the backside of a cube) are removed as well (so-called back face culling).

The result is a 2D image of the 3D scene, and we can move on to the next stage.

1.3.2 Rasterization Stage

The rasterizer needs to traverse the 2D image and convert the data into a number of "pixel-candidates", so-called fragments, which may later become pixels of the final image. A fragment is a data structure that contains attributes such as position, color, depth, texture coordinates, etc. Fragments are generated by checking which part of any given primitive intersects with which pixel of the screen. If a fragment intersects with a primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated by interpolating the attributes between the vertices.

Figure 12: Rasterizing a triangle and interpolating its color values
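One common way to implement both the coverage check and the interpolation is with barycentric weights derived from edge functions. The paper does not say which method the hardware uses, so the sketch below is only an assumed illustration with hypothetical names; it tests every pixel center of a small grid and blends the three vertex colors.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return (x, y, color) for every pixel whose center lies in the triangle."""
    fragments = []
    area = edge(v0, v1, v2)
    if area == 0:
        return fragments                      # degenerate triangle
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)            # sample at the pixel center
            w0 = edge(v1, v2, p) / area       # weight of v0
            w1 = edge(v2, v0, p) / area       # weight of v1
            w2 = edge(v0, v1, p) / area       # weight of v2
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                fragments.append((x, y, color))
    return fragments

frags = rasterize((0, 0), (4, 0), (0, 4),
                  (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 4, 4)
```

The same weights can interpolate any other vertex attribute, such as depth or texture coordinates.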

After that, further steps can be made to obtain the final pixels. Colors are calculated by combining textures with other attributes such as color and lighting or by combining a fragment with either another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).

Visibility checks are performed, such as:

• Scissor test (checking visibility against a rectangular mask)

• Stencil test (similar to scissor test, only against arbitrary pixel masks in a buffer)

• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)

• Alpha test (checking visibility against translucent fragments)
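Two of these operations, the depth test and alpha blending, can be sketched in a few lines. This is an assumed illustration with hypothetical names: smaller z is taken to mean "closer", and only fully opaque fragments update the depth buffer, which is a common convention rather than something stated in the paper.

```python
def depth_test(fragment_z, buffer_z):
    """Pass only fragments that are closer than what is already stored."""
    return fragment_z < buffer_z

def alpha_blend(src_rgb, src_alpha, dst_rgb):
    """Blend a translucent source color over the stored destination color."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))

def write_fragment(framebuffer, depthbuffer, x, y, z, rgb, alpha=1.0):
    """Apply the depth test, then blend the fragment into the framebuffer."""
    if not depth_test(z, depthbuffer[y][x]):
        return False                          # fragment is hidden, discard it
    framebuffer[y][x] = alpha_blend(rgb, alpha, framebuffer[y][x])
    if alpha == 1.0:
        depthbuffer[y][x] = z                 # assumption: opaque fragments only
    return True

fb = [[(0.0, 0.0, 0.0)]]                      # 1x1 color buffer, cleared to black
db = [[1.0]]                                  # 1x1 depth buffer, cleared to "far"
```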

Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of pixels that can be written into memory for later display.

This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what kind of functionality will be required of a GPU.

2 Evolution of the GPU

Some historical key points in the development of the GPU:

• Efforts for real time graphics have been made as early as 1944 (MIT's project "Whirlwind")

• In the 1980s, hardware similar to modern GPUs began to show up in the research community ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)

• Graphic chips in the early 1980s were very limited in their functionality

• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for implementing high-end GPUs (e.g. Texas Instruments' TMS340)

• 1985 The first mass-market graphics accelerator was included in the Commodore Amiga

• 1991 S3 introduced the first single chip 2D-accelerator, the S3 86C911

• 1995 Nvidia releases one of the first 3D accelerators, the NV1

• 1999 Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware

• 2001 Nvidia implements the first programmable shader units with the Geforce 3

• 2005 ATI develops the first GPU with unified shader architecture with the ATI Xenos for the XBox 360

• 2006 Nvidia launches the first unified shader GPU for the PC with the Geforce 8800

3 From Theory to Practice - the Geforce 6800

3.1 Overview

Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using Nvidia's Geforce 6800 as an example we will have a closer look at the architecture of modern day GPUs.

Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs (besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.

Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics chipsets and the fourth generation that featured programmability (more on that later).

The following image shows a schematic view of the Geforce 6800 and its functional units.

Figure 13: Schematic view of the Geforce 6800

You can already see how each of the functional units corresponds to the stages of the graphics pipeline.

We start with six parallel vertex processors that receive data from the host (the CPU) and perform operations such as transformation and lighting.

Next, the output goes into the triangle setup stage which takes care of primitive assembly, culling and clipping, and then into the rasterizer which produces the fragments. The Geforce 6800 has an additional Z-cull unit which allows an early fragment visibility check based on depth, further improving the efficiency.

We then move on to the sixteen fragment processors which operate in 4 parallel units and compute the output colors of each fragment.

The fragment crossbar is a linking element that is basically responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.

The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending, depth tests, etc., before delivering the final pixel to the frame buffer.

3.2 In Detail

Figure 14: A more detailed view of the Geforce 6800

While most parts of the GPU are fixed function units, the vertex and fragment processors of the Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001). We'll have a more detailed look at the units in the following sections.

3.2.1 Vertex Processor

Figure 15: A vertex processor

The vertex processors are the programmable units responsible for all the vertex transformations and attribute calculations. They operate with 4-dimensional data vectors corresponding with the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the Instruction RAM.

The data path of a vertex processor consists of:

• A multiply-add unit for 4-dimensional vectors

• A scalar special function unit

• A texture unit

Instruction set:

Some notable instructions for the vertex processor include:

dp4 dst, src0, src1     Computes the four-component dot product of the source registers

exp dst, src            Provides exponential 2^x

dst dest, src0, src1    Calculates a distance vector

nrm dst, src            Normalizes a 3D vector

rsq dst, src            Computes the reciprocal square root (positive only) of the source scalar

Registers in the vertex processor instructions can be modified (with few exceptions):

• Negate the register value

• Take the absolute value of the register

• Swizzling (copy any source register component to any temporary register component)

• Mask destination register components
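To make the register modifiers more concrete, here is a small emulation in Python. It is only an illustrative sketch with hypothetical helper names; registers are modeled as 4-element lists with components x, y, z, w, and no claim is made about how the hardware implements these operations.

```python
COMPONENTS = "xyzw"

def swizzle(reg, pattern):
    """Read source components in an arbitrary order, e.g. 'wzyx' or 'xxxx'."""
    return [reg[COMPONENTS.index(c)] for c in pattern]

def negate(reg):
    """Source modifier: negate every component."""
    return [-c for c in reg]

def dp4(src0, src1):
    """Four-component dot product, like the dp4 instruction."""
    return sum(a * b for a, b in zip(src0, src1))

def write_masked(dst, value, mask):
    """Write only the components named in `mask`; the rest keep their value."""
    out = list(dst)
    for c in mask:
        i = COMPONENTS.index(c)
        out[i] = value[i]
    return out

r0 = [1.0, 2.0, 3.0, 4.0]
r1 = write_masked([0.0] * 4, swizzle(negate(r0), "wzyx"), "xz")
```

A 4x4 matrix transformation of a vertex can be expressed as four such dot products, one per matrix row, each written to a single masked destination component.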

Other technical details:

• Vertex processors are MIMD units (Multiple Instruction Multiple Data)

• They use VLIW (Very Long Instruction Words)

• They operate with 32-bit floating point precision

• Each vertex processor runs up to 3 threads to hide latency

• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one cycle

3.2.2 Fragment Processor

Figure 16: A fragment processor

The Geforce 6800 has 16 fragment processors. They are grouped to 4 bigger units which operate simultaneously on 4 fragments each (a so-called quad). They can take position, color, depth, fog as well as other arbitrary 4-dimensional attributes as input.

The data path consists of:

• An interpolation block for attributes

• 2 vector math (shader) units, each with slightly different functionality

• A fragment texture unit

Superscalarity:

A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of the vector need to be treated separately (e.g. color, alpha). Thus, the fragment processor supports co-issuing of the data, which means splitting the vector into 2 parts and executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier).

Additionally, it also features dual issue, which means executing different operations on the 2 vector math units in the same clock.

Texture Unit:

The texture unit is a floating-point texture processor which fetches and filters the texture data. It is connected to a level 1 texture cache (which stores parts of the textures that are used).

Shader units 1 and 2:

Each shader unit is limited in its abilities, offering complete functionality when used together.

Figure 17: Block diagram of Shader Unit 1 and 2

Shader Unit 1:

Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback

Red: Interpolators

Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)

Cyan: MUL channels

Orange: A unit for texture operations (not the fragment texture unit)

The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs.

The output of the special function unit can go into the MUL channels.

The texture operations unit gets input from the MUL unit and does LOD (Level Of Detail) calculations, before passing the data on to the fragment texture unit. The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit.

The shader unit can simply pass data as well.

Shader Unit 2:

Red: A crossbar

Cyan: 4 MUL channels

Gray: 4 ADD channels

Yellow: 1 special function unit

The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).

The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.

Again, the shader unit can handle 2 operations per clock. If no special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

Instruction set:

Some notable instructions for the fragment processor include:

cmp dst, src0, src1, src2              Choose src1 if src0 >= 0. Otherwise, choose src2. The comparison is done per channel

dsx dst, src                           Compute the rate of change in the render target's x-direction

dsy dst, src                           Compute the rate of change in the render target's y-direction

sincos dst.{x|y|xy}, src0.{x|y|z|w}    Computes sine and cosine, in radians

texld dst, src0, src1                  Sample a texture at a particular sampler, using provided texture coordinates

Registers in the fragment processor instructions can be modified (with few exceptions):

• Negate the register value

• Take the absolute value of the register

• Mask destination register components

Other technical details:

• The fragment processors can perform operations with 16- or 32-bit floating point precision (the [...] unit uses only 16-bit precision for its calculations since that is sufficient)

• The quads operate as SIMD units

• They use VLIW

• They run up to 100s of threads to hide texture fetch latency (~256 per quad)

• A fragment processor can perform up to 8 operations per cycle, or 4 math operations if there's a texture fetch in shader 1

Figure 18: Possible operations per cycle

• The fragment processors have a 2-level texture cache

• The fog unit can perform fog blending on the final color. It is implemented with fixed point precision since that's sufficient for fog and saves performance.

The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)

• There's support for multiple render targets: the pixel processor can output to up to four separate buffers (4x4 values, color + depth)
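The fog equation from the list above translates directly into code. The following sketch applies it per color channel using floating point values for readability (the hardware unit is described as fixed point); the function name is hypothetical.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(f * fog_fraction + s * (1.0 - fog_fraction)
                 for s, f in zip(src_color, fog_color))

red = (1.0, 0.0, 0.0)
grey_fog = (0.5, 0.5, 0.5)
half_fogged = fog_blend(red, grey_fog, 0.5)
```

A fog fraction of 0 leaves the fragment color untouched, while a fraction of 1 replaces it entirely with the fog color.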

3.2.3 Pixel Engine

Figure 19: A pixel engine

Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific memory partition of the GPU. After the lossless color and depth compression, the depth and color units perform depth, color and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.

3.2.4 Memory

From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":

"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."

3.3 Performance

• 425 MHz internal graphics clock

• 550 MHz memory clock

• 256 MB memory size

• 35.2 GByte/second memory bandwidth

• 600 million vertices/second

• 6.4 billion texels/second

• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)

• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)

• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle

• 64 pixels per clock cycle early z-cull (reject rate)

• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)

• Up to 120 W energy consumption (the card has two additional power connectors, the power sources are recommended to be no less than 480 W)
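The memory bandwidth figure can be reproduced from the bus width and memory clock given above. The sketch assumes double data rate memory, i.e. two transfers per memory clock; that factor is not stated in the list, so treat it as an assumption.

```python
bus_width_bits = 256          # four memory partitions, 256 bits in total
memory_clock_hz = 550e6       # 550 MHz memory clock
transfers_per_clock = 2       # assumption: double data rate (DDR) memory

bytes_per_transfer = bus_width_bits / 8
bandwidth_bytes = bytes_per_transfer * memory_clock_hz * transfers_per_clock
bandwidth_gb = bandwidth_bytes / 1e9   # in GByte/second (10^9 bytes)
```

Under this assumption the result matches the listed 35.2 GByte/second, with each transfer moving 32 bytes, the same size as the "relatively small" memory accesses mentioned in the quote in section 3.2.4.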

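4 Computational Principles

Stream Processing:

Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks when processing. GPUs are very sensitive to such bottlenecks, and therefore need a different architecture; they are essentially special purpose stream processors.

A stream processor is a processor that works with so-called streams and kernels. A stream is a set of data and a kernel is a small program. In stream processors, every kernel takes one or more streams as input and outputs one or more streams, while it executes its operations on every single element of the input streams.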
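The stream/kernel model can be mimicked in a few lines. In this sketch, streams are plain Python lists and `run_kernel` is a hypothetical helper that applies a kernel to corresponding elements of its input streams; a real stream processor would run these element-wise invocations in parallel rather than in a loop.

```python
def run_kernel(kernel, *streams):
    """Apply `kernel` to each tuple of corresponding stream elements."""
    return [kernel(*elements) for elements in zip(*streams)]

def scale_and_add(a, b):
    """Example kernel (hypothetical): works on one element of each input stream."""
    return 2.0 * a + b

def clamp01(x):
    """Second example kernel: clamp a value to the range [0, 1]."""
    return min(1.0, max(0.0, x))

stream_a = [0.1, 0.2, 0.6]
stream_b = [0.0, 0.1, 0.5]
intermediate = run_kernel(scale_and_add, stream_a, stream_b)
result = run_kernel(clamp01, intermediate)
```

Since no kernel invocation depends on the result of another element, the output of one kernel can feed the next as a new stream, which is how the stages of a graphics pipeline are chained together.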

In stream processors you can achieve several levels of parallelism:

• Instruction level parallelism: kernels perform hundreds of instructions on every stream element, you achieve parallelism by performing independent instructions in parallel

• Data level parallelism: kernels perform the same instructions on each stream element, you achieve parallelism by performing one instruction on many stream elements at a time

• Task level parallelism: Have multiple stream processors divide the work from one kernel

Stream processors do not use caching the same way traditional processors do since the input datasets are usually much larger than most caches and the data is barely reused - with GPUs for example the data is usually rendered and then discarded.

We know GPUs have to work with large amounts of data, the computations are simpler but they need to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.

Continuing these ideas, GPUs employ the following strategies to increase output:

Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform a single task. GPUs are pipelined, which means that instead of performing complete processing of a pixel before moving on to the next, you fill the pipeline like an assembly line where each component performs a task on the data before passing it on to the next stage. So while processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.
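A quick calculation shows the effect. The sketch below compares cycle counts under idealized assumptions that are not from the paper: every stage takes exactly one clock and the pipeline never stalls. Function names are hypothetical.

```python
def cycles_unpipelined(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages

def cycles_pipelined(items, stages):
    """The first item needs `stages` cycles; afterwards one item completes per cycle."""
    if items == 0:
        return 0
    return stages + items - 1

pixels, stages = 1000, 5
speedup = cycles_unpipelined(pixels, stages) / cycles_pipelined(pixels, stages)
```

For large numbers of pixels the speedup approaches the number of stages, because the cost of filling the pipe becomes negligible.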

Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism: you could have an unlimited amount of pipelines next to each other, as long as the CPU is able to keep them busy.

Other GPU characteristics:

• GPUs can afford large amounts of floating point computational power since they have lower control overhead

• They use dedicated functional units for specialized tasks to increase speeds

• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage, employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM cycles to minimize idle data-bus time, etc.

• Caches are designed to support effective streaming with local reuse of data. Rather than implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate with many misses in flight

• GPUs have many different performance regimes, all with different characteristics, and need to be designed accordingly

4.1 The Geforce 6800 as a general processor

You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory bandwidth that can be used for other applications as well.

Figure 20: A general view of the Geforce 6800 architecture

Looking at the GPU that way, we get:

• 2 serially running programmable blocks with fp32 capability.

• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data-"point" to multiple "fragments").

• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point values at a time.

• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the Z-cull unit

5 The next step: the Geforce 8800

After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the Geforce 8800 in 2006. Driven by the desire to increase performance, improve image quality and facilitate programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader architecture (note that ATI already used this architecture in 2005 with the XBOX 360 GPU).

Figure 21: From dedicated to unified architecture

Figure 22: A schematic view of the Geforce 8800

The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different shader stages become one single stage that can handle all the different shaders.

As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor array. We still have familiar units such as the raster operators (blue, at the bottom) and the triangle setup, [...]. Besides these units we now have several managing units that prepare and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread processor).

Figure 23: The streaming processor array

The streaming processor array consists of 8 texture processor clusters. Each texture processor cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.

A streaming multiprocessor has 8 streaming processors and 2 special function units. The streaming processors work with 32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making a vector architecture more inefficient. They are driven by a high-speed clock that is separate from the core clock [...]. Each multiprocessor can have 768 hardware scheduled threads, grouped together to 24 SIMD "warps" (a warp is a group of threads).

The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture prefetching and filtering without consuming ALU resources, further increasing efficiency.

The unified design has clear advantages. For example, the old problem of constantly changing workload and one shader stage becoming a processing bottleneck is solved since the units can adapt dynamically, now that they are unified.

Figure 24: Workload balancing with both architectures

With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support of new data types (integer calculations), programming the GPU now becomes easier as well.

6 General Purpose Programming on the GPU - an example

We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.

Bitonic merge sort:

Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.

E.g.:

List A = (3,4,7,8)

List B = (6,5,2,1)

List AB = (3,4,7,8,6,5,2,1) is a bitonic list

List BA = (6,5,2,1,3,4,7,8) is also a bitonic list

Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic list of length 2n. You can rearrange the list so that you get two halves with n elements where each element (i) of the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or equal, if the list descends first and then ascends), and each half is again a bitonic list. This happens by comparing the corresponding elements and switching them if necessary. This procedure is called a bitonic merge.

Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until the complete list is sorted. Figure 25 illustrates the process:

Figure 25: The different stages of the algorithm
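The description above corresponds to the following sequential sketch. It is a plain recursive CPU version for illustration only, with hypothetical function names, and it assumes the input length is a power of two. It is not the GPU implementation discussed next.

```python
def bitonic_merge(items, ascending=True):
    """Sort a bitonic list by compare-and-swap of elements i and i + n."""
    n = len(items)
    if n <= 1:
        return list(items)
    half = n // 2
    low, high = list(items[:half]), list(items[half:])
    for i in range(half):
        # Swap so that `low` holds the smaller elements when ascending
        # (or the larger ones when descending).
        if (low[i] > high[i]) == ascending:
            low[i], high[i] = high[i], low[i]
    # Both halves are bitonic again and can be merged independently.
    return bitonic_merge(low, ascending) + bitonic_merge(high, ascending)

def bitonic_sort(items, ascending=True):
    """Build a bitonic list from two oppositely sorted halves, then merge it."""
    n = len(items)
    if n <= 1:
        return list(items)
    half = n // 2
    first = bitonic_sort(items[:half], True)     # ascending half
    second = bitonic_sort(items[half:], False)   # descending half
    return bitonic_merge(first + second, ascending)

list_ab = [3, 4, 7, 8, 6, 5, 2, 1]               # the bitonic list AB from above
merged = bitonic_merge(list_ab)
```

All comparisons within one merge step are independent of each other, which is the property that makes the algorithm attractive for parallel hardware.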

The sorting process [...], which results in bitonic merge sort having a complexity of O(n log²(n) + log(n)), which is worse than quicksort, but the algorithm has no worst-case scenario (where quicksort reaches O(n²)). In addition, the operations can be performed in parallel and the length stays constant [...]. When implementing this algorithm on the GPU, we want to make use of as many resources as possible (both in parallel as well as vertically along the pipeline), especially considering that the GPU has shortcomings as well, such as the lack of support [...]. For example, simply letting the fragment processor stage handle all the calculations might work, [...]. A possible solution looks like this:

In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while adjacent groups operate in opposite directions. If we draw a vertex quad over two adjacent groups and set appropriate flags at each corner, we can determine group membership by using the rasterizer. For example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs to by simply looking at its sign (the interpolation process takes care of that).

Figure 26: Using vertex flags and the interpolator to determine compare operations

Next, we need to determine which compare operation to use and we need to locate the partner item to compare. Both can again be accomplished by using the flag value. Setting the compare operation to less-than and multiplying with the flag value implicitly flips the operation to greater-equal halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies the distance between the items.
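The flag trick can be emulated on the CPU to see how it works. The sketch below makes several assumptions of its own: the function names are hypothetical, each "fragment" is one list entry, and the left corners carry +1 and the right corners -1 (the mirror image of the example in the text) so that one sign can drive both the partner lookup and the flipped comparison. Every fragment only reads its own key and its partner's key and writes one output.

```python
def interpolated_flag(i, width, left=1.0, right=-1.0):
    """Linearly interpolate the corner flags at the center of pixel i."""
    t = (i + 0.5) / width
    return left + (right - left) * t

def merge_pass(keys, distance):
    """One compare pass: each fragment keeps either its own or its partner's key."""
    out = []
    group = 2 * distance                      # one quad spans two adjacent groups
    for i, own in enumerate(keys):
        flag = interpolated_flag(i % group, group)
        sign = 1.0 if flag > 0.0 else -1.0    # which half of the quad we are in
        partner = keys[i + int(sign) * distance]
        # less-than on the left half; multiplying by -1 flips it on the right half
        out.append(own if sign * own < sign * partner else partner)
    return out

def bitonic_merge_passes(keys):
    """Merge a bitonic list with passes of halving partner distance."""
    distance = len(keys) // 2
    while distance >= 1:
        keys = merge_pass(keys, distance)
        distance //= 2
    return keys

first_pass = merge_pass([3, 4, 7, 8, 6, 5, 2, 1], 4)
```

Because the interpolated flag is evaluated at pixel centers it never becomes exactly zero, so its sign is always well defined.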

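In order to sort elements of an array, we store them in a 2D texture.

Each row is a set of elements and becomes its own bitonic sequence that needs to be sorted. If we extend the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so the rows get sorted either up or down according to their row number. This way, pairs of rows become bitonic sequences again which can be sorted in the same way we sorted the columns of the single rows, simply by transposing the quads.

As a final optimization we reduce texture fetching by packing two neighbouring key pairs into one fragment, since the shader operates on 4-vectors.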

Performance comparison:

std::sort: 16-bit data, Pentium 4 3.0 GHz

N        Full Sorts/Sec   Sorted Keys/Sec
256²     82.5             5.4 M
512²     20.6             5.4 M
1024²    4.7              5.0 M

Bitonic merge sort: 16-bit float data, NVIDIA Geforce 6800 Ultra

N        Passes   Full Sorts/Sec   Sorted Keys/Sec
256²     120      90.07            6.1 M
512²     153      18.3             4.8 M
1024²    190      3.6              3.8 M

GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise sorting of the bitonic merge sort:

uniform sampler2D PackedData;

// contents of the texcoord data
#define OwnPos gl_TexCoord[0].xy
#define SearchDir gl_TexCoord[0].z
#define CompOp gl_TexCoord[0].w
#define Distance gl_TexCoord[1].x
#define Stride gl_TexCoord[1].y
#define Height gl_TexCoord[1].z
#define HalfStrideMHalf gl_TexCoord[1].w

void main(void)
{
    // get self
    vec4 self = texture2D(PackedData, OwnPos);

    // restore sign of search direction and assemble vector to partner
    vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);

    // get the partner
    vec4 partner = texture2D(PackedData, OwnPos + adr);

    // switch ascending/descending sort for every other row
    // by modifying comparison flag
    float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);

    // x and y are the keys of the two items
    // --> multiply with comparison flag
    vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);

    // compare the keys and store accordingly
    // z and w are the indices
    // --> just copy them accordingly
    vec4 result;
    result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
    result.yw = (keys.y < keys.w) ? self.yw : partner.yw;

    // do pass 0
    compare *= adr.x;
    gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
}

7 Current and future developments

Nvidia's current top of the line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader count (from 128 to 240) [...]. Its launch date was the 17th of June 2008.

ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon HD 2900 XT), [...]. Their current top of the line model is the Radeon HD 3870 X2 (which actually sports 2 GPUs on one card), which was released in January 2008.

ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for somewhere around July 2008).

With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU has gained more attention. By now, GPUs have made their way into non-graphics fields as varied as audio signal processing and weather forecasting.

Both ATI/AMD and Nvidia have made efforts to open their hardware to general-purpose programming. ATI/AMD released CTM (Close To Metal), [...]. CTM's commercial successor, the AMD Stream SDK, was released in December 2007, now providing additional high level tools for general-purpose access to AMD graphics hardware.

Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, a C language [...]. In February 2008, Nvidia bought Ageia and their PhysX engine (a proprietary realtime physics engine middleware SDK) and integrated it into their CUDA framework.

GPUs actually scale well beyond Moore's law [...]. With such a rapid development we can certainly expect to see quite some interesting things to come in this field of processing.

References

[1] Wikipedia entry on GPUs
/wiki/GPU

[2] Kees Huizing, Han-Wei Shen: "The Graphics Rendering Pipeline"
keesh/ow/2IV40/

[3] Cyril Zeller: "Introduction to the Hardware Graphics Pipeline", Presentation at ACM SIGGRAPH 2005
/developer/presentations/2005/I3D/I3D_05_

[4] ExtremeTech 3D Pipeline Tutorial
/article2/0,1697,9722,

[5] Ashu Rege: "Introduction to 3D Graphics for Games"
/docs/IO/11278/

[6] DirectX Developer Center: "The Direct3D Transformation Pipeline"
/en-us/library/bb206260(VS.85).aspx

[7] Mark Colbert: "GPU Architecture & CG"
/gpuseminar/

[8] GPU Gems 2, Chapter 30: "The GeForce 6 Series GPU Architecture"
/developer/GPU_Gems_2/GPU_Gems2_

[9] IEEE Micro, Volume 25, Issue 2 (March 2005): "The GeForce 6800"
/?id=1069760

[10] : "NV40-Technik im Detail"
/artikel/nv40_pipeline/

[11] : "NVIDIA GeForce 6800 Ultra (NV40)"
/articles2/gffx/

[12] Austin Robison, Abe Winter: "An Overview of Graphics Processing Hardware"
robison/src/gpu_

[13] John Montrym, Henry Moreton: "NVIDIA GeForce 6800", Hot Chips 16
/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_

[14] Ajit Datar, Apurva Padhye: "Graphics Processing Unit Architecture"
data0003/Talks/

[15] Sven Schenk: "Eine Einfuehrung in die Architektur moderner Graphikprozessoren"
/Lehre/Seminar0506/

[16] Thomas Scott Crow: "Evolution of the Graphical Processing Unit"
fredh/papers/thesis/023-crow/

[17] DirectX Developer Center: "Asm Shader Reference"
/en-us/library/bb219840(VS.85).aspx

[18] Erik Lindholm, Stuart Oberman: "NVIDIA GeForce 8800 GPU"
/archives/hc19/2_Mon/HC19.02/

[19] : "Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)"
/articles2/video/

[20] Richard Hough, Richard Yu: "GPU Architecture"
/courses/ece685/slides/

[21] Technical Brief: "NVIDIA GeForce 8800 GPU Architecture Overview"
/object/IO_

[22] GPU Gems 2, Chapter 46: "Improved GPU Sorting"

[23] Tim Purcell: "Sorting and Searching", SIGGRAPH 2005 GPGPU COURSE
/s2005/slides/

[24] Peter Kipfer, Mark Segal, Ruediger Westermann: "UberFlow: A GPU-Based Particle Engine"
/previous/www_2004/Presentations/

[25] Wikipedia entry on Nvidia
/wiki/Nvidia_Corporation

[26] Wikipedia entry on ATI
/wiki/ATI_Technologies_Inc.

[27] Wikipedia entry on CUDA
/wiki/CUDA

[28] Wikipedia entry on CTM
/wiki/Close_to_Metal

[29] William Mark, Henry Moreton: "3D Graphics Architecture Tutorial"
/users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_

2024年7月22日发(作者：宜方仪)

GPUs-GraphicsProcessingUnits

MinhTriDoDinh

-Dinh@

VertiefungsseminarArchitekturvonProzessoren,SS2008

InstituteofComputerScience,UniversityofInnsbruck

July7,2008

Thisores

theirarchitectureandunderlyingdesignprinciples,usingchipsfromNvidia’s”Geforce”seriesas

examples.

1Introduction

BeforewediveintothearchitecturaldetailsofsomeexampleGPUs,we’llhavealookatsomebasicconcepts

ofgraphicsprocessingand3Dgraphics,whichwillmakeiteasierforustounderstandthefunctionalityof

GPUs

1.1WhatisaGPU?

AGPU(GraphicsProcessingUnit)isessentiallyadedicatedhardwaredevicethatisresponsiblefortrans-

paper,wewillfocusonthe3Dgraphics,sincethatis

whatmodernGPUsaremainlydesignedfor.

1.2Theanatomyofa3Dscene

Figure1:A3Dscene

3Dscene:Acollectionof3Dobjectsandlights.

Figure2:Object,triangleandvertices

3Dobjects:Arbitraryobjects,nsarecomposedof

vertices.

Vertex:APointwithspatialcoordinatesandotherinformationsuchascolorandtexturecoordinates.

Figure3:Acubewithacheckerboardtexture

Texture:Animagethatismappedontothesurfaceofa3Dobject,whichcreatestheillusionofanobject

ticesofanobjectstoretheso-calledtexturecoordinates

(2-dimensionalvectors)thatspecifyhowatextureismappedontoanygivensurface.

Figure4:Texturecoordinatesofatrianglewithabricktexture

Inordertotranslatesucha3Dscenetoa2Dimage,thedatahastogothroughseveralstagesofa”Graphics

Pipeline”

1.3TheGraphicsPipeline

Figure5:The3DGraphicsPipeline

First,amongsomeotheroperations,wehavetotranslatethedatathatisprovidedbytheapplicationfrom

3Dto2D.

1.3.1GeometryStage

Thisstageisalsoreferredtoasthe”TransformandLighting”rtotranslatethescenefrom

3Dto2D,alltheobjectsofasceneneedtobetransformedtovariousspaces-eachwithitsowncoordinate

ransformationsareappliedona

vertex-to-vertexbasis.

MathematicalPrinciples

Apointin3Dspaceusuallyhas3coordinates,epusing3-dimensional

vectorsforthetransformationcalculations,werunintotheproblemthatdiﬀerenttransformationsrequire

diﬀ:translatingavertexrequiresadditionwithavectorwhilerotatingavertexrequires

multiplicationwitha3x3matrix).Wecircumventthisproblemsimplybyextendingthe3-dimensionalvector

byanothercoordinate(thew-coordinate),y,

everytransformationcanbeappliedbymultiplyingthevectorwithaspeciﬁc4x4matrix,makingcalculations

mucheasier.

Figure6:Transformationmatricesfortranslation,rotationandscaling

Lighting,theothermajorpartofthispipelinestageiscalculatedusingthenormalvectorsofthesurfaces

inationwiththepositionofthecameraandthepositionofthelightsource,onecan

computethelightingpropertiesofagivenvertex.

Figure7:Calculatinglighting

Fortransformation,westartoutinthemodelspacewhereeachobject(model)hasitsowncoordinate

system,whichfacilitatesgeometrictransformationssuchastranslation,rotationandscaling.

Afterthat,wemoveontotheworldspace,whereallobjectswithinthescenehaveauniﬁedcoordinate

system.

Figure8:Worldspacecoordinates

Thenextstepisthetransformationintoviewspace,whichlocatesacameraintheworldspaceandthen

transformsthescene,suchthatthecameraisattheoriginoftheviewspace,lookingstraightintothe

andeﬁneaviewvolume,theso-calledviewfrustrum,whichwillbeusedto

decidewhatactuallyisvisibleandneedstoberendered.

Figure9:Thecamera/eye,theviewfrustrumanditsclippingplanes

Afterthat,theverticesaretransformedintoclipspaceandassembledintoprimitives(trianglesorlines),

bjectsthatareoutsideofthefrustrumdon’tneedto

berenderedandcanbediscarded,objectsthatarepartiallyinsidethefrustrumneedtobeclipped(hence

thename),andnewverticeswithpropertextureandcolorcoordinatesneedtobecreated.

Aperspectivedivideisthenperformed,whichtransformsthefrustrumintoacubewithnormalized

coordinates(xandybetween-1and1,zbetween0and1)whiletheobjectsinsidethefrustrumarescaled

thisnormalizedcubefacilitatesclippingoperationsandsetsuptheprojectioninto2D

space(thecubesimplyneedstobe”ﬂattened”).

Figure10:Transformingintoclipspace

Finally, we can move into screen space where x and y coordinates are transformed for proper 2D display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth operations.)

Figure 11: From view space to screen space
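The last two steps, the perspective divide and the mapping into a window, can be sketched as follows. The window size and the choice of a top-left screen origin (y pointing down) are assumptions for this example; the function names are illustrative.

```python
def perspective_divide(clip):
    """Divide a clip-space vertex (x, y, z, w) by w to get normalized coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)


def to_screen(ndc, width, height):
    """Map normalized x, y in [-1, 1] to pixel coordinates; z is kept for depth tests."""
    x, y, z = ndc
    screen_x = (x + 1.0) * 0.5 * width
    screen_y = (1.0 - y) * 0.5 * height  # flip y: assumed top-left origin
    return (screen_x, screen_y, z)


ndc = perspective_divide((2.0, -2.0, 1.0, 4.0))
pixel = to_screen(ndc, 800, 600)
```

Note that z passes through unchanged, which matches the remark above that the depth value is retained for later operations.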

Note that the texture coordinates need to be transformed as well. Additionally, besides clipping, surfaces that aren't visible (e.g. the back side of a cube) are removed as well (so-called back face culling).

The result is a 2D image of the 3D scene, and we can move on to the next stage.

1.3.2 Rasterization Stage

The rasterizer needs to traverse the 2D image and convert the data into a number of "pixel-candidates", so-called fragments, which may later become pixels of the final image. A fragment is a data structure that contains attributes such as position, color, depth, texture coordinates, etc. Fragments are generated by checking which part of any given primitive intersects with which pixel of the screen. If a fragment intersects with a primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated by interpolating the attributes between the vertices.

Figure 12: Rasterizing a triangle and interpolating its color values
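One common way to interpolate attributes across a triangle is with barycentric weights. The sketch below uses that technique as an assumption; the text does not say which interpolation scheme the hardware uses, and perspective correction is left out for brevity.

```python
def edge(a, b, p):
    """Signed area term: cross product of (b - a) and (p - a)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(a, b, c, p):
    """Weights of vertices a, b, c at point p; they sum to 1."""
    area = edge(a, b, c)
    return (edge(b, c, p) / area, edge(c, a, p) / area, edge(a, b, p) / area)


def interpolate_color(triangle, colors, p):
    """Return the blended RGB color at p, or None if p lies outside the triangle."""
    weights = barycentric(*triangle, p)
    if any(w < 0 for w in weights):
        return None  # this pixel produces no fragment for the triangle
    return tuple(sum(w * color[i] for w, color in zip(weights, colors))
                 for i in range(3))


triangle = [(0, 0), (4, 0), (0, 4)]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
inside = interpolate_color(triangle, colors, (1, 1))
outside = interpolate_color(triangle, colors, (5, 5))
```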

After that, further steps can be made to obtain the final pixels. Colors are calculated by combining textures with other attributes such as color and lighting, or by combining a fragment with either another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).

Visibility checks are performed, such as:

• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (checking visibility against translucent fragments)
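Two of the checks listed above are simple enough to sketch directly. The buffer layout (a list of rows) and the "smaller z is closer" convention are assumptions for this example.

```python
def scissor_test(x, y, rect):
    """Pass if the fragment lies inside the rectangle (x0, y0, x1, y1), x1/y1 exclusive."""
    x0, y0, x1, y1 = rect
    return x0 <= x < x1 and y0 <= y < y1


def depth_test(depth_buffer, x, y, z):
    """Pass and record z if the fragment is closer than what is already stored."""
    if z < depth_buffer[y][x]:
        depth_buffer[y][x] = z
        return True
    return False


# A 2x2 depth buffer, initialised to the far plane (z = 1.0).
depth_buffer = [[1.0, 1.0], [1.0, 1.0]]
first = depth_test(depth_buffer, 0, 0, 0.5)   # closer than the far plane
hidden = depth_test(depth_buffer, 0, 0, 0.8)  # behind the first fragment
```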

Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of pixels that can be written into memory for later display.

This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what kind of functionality will be required of a GPU.

2 Evolution of the GPU

Some historical key points in the development of the GPU:

• Efforts for real time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphic chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985 The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991 S3 introduced the first single chip 2D-accelerator, the S3 86C911
• 1995 Nvidia releases one of the first 3D accelerators, the NV1
• 1999 Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001 Nvidia implements the first programmable shader units with the Geforce 3
• 2005 ATI develops the first GPU with unified shader architecture with the ATI Xenos for the XBox 360
• 2006 Nvidia launches the first unified shader GPU for the PC with the Geforce 8800

3 From Theory to Practice - the Geforce 6800

3.1 Overview

Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using Nvidia's Geforce 6800 as an example, we will have a closer look at the architecture of modern day GPUs.

Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs (besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.

Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics chipsets and the fourth generation that featured programmability (more on that later).

The following image shows a schematic view of the Geforce 6800 and its functional units.

Figure 13: Schematic view of the Geforce 6800

You can already see how each of the functional units corresponds to a stage of the graphics pipeline.

We start with six parallel vertex processors that receive data from the host (the CPU) and perform operations such as transformation and lighting.

Next, the output goes into the triangle setup stage which takes care of primitive assembly, culling and clipping, [...]. The Geforce 6800 has an additional Z-cull unit which allows an early fragment visibility check based on depth, further improving the efficiency.

We then move on to the sixteen fragment processors, which operate in 4 parallel units and compute the output colors of each fragment.

The fragment crossbar is a linking element that is basically responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.

The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending, depth tests, etc., before delivering the final pixel to the frame buffer.

3.2 In Detail

Figure 14: A more detailed view of the Geforce 6800

While most parts of the GPU are fixed function units, the vertex and fragment processors of the Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001). We'll have a more detailed look at the units in the following sections.

3.2.1 Vertex Processor

Figure 15: A vertex processor

The vertex processors are the programmable units responsible for all the vertex transformations and attribute calculations. They operate with 4-dimensional data vectors corresponding to the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the Instruction RAM.

The data path of a vertex processor consists of:

• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit

Instruction set:

Some notable instructions for the vertex processor include:

dp4 dst, src0, src1     Computes the four-component dot product of the source registers
exp dst, src            Provides exponential 2^x
dst dest, src0, src1    Calculates a distance vector
nrm dst, src            Normalizes a 3D vector
rsq dst, src            Computes the reciprocal square root (positive only) of the source scalar

Registers in the vertex processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components
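The effect of these instructions and register modifiers can be emulated in a few lines. This is only a behavioural sketch in Python with made-up helper names; it does not reflect how the hardware encodes or executes them.

```python
COMPONENTS = "xyzw"


def dp4(a, b):
    """Four-component dot product, like the dp4 instruction."""
    return sum(x * y for x, y in zip(a, b))


def swizzle(register, pattern):
    """Rearrange or replicate components, e.g. pattern 'wzyx' or 'xxyy'."""
    return [register[COMPONENTS.index(c)] for c in pattern]


def negate(register):
    """Source modifier: flip the sign of every component."""
    return [-c for c in register]


def write_masked(dst, value, mask):
    """Write only the components named in mask; the others keep their old value."""
    return [v if name in mask else d
            for name, d, v in zip(COMPONENTS, dst, value)]


r0 = [1.0, 2.0, 3.0, 4.0]
reversed_r0 = swizzle(r0, "wzyx")
masked = write_masked([0.0, 0.0, 0.0, 0.0], negate(r0), "xz")
```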

Other technical details:

• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one cycle

3.2.2 Fragment Processor

Figure 16: A fragment processor

The 16 fragment processors are grouped into 4 bigger units which operate simultaneously on 4 fragments each (a so-called quad). They can take position, color, depth, fog as well as other arbitrary 4-dimensional attributes as input.

The data path consists of:

• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit

Superscalarity:

A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of the vector need to be treated separately (e.g. color and alpha). Thus, the fragment processor supports co-issuing of the data, which means splitting the vector into 2 parts and executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier).

Additionally, it also features dual issue, which means executing different operations on the 2 vector math units in the same clock.

Texture Unit:

The texture unit is a floating-point texture processor which fetches and filters the texture data. It is connected to a level 1 texture cache (which stores parts of the textures that are used).

Shader units 1 and 2:

Each shader unit is limited in its abilities on its own; together they offer complete functionality.

Figure 17: Block diagram of Shader Unit 1 and 2

Shader Unit 1:

Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
Red: Interpolators
Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
Cyan: MUL channels
Orange: A unit for texture operations (not the fragment texture unit)

The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs.

The output of the special function unit can go into the MUL channels.

The texture unit gets input from the MUL unit and does LOD (Level Of Detail) calculations, before passing the data to the fragment texture unit. The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit.

The shader unit can simply pass data as well.

Shader Unit 2:

Red: A crossbar
Cyan: 4 MUL channels
Gray: 4 ADD channels
Yellow: 1 special function unit

The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).

The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.

Again, the shader unit [...] special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

Instruction set:

Some notable instructions for the fragment processor include:

cmp dst, src0, src1, src2               Choose src1 if src0 >= 0. Otherwise, choose src2. The comparison is done per channel
dsx dst, src                            Compute the rate of change in the render target's x-direction
dsy dst, src                            Compute the rate of change in the render target's y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}     Computes sine and cosine, in radians
texld dst, src0, src1                   Sample a texture at a particular sampler, using provided texture coordinates

Registers in the fragment processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Mask destination register components

Other technical details:

• The fragment processors can perform operations with 16 or 32 bit floating point precision ([...] unit uses only 16 bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle, or 4 math operations if there's a texture fetch in shader 1

Figure 18: Possible operations per cycle

• The fragment processors have a 2 level texture cache
• The fog unit can perform fog blending on the final [...]. It is implemented with fixed point precision since that's sufficient for fog and saves performance.
The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets; the pixel processor can output to up to four separate buffers (4x4 values, color + depth)
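The fog equation given in the list above is a plain linear blend per color channel. A minimal Python sketch follows; it uses floating point values purely for readability, whereas the text states that the hardware unit works in fixed point.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(fog * fog_fraction + src * (1.0 - fog_fraction)
                 for src, fog in zip(src_color, fog_color))


red = (1.0, 0.0, 0.0)
grey_fog = (0.5, 0.5, 0.5)

no_fog = fog_blend(red, grey_fog, 0.0)    # fragment color unchanged
half_fog = fog_blend(red, grey_fog, 0.5)  # halfway between both colors
full_fog = fog_blend(red, grey_fog, 1.0)  # fragment fully hidden by fog
```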

3.2.3 Pixel Engine

Figure 19: A pixel engine

Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific memory partition. [...] the lossless color and depth compression, the depth and color units perform depth, color and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.

3.2.4 Memory

From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":

"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."

3.3 Performance

• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors; the power supply is recommended to be no less than 480 W)
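The memory bandwidth figure in the list above can be reproduced from the memory clock and the 256-bit bus width mentioned in the memory section. The factor of two transfers per clock is an assumption (double data rate memory) that the text does not state explicitly.

```python
memory_clock_hz = 550e6    # 550 MHz memory clock (from the list above)
bus_width_bits = 256       # four partitions forming a 256-bit interface
transfers_per_clock = 2    # assumption: double data rate memory

bytes_per_transfer = bus_width_bits / 8
bandwidth_bytes_per_second = memory_clock_hz * transfers_per_clock * bytes_per_transfer
bandwidth_gb_per_second = bandwidth_bytes_per_second / 1e9
```

Under that assumption the result is 35.2 GB/s, which agrees with the listed value.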

4 Computational Principles

Stream Processing:

Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks [...]. GPUs are very sensitive to such bottlenecks, and therefore need a different architecture; they are essentially special purpose stream processors.

A stream is a set of data [...]. In stream processors, every kernel takes one or more streams as input and outputs one or more streams, while it executes its operations on every single element of the input streams.

In stream processors you can achieve several levels of parallelism:

• Instruction level parallelism: kernels perform hundreds of instructions on every stream element; you achieve parallelism by performing independent instructions in parallel
• Data level parallelism: kernels perform the same instructions on each stream element; you achieve parallelism by performing one instruction on many stream elements at a time
• Task level parallelism: have multiple stream processors divide the work from one kernel
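The kernel and stream model described above can be sketched in a few lines. The function `run_kernel` and the two example kernels are hypothetical names for this illustration; Python evaluates the elements one after another, whereas a stream processor is free to process them in parallel because no element depends on another.

```python
def run_kernel(kernel, *streams):
    """Apply a kernel to corresponding elements of the input streams."""
    return [kernel(*elements) for elements in zip(*streams)]


def scale_and_offset(value, offset):
    """Example kernel: the same instructions run on every stream element."""
    return value * 2.0 + offset


def clamp(value):
    """Second kernel, fed by the output stream of the first one."""
    return min(max(value, 0.0), 10.0)


values = [1.0, 2.0, 3.0, 8.0]
offsets = [0.5, 0.5, 0.5, 0.5]

intermediate = run_kernel(scale_and_offset, values, offsets)
result = run_kernel(clamp, intermediate)
```

Chaining kernels this way, with the output stream of one kernel feeding the next, mirrors how data moves from stage to stage in the graphics pipeline.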

Stream processors do not use caching the same way traditional processors do, since the input datasets are usually much larger than most caches and the data is barely reused. With GPUs, for example, the data is usually rendered and then discarded.

We know GPUs have to work with large amounts of data; the computations are simpler but they need to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.

Continuing these ideas, GPUs employ the following strategies to increase output:

Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform a single task. GPUs are pipelined, which means that instead of performing complete processing of a pixel before moving on to the next, you fill the pipeline like an assembly line where each component performs a task on the data before passing it on to the next stage. So while processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.
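The throughput argument can be checked with a little arithmetic. The sketch assumes an idealized pipeline without stalls in which every stage takes exactly one clock; the function names and the stage count are illustrative.

```python
def unpipelined_cycles(num_pixels, num_stages):
    """Each pixel passes through all stages before the next one starts."""
    return num_pixels * num_stages


def pipelined_cycles(num_pixels, num_stages):
    """The first pixel needs num_stages clocks, then one pixel finishes per clock."""
    if num_pixels == 0:
        return 0
    return num_stages + num_pixels - 1


pixels, stages = 1000, 5
slow = unpipelined_cycles(pixels, stages)
fast = pipelined_cycles(pixels, stages)
average_clocks_per_pixel = fast / pixels
```

For long pixel streams the fill time of the pipeline becomes negligible and the average cost approaches one clock per pixel.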

Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism. You could have an unlimited amount of pipelines next to each other, as long as the CPU is able to keep them busy.

Other GPU characteristics:

• GPUs can afford large amounts of floating point computational power since they have lower control overhead
• They use dedicated functional units for specialized tasks to increase speeds
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage, employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data; rather than implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be designed accordingly

4.1 The Geforce 6800 as a general processor

You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory bandwidth that can be used for other applications as well.

Figure 20: A general view of the Geforce 6800 architecture

Looking at the GPU that way, we get:

• 2 serially running programmable blocks with fp32 capability.
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data "point" to multiple "fragments").
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point values at a time.
• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the Z-cull unit

5 The next step: the Geforce 8800

After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the Geforce 8800 in 2006. Driven by the desire to increase performance, improve image quality and facilitate programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader architecture (note that ATI already used this architecture in 2005 with the XBOX 360 GPU).

Figure 21: From dedicated to unified architecture

Figure 22: A schematic view of the Geforce 8800

The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different shader stages become one single stage that can handle all the different shaders.

As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor array. We still have familiar units such as the raster operators (blue, at the bottom) and the triangle setup, [...]. Besides these units we now have several managing units that prepare and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread processor).

Figure 23: The streaming processor array

The streaming processor array consists of [...] texture processor clusters. Each texture processor cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.

A streaming multiprocessor consists of [...] streaming processors. The streaming processors work with 32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making a vector architecture more inefficient. They are driven by a high-speed clock that is separate from [...]. Each multiprocessor can have 768 hardware scheduled threads, grouped together into 24 SIMD "warps" (a warp is a group of threads).

The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture prefetching and filtering without consuming ALU resources, further increasing efficiency.

[...] For example, the old problem of constantly changing workload and one shader stage becoming a processing bottleneck is solved since the units can adapt dynamically, now that they are unified.

Figure 24: Workload balancing with both architectures

With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support of new data types (integer calculations), programming the GPU now becomes easier as well.

6 General Purpose Programming on the GPU - an example

We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.

Bitonic merge sort:

Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.

E.g.:

List A = (3,4,7,8)
List B = (6,5,2,1)
List AB = (3,4,7,8,6,5,2,1) is a bitonic list
List BA = (6,5,2,1,3,4,7,8) is also a bitonic list

Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic list of length 2n. You can rearrange the list so that you get two halves with n elements where each element (i) of the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or equal, if the list descends first and then ascends). This happens by comparing the corresponding elements of the two halves and swapping them where necessary. This procedure is called a bitonic merge.

Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until the whole list is sorted. Figure 25 illustrates the process:

Figure 25: The different stages of the algorithm
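Before turning to the GPU mapping, it may help to see the algorithm itself in sequential form. The following Python sketch is a plain recursive CPU version, not the GPU implementation discussed in this section; it assumes the input length is a power of two, and the function names are illustrative.

```python
def bitonic_merge(items, ascending):
    """Sort a bitonic list by compare-and-swap between the two halves, recursively."""
    n = len(items)
    if n <= 1:
        return list(items)
    items = list(items)
    half = n // 2
    for i in range(half):
        # Compare element i with its partner i + half and swap if out of order.
        if (items[i] > items[i + half]) == ascending:
            items[i], items[i + half] = items[i + half], items[i]
    return (bitonic_merge(items[:half], ascending)
            + bitonic_merge(items[half:], ascending))


def bitonic_sort(items, ascending=True):
    """Build a bitonic list from two oppositely sorted halves, then merge it."""
    n = len(items)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    if n <= 1:
        return list(items)
    half = n // 2
    first = bitonic_sort(items[:half], True)    # ascending half
    second = bitonic_sort(items[half:], False)  # descending half
    return bitonic_merge(first + second, ascending)


merged = bitonic_merge([3, 4, 7, 8, 6, 5, 2, 1], True)
sorted_list = bitonic_sort([5, 1, 8, 3, 7, 2, 6, 4])
```

All compare-and-swap operations inside one loop are independent of each other, which is the property that makes the algorithm attractive for parallel hardware.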

The sorting process [...]. This results in bitonic merge sort having a complexity of O(n log^2(n) + log(n)), which is worse than quicksort, but the algorithm has no worst-case scenario (where quicksort reaches O(n^2)). [...] the operations can be performed in parallel and the length stays constant, [...]. When implementing this algorithm on the GPU, we want to make use of as many resources as possible (both in parallel as well as vertically along the pipeline), especially considering that the GPU has shortcomings as well, such as the lack of support [...]. For example, simply letting the fragment processor stage handle all the calculations might work, [...]. A possible solution looks like this:

In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while [...]. We draw a vertex quad over two adjacent groups and set appropriate flags at each corner, [...]. For example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs by simply looking at its sign (the interpolation process takes care of that).

Figure 26: Using vertex flags and the interpolator to determine compare operations

Next, we need to determine which compare operation to use and we need to locate the partner item to compare. This can again be accomplished by using the flag. Setting the compare operation to less-than and multiplying with the flag value implicitly flips the operation to greater-equal halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies the distance between the items.

In order to sort elements of an array, we store them in a 2D texture. Each row is a [...]. If we extend the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so the [...]. Finally, pairs of rows become bitonic sequences again, which can be sorted in the same way we sorted the columns of the single rows, simply by transposing the quads.

As a final optimization we reduce texture fetching by packing two neighbouring key pairs into one fragment, since the shader operates on 4-vectors.

Performance comparison:

std::sort: 16-Bit Data, Pentium 4 3.0 GHz

N         Full Sorts/Sec   Sorted Keys/Sec
256^2     82.5             5.4 M
512^2     20.6             5.4 M
1024^2    4.7              5.0 M

Bitonic Merge Sort: 16-Bit Float Data, NVIDIA Geforce 6800 Ultra

N         Passes   Full Sorts/Sec   Sorted Keys/Sec
256^2     120      90.07            6.1 M
512^2     153      18.3             4.8 M
1024^2    190      3.6              3.8 M

GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise sorting of the bitonic merge sort:

    uniform sampler2D PackedData;

    // contents of the texcoord data
    #define OwnPos gl_TexCoord[0].xy
    #define SearchDir gl_TexCoord[0].z
    #define CompOp gl_TexCoord[0].w
    #define Distance gl_TexCoord[1].x
    #define Stride gl_TexCoord[1].y
    #define Height gl_TexCoord[1].z
    #define HalfStrideMHalf gl_TexCoord[1].w

    void main(void)
    {
        // get self
        vec4 self = texture2D(PackedData, OwnPos);

        // restore sign of search direction and assemble vector to partner
        vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);

        // get the partner
        vec4 partner = texture2D(PackedData, OwnPos + adr);

        // switch ascending/descending sort for every other row
        // by modifying comparison flag
        float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);

        // x and y are the keys of the two items
        // --> multiply with comparison flag
        vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);

        // compare the keys and store accordingly
        // z and w are the indices
        // --> just copy them accordingly
        vec4 result;
        result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
        result.yw = (keys.y < keys.w) ? self.yw : partner.yw;

        // do pass 0
        compare *= adr.x;
        gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
    }

7 Current and future developments

Nvidia's current top of the line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader count (from 128 to 240) [...]. Its launch date was the 17th of June 2008.

ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon HD 2900 XT), [...]. Their current top of the line model is the Radeon HD 3870 X2 (which actually sports 2 GPUs on one card), which was released in January 2008.

ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for somewhere around July 2008).

With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU has [...]. GPUs have made their way into non-graphics fields as varied as audio signal processing and weather forecasting.

Both ATI/AMD and Nvidia have made efforts [...] released CTM (Close To Metal), [...]. After rewriting the software, CTM's commercial successor AMD Stream SDK was released in December 2007, now providing additional high level tools for general-purpose access to AMD graphics hardware.

Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, a C [...]. In February 2008, Nvidia bought Ageia and their PhysX engine (a proprietary realtime physics engine middleware SDK) and integrated it into their CUDA framework.

[...] eventually scale well beyond Moore's law, [...]. With such a rapid development we can certainly expect to see quite some interesting things to come in this field of processing.

References

[1] Wikipedia entry on GPUs
/wiki/GPU
[2] Kees Huizing, Han-Wei Shen: "The Graphics Rendering Pipeline"
keesh/ow/2IV40/
[3] Cyril Zeller: "Introduction to the Hardware Graphics Pipeline", Presentation at ACM SIGGRAPH 2005
/developer/presentations/2005/I3D/I3D_05_
[4] ExtremeTech 3D Pipeline Tutorial
/article2/0,1697,9722,
[5] Ashu Rege: "Introduction to 3D Graphics for Games"
/docs/IO/11278/
[6] DirectX Developer Center: "The Direct3D Transformation Pipeline"
/en-us/library/bb206260(VS.85).aspx
[7] Mark Colbert: "GPU Architecture & CG"
/gpuseminar/
[8] GPU Gems 2, Chapter 30: "The GeForce 6 Series GPU Architecture"
/developer/GPU_Gems_2/GPU_Gems2_
[9] IEEE Micro, Volume 25, Issue 2 (March 2005): "The GeForce 6800"
/?id=1069760
[10] [...]: "NV40-Technik im Detail"
/artikel/nv40_pipeline/
[11] [...]: "NVIDIA GeForce 6800 Ultra (NV40)"
/articles2/gffx/
[12] Austin Robison, Abe Winter: "An Overview of Graphics Processing Hardware"
robison/src/gpu_
[13] John Montrym, Henry Moreton: "NVIDIA GeForce 6800", Hot Chips 16
/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_
[14] Ajit Datar, Apurva Padhye: "Graphics Processing Unit Architecture"
data0003/Talks/
[15] Sven Schenk: "Eine Einfuehrung in die Architektur moderner Graphikprozessoren"
/Lehre/Seminar0506/
[16] Thomas Scott Crow: "Evolution of the Graphical Processing Unit"
fredh/papers/thesis/023-crow/
[17] DirectX Developer Center: "Asm Shader Reference"
/en-us/library/bb219840(VS.85).aspx
[18] Erik Lindholm, Stuart Oberman: "NVIDIA GeForce 8800 GPU"
/archives/hc19/2_Mon/HC19.02/
[19] [...]: "Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)"
/articles2/video/
[20] Richard Hough, Richard Yu: "GPU Architecture"
/courses/ece685/slides/
[21] Technical Brief: "NVIDIA GeForce 8800 GPU Architecture Overview"
/object/IO_
[22] GPU Gems 2, Chapter 46: "Improved GPU Sorting"
[23] Tim Purcell: "Sorting and Searching", SIGGRAPH 2005 GPGPU COURSE
/s2005/slides/
[24] Peter Kipfer, Mark Segal, Ruediger Westermann: "UberFlow: A GPU-Based Particle Engine"
/previous/www_2004/Presentations/
[25] Wikipedia entry on Nvidia
/wiki/Nvidia_Corporation
[26] Wikipedia entry on ATI
/wiki/ATI_Technologies_Inc.
[27] Wikipedia entry on CUDA
/wiki/CUDA
[28] Wikipedia entry on CTM
/wiki/Close_to_Metal
[29] William Mark, Henry Moreton: "3D Graphics Architecture Tutorial"
/users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_
