
GPU Architecture and the Graphics Pipeline


Posted on July 22, 2024 (author: 宜方仪)

GPUs - Graphics Processing Units

Minh Tri Do Dinh

-Dinh@

Vertiefungsseminar Architektur von Prozessoren, SS 2008

Institute of Computer Science, University of Innsbruck

July 7, 2008

This [...]ores their architecture and underlying design principles, using chips from Nvidia's "Geforce" series as examples.

1 Introduction

Before we dive into the architectural details of some example GPUs, we'll have a look at some basic concepts of graphics processing and 3D graphics, which will make it easier for us to understand the functionality of GPUs.

1.1 What is a GPU?

A GPU (Graphics Processing Unit) is essentially a dedicated hardware device that is responsible for trans[...] paper, we will focus on 3D graphics, since that is what modern GPUs are mainly designed for.

1.2 The anatomy of a 3D scene

Figure 1: A 3D scene

3D scene: A collection of 3D objects and lights.

Figure 2: Object, triangle and vertices

3D objects: Arbitrary objects, [...]ns are composed of vertices.

Vertex: A point with spatial coordinates and other information such as color and texture coordinates.

Figure 3: A cube with a checkerboard texture

Texture: An image that is mapped onto the surface of a 3D object, which creates the illusion of an object [...] The vertices of an object store the so-called texture coordinates (2-dimensional vectors) that specify how a texture is mapped onto any given surface.

Figure 4: Texture coordinates of a triangle with a brick texture

In order to translate such a 3D scene to a 2D image, the data has to go through several stages of a "Graphics Pipeline".

1.3 The Graphics Pipeline

Figure 5: The 3D Graphics Pipeline

First, among some other operations, we have to translate the data that is provided by the application from 3D to 2D.

1.3.1 Geometry Stage

This stage is also referred to as the "Transform and Lighting" stage. In order to translate the scene from 3D to 2D, all the objects of a scene need to be transformed to various spaces - each with its own coordinate system. [...] Transformations are applied on a vertex-to-vertex basis.

Mathematical Principles

A point in 3D space usually has 3 coordinates [...]. If we keep using 3-dimensional vectors for the transformation calculations, we run into the problem that different transformations require different operations (e.g. translating a vertex requires addition with a vector while rotating a vertex requires multiplication with a 3x3 matrix). We circumvent this problem simply by extending the 3-dimensional vector by another coordinate (the w-coordinate), which yields so-called homogeneous coordinates. This way, every transformation can be applied by multiplying the vector with a specific 4x4 matrix, making calculations much easier.

Figure 6: Transformation matrices for translation, rotation and scaling
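To make the 4x4 formulation concrete, here is a minimal Python sketch that applies translation, scaling and rotation to a vertex with one and the same matrix-vector product. The helper names (mat_vec, translation, scaling, rotation_z) are illustrative assumptions and not part of any graphics API; the matrices are the standard textbook forms, which may differ in layout from Figure 6.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    # The offsets live in the fourth column and are picked up by w = 1.
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def scaling(sx, sy, sz):
    return [[sx, 0, 0, 0],
            [0, sy, 0, 0],
            [0, 0, sz, 0],
            [0, 0, 0, 1]]

def rotation_z(angle):
    # Counter-clockwise rotation about the z axis, angle in radians.
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0],
            [s,  c, 0, 0],
            [0,  0, 1, 0],
            [0,  0, 0, 1]]

vertex = [1.0, 2.0, 3.0, 1.0]            # w = 1 marks a point
moved = mat_vec(translation(10, 0, -1), vertex)
scaled = mat_vec(scaling(2, 2, 2), vertex)
rotated = mat_vec(rotation_z(math.pi / 2), [1.0, 0.0, 0.0, 1.0])
```

Because every transformation is now a matrix, several of them can be combined into one matrix up front and applied to each vertex with a single multiplication.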

Lighting, the other major part of this pipeline stage, is calculated using the normal vectors of the surfaces [...] In combination with the position of the camera and the position of the light source, one can compute the lighting properties of a given vertex.

Figure 7: Calculating lighting
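As an illustration of how a normal vector and a light position combine, the following Python sketch computes a diffuse term for one vertex. It assumes a simple Lambertian model (intensity proportional to the cosine between normal and light direction), which the text does not name. The camera position, which only matters for view-dependent effects such as specular highlights, is left out. The function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def diffuse_intensity(vertex_pos, normal, light_pos):
    """Lambertian diffuse term in [0, 1] (assumed model, not from the paper)."""
    n = normalize(normal)
    to_light = normalize([l - p for l, p in zip(light_pos, vertex_pos)])
    cos_angle = sum(a * b for a, b in zip(n, to_light))
    # Surfaces facing away from the light receive no diffuse light.
    return max(0.0, cos_angle)
```

A vertex whose normal points straight at the light gets intensity 1, and one facing away gets 0.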

For transformation, we start out in the model space where each object (model) has its own coordinate system, which facilitates geometric transformations such as translation, rotation and scaling.

After that, we move on to the world space, where all objects within the scene have a unified coordinate system.

Figure 8: World space coordinates

The next step is the transformation into view space, which locates a camera in the world space and then transforms the scene such that the camera is at the origin of the view space, looking straight into the [...] can define a view volume, the so-called view frustum, which will be used to decide what actually is visible and needs to be rendered.

Figure 9: The camera/eye, the view frustum and its clipping planes

After that, the vertices are transformed into clip space and assembled into primitives (triangles or lines), [...] Objects that are outside of the frustum don't need to be rendered and can be discarded, objects that are partially inside the frustum need to be clipped (hence the name), and new vertices with proper texture and color coordinates need to be created.

A perspective divide is then performed, which transforms the frustum into a cube with normalized coordinates (x and y between -1 and 1, z between 0 and 1) while the objects inside the frustum are scaled [...] this normalized cube facilitates clipping operations and sets up the projection into 2D space (the cube simply needs to be "flattened").

Figure 10: Transforming into clip space
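The perspective divide itself is a small computation: each clip-space coordinate is divided by the w-coordinate. The Python sketch below shows the divide and a containment check against the normalized cube described above. The function names are illustrative assumptions, and the input is assumed to have a non-zero w.

```python
def perspective_divide(clip):
    """Map a clip-space vertex (x, y, z, w) to normalized coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

def inside_normalized_cube(ndc):
    """True if the point lies in x, y in [-1, 1] and z in [0, 1]."""
    x, y, z = ndc
    return -1.0 <= x <= 1.0 and -1.0 <= y <= 1.0 and 0.0 <= z <= 1.0

visible = perspective_divide((2.0, -4.0, 1.0, 4.0))
clipped = perspective_divide((8.0, 0.0, 1.0, 4.0))
```

Here `visible` ends up inside the cube, while `clipped` has x = 2 and would be rejected or clipped.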

Finally, we can move into screen space, where x and y coordinates are transformed for proper 2D display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth operations.)

Figure 11: From view space to screen space
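A sketch of this last mapping, under the assumption that normalized x and y lie in [-1, 1] and that the window's origin is its top-left corner with y growing downward (a common convention, but not one stated in the text):

```python
def to_screen(ndc, width, height):
    """Map normalized (x, y, z) to window coordinates; z is passed through."""
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height   # flip y: +1 is the top edge of the window
    return (sx, sy, z)
```

For a 640x480 window, the center of the normalized cube lands at pixel position (320, 240), and the depth value is kept unchanged for the later depth test.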

Note that the texture coordinates need to be transformed as well, and that besides clipping, surfaces that aren't visible (e.g. the backside of a cube) are removed as well (so-called back face culling).

The result is a 2D image of the 3D scene, and we can move on to the next stage.

1.3.2 Rasterization Stage

The rasterizer needs to traverse the 2D image and convert the data into a number of "pixel candidates", so-called fragments, which may later become pixels of the final [...] A fragment is a data structure that contains attributes such as position, color, depth, texture coordinates, [...] Fragments are generated by checking which part of any given primitive intersects with which pixel [...] If a fragment intersects with a primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated by interpolating the attributes between the vertices.

Figure 12: Rasterizing a triangle and interpolating its color values
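One common way to interpolate attributes across a triangle is with barycentric weights. The Python sketch below assumes that approach (the text does not say which scheme the hardware uses) and ignores perspective correction. All names are illustrative.

```python
def edge(a, b, p):
    """Twice the signed area of the triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def barycentric(a, b, c, p):
    """Weights of vertices a, b, c at point p; they sum to 1."""
    area = edge(a, b, c)
    return (edge(b, c, p) / area, edge(c, a, p) / area, edge(a, b, p) / area)

def interpolate(triangle, attributes, p):
    """Interpolated attribute tuple at p, or None if p is outside the triangle."""
    weights = barycentric(*triangle, p)
    if any(w < 0 for w in weights):
        return None  # this pixel position is not covered by the primitive
    size = len(attributes[0])
    return tuple(sum(w * attr[i] for w, attr in zip(weights, attributes))
                 for i in range(size))

triangle = ((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))
colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))  # red, green, blue
inside = interpolate(triangle, colors, (1.0, 1.0))
outside = interpolate(triangle, colors, (5.0, 5.0))
```

The same weights can be reused for every attribute of the fragment (color, depth, texture coordinates), which is why the coverage test and the interpolation fit naturally into one unit.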

After that, further steps can be made to obtain the final [...] are calculated by combining textures with other attributes such as color and lighting or by combining a fragment with either another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).

Visibility checks are performed, such as:

• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (checking visibility against translucent fragments)
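Of the checks listed above, the depth test is the easiest to sketch. The Python example below keeps one depth value per pixel and lets a fragment through only if it is closer than what is already stored. The buffer layout, the "smaller z is closer" convention and the fragment tuple format are assumptions made for illustration.

```python
def depth_test(depth_buffer, color_buffer, fragments):
    """Write each fragment only if it is closer than the stored depth."""
    for x, y, z, color in fragments:
        if z < depth_buffer[y][x]:      # smaller z = closer to the viewer
            depth_buffer[y][x] = z
            color_buffer[y][x] = color

width, height = 2, 1
depth = [[1.0] * width for _ in range(height)]    # 1.0 = far plane
color = [[None] * width for _ in range(height)]

depth_test(depth, color, [
    (0, 0, 0.8, "red"),
    (0, 0, 0.3, "blue"),    # closer than red: replaces it
    (0, 0, 0.5, "green"),   # further than blue: discarded
    (1, 0, 0.9, "gray"),
])
```

The result does not depend on the order in which opaque fragments arrive, which is what makes this test suitable for highly parallel hardware.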

Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of pixels that can be written into memory for later display.

This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what kind of functionality will be required of a GPU.

2 Evolution of the GPU

Some historical key points in the development of the GPU:

• Efforts for real-time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphics chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985 The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991 S3 introduced the first single-chip 2D accelerator, the S3 86C911
• 1995 Nvidia releases one of the first 3D accelerators, the NV1
• 1999 Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001 Nvidia implements the first programmable shader units with the Geforce 3
• 2005 ATI develops the first GPU with unified shader architecture with the ATI Xenos for the Xbox 360
• 2006 Nvidia launches the first unified shader GPU for the PC with the Geforce 8800

3 From Theory to Practice - the Geforce 6800

3.1 Overview

Modern GPUs closely follow the layout of the graphics pipeline described in the first [...] Using Nvidia's Geforce 6800 as an example, we will have a closer look at the architecture of modern-day GPUs.

Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs (besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.

Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics chipsets and the fourth generation that featured programmability (more on that later).

The following image shows a schematic view of the Geforce 6800 and its functional units.

Figure 13: Schematic view of the Geforce 6800

You can already see how each of the functional units corresponds to the stages of the graphics pipeline.

We start with six parallel vertex processors that receive data from the host (the CPU) and perform operations such as transformation and lighting.

Next, the output goes into the triangle setup stage, which takes care of primitive assembly, culling and clipping, [...] The Geforce 6800 has an additional Z-cull unit which performs an early fragment visibility check based on depth, further improving the efficiency.

We then move on to the sixteen fragment processors, which operate in 4 parallel units and compute the output colors of each fragment.

The fragment crossbar is a linking element that is basically responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.

The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending, depth tests, etc., before delivering the final pixel to the frame buffer.

3.2 In Detail

Figure 14: A more detailed view of the Geforce 6800

While most parts of the GPU are fixed-function units, the vertex and fragment processors of the Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001). We'll have a more detailed look at the units in the following sections.

3.2.1 Vertex Processor

Figure 15: A vertex processor

The vertex processors are the programmable units responsible for all the vertex transformations and at[...] They operate with 4-dimensional data vectors corresponding with the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the Instruction RAM.

The data path of a vertex processor consists of:

• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit

Instruction set:

Some notable instructions for the vertex processor include:

dp4 dst, src0, src1     Computes the four-component dot product of the source registers
exp dst, src            Provides exponential 2^x
dst dest, src0, src1    Calculates a distance vector
nrm dst, src            Normalizes a 3D vector
rsq dst, src            Computes the reciprocal square root (positive only) of the source scalar

Registers in the vertex processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components

Other technical details:

• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one cycle

3.2.2 Fragment Processor

Figure 16: A fragment processor

The fragment processors are grouped into 4 bigger units which operate simultaneously on 4 fragments each (a so-called quad). They can take position, color, depth, fog as well as other arbitrary 4-dimensional attributes as input.

The data path consists of:

• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit

Superscalarity:

A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of the vector need to be treated separately ([...], alpha). Thus, the fragment processor supports co-issuing of the data, which means splitting the vector into 2 parts and executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier).

Additionally, it also features dual issue, which means executing different operations on the 2 vector math units in the same clock.

Texture Unit:

The texture unit is a floating-point texture processor which fetches and filters [...] It is connected to a level 1 texture cache (which stores parts of the textures that are used).

Shader units 1 and 2:

Each shader unit is limited in its abilities, offering complete functionality when used together.

Figure 17: Block diagram of Shader Unit 1 and 2

Shader Unit 1:

• Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
• Red: Interpolators
• Yellow: A special function unit (for functions such as Reciprocal, Reciprocal Square Root, etc.)
• Cyan: MUL channels
• Orange: A unit for texture operations (not the fragment texture unit)

The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs.

The output of the special function unit can go into the MUL channels.

The unit for texture operations gets input from the MUL unit and does LOD (Level Of Detail) calculations, before passing [...] The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit.

The shader unit can simply pass data as well.

Shader Unit 2:

• Red: A crossbar
• Cyan: 4 MUL channels
• Gray: 4 ADD channels
• Yellow: 1 special function unit

The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).

The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.

Again, the shader unit [...] special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

Instruction set:

Some notable instructions for the fragment processor include:

cmp dst, src0, src1, src2             Chooses src1 if src0 >= 0, otherwise src2; the comparison is done per channel
dsx dst, src                          Computes the rate of change in the render target's x-direction
dsy dst, src                          Computes the rate of change in the render target's y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}   Computes sine and cosine, in radians
texld dst, src0, src1                 Samples a texture at a particular sampler, using provided texture coordinates

Registers in the fragment processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Mask destination register components

Other technical details:

• The fragment processors can perform operations with 16- or 32-bit floating point precision ([...] unit uses only 16-bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle, or 4 math operations if there's a texture fetch in shader 1

Figure 18: Possible operations per cycle

• The fragment processors have a 2-level texture cache
• The fog unit can perform fog blending on the final [...] It is implemented with fixed-point precision since that's sufficient for fog and saves performance. The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets: the pixel processor can output to up to four separate buffers (4x4 values, color + depth)
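The fog equation given in the list above can be checked with a few lines of Python. This is a floating-point sketch of the formula only. The fixed-point arithmetic of the actual fog unit is not modeled, and the function name is an assumption.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(f * fog_fraction + s * (1.0 - fog_fraction)
                 for s, f in zip(src_color, fog_color))

red = (1.0, 0.0, 0.0)
gray_fog = (0.5, 0.5, 0.5)
no_fog = fog_blend(red, gray_fog, 0.0)     # fragment color unchanged
half_fog = fog_blend(red, gray_fog, 0.5)   # halfway between both colors
full_fog = fog_blend(red, gray_fog, 1.0)   # fragment fully hidden by fog
```

The equation is a plain linear interpolation between the fragment color and the fog color, which is why low precision is sufficient for it.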

3.2.3 Pixel Engine

Figure 19: A pixel engine

Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific [...] the lossless color and depth compression, the depth and color units perform depth, color and stencil operations before writing the final [...] When activated, the pixel engines also perform multisample antialiasing.

3.2.4 Memory

From "GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture":

"The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently [...] All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit."

3.3 Performance

• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors, the power sources are recommended to be no less than 480 W)
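The bandwidth figure in the list above follows from the bus width and the memory clock. The Python sketch below reproduces it, assuming that the memory transfers data twice per clock (double data rate). That factor is not stated in the list and is marked as an assumption in the code.

```python
def memory_bandwidth_gb_per_s(bus_width_bits, memory_clock_mhz, transfers_per_clock=2):
    """Peak bandwidth in GB/s (1 GB = 10^9 bytes).

    transfers_per_clock=2 is an assumption (double data rate memory).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

peak = memory_bandwidth_gb_per_s(bus_width_bits=256, memory_clock_mhz=550)
```

With the 256-bit memory interface and the 550 MHz memory clock, this gives the 35.2 GByte/second listed above.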

4 Computational Principles

Stream Processing:

Typical CPUs (the von Neumann architecture) suffer [...] very sensitive to such bottlenecks, and therefore need a different architecture; they are essentially special-purpose stream processors.

[...] A stream is a set of data [...] In stream processors, every kernel takes one or more streams as input and outputs one or more streams, while it executes its operations on every single element of the input streams.
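The kernel-and-stream model can be mimicked in a few lines of Python. In this sketch a stream is simply a list and a kernel is a function that sees one element of each input stream at a time. The names run_kernel and scale_add_kernel are made up for illustration.

```python
def run_kernel(kernel, *input_streams):
    """Apply the kernel independently to every element of the input streams."""
    return [kernel(*elements) for elements in zip(*input_streams)]

def scale_add_kernel(x, y, a=2.0):
    # The kernel sees only its own element(s), never its neighbours.
    return a * x + y

out_stream = run_kernel(scale_add_kernel, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Since no element depends on another, the loop inside run_kernel could be distributed over any number of processing units without changing the result.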

In stream processors you can achieve several levels of parallelism:

• Instruction-level parallelism: kernels perform hundreds of instructions on every stream element; you achieve parallelism by performing independent instructions in parallel
• Data-level parallelism: kernels perform the same instructions on each stream element; you achieve parallelism by performing one instruction on many stream elements at a time
• Task-level parallelism: have multiple stream processors divide the work from one kernel

Stream processors do not use caching the same way traditional processors do, since the input datasets are usually much larger than most caches and the data is barely reused - with GPUs, for example, the data is usually rendered and then discarded.

We know GPUs have to work with large amounts of data; the computations are simpler, but they need to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.

Continuing these ideas, GPUs employ the following strategies to increase output:

Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform [...] are pipelined, which means that instead of performing complete processing of a pixel before moving on to the next, you fill the pipeline like an assembly line where each component performs a [...] While processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.
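The throughput gain from pipelining can be put into numbers with a small model. The Python sketch below assumes an idealized pipeline in which every stage takes exactly one clock and never stalls. Real hardware is messier, and the function names are illustrative.

```python
def cycles_without_pipelining(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages

def cycles_with_pipelining(items, stages):
    """One item enters per clock; the last one leaves after the fill time."""
    if items == 0:
        return 0
    return stages + (items - 1)

sequential = cycles_without_pipelining(items=1000, stages=5)
pipelined = cycles_with_pipelining(items=1000, stages=5)
```

For 1000 pixels and 5 stages, the pipelined version needs 1004 clocks instead of 5000, which approaches the one pixel per clock mentioned above once the pipe is full.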

Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism; you could have an unlimited amount of pipelines next to each other, as long as the CPU is able to keep them busy.

Other GPU characteristics:

• GPUs can afford large amounts of floating point computational power since they have lower control overhead
• They use dedicated functional units for specialized tasks to increase speeds
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage, employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data; rather than implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be designed accordingly

4.1 The Geforce 6800 as a general processor

You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory bandwidth that can be used for other applications as well.

Figure 20: A general view of the Geforce 6800 architecture

Looking at the GPU that way, we get:

• 2 serially running programmable blocks with fp32 capability
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data "point" to multiple "fragments")
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point values at a time
• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the Z-cull unit

5 The next step: the Geforce 8800

After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the [...] by the desire to increase performance, improve image quality and facilitate programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader architecture (note that ATI already used this architecture in 2005 with the Xbox 360 GPU).

Figure 21: From dedicated to unified architecture

Figure 22: A schematic view of the Geforce 8800

The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different shader stages become one single stage that can handle all the different shaders.

As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor [...] familiar units such as the raster operators (blue, at the bottom) and the triangle setup, [...] Besides these units we now have several managing units that prepare and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread processor).

Figure 23: The streaming processor array

[...] texture processor cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.

A streaming [...] streaming processors work with 32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making a vector architecture more inefficient. [...] are driven by a high-speed clock that is separate [...] multiprocessor can have 768 hardware-scheduled threads, grouped together into 24 SIMD "warps" (a warp is a group of threads).

The texture pipe consists of 4 texture addressing and 8 texture filtering [...] performs texture prefetching and filtering without consuming ALU resources, further increasing efficiency.

It [...] example, the old problem of constantly changing workload and one shader stage becoming a processing bottleneck is solved since the units can adapt dynamically, now that they are unified.

Figure 24: Workload balancing with both architectures

With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support of new data types (integer calculations), programming the GPU now becomes easier as well.

6 General Purpose Programming on the GPU - an example

We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.

Bitonic merge sort:

Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.

E.g.:

List A = (3, 4, 7, 8)
List B = (6, 5, 2, 1)
List AB = (3, 4, 7, 8, 6, 5, 2, 1) is a bitonic list
List BA = (6, 5, 2, 1, 3, 4, 7, 8) is also a bitonic list

Bitonic lists have a special property that is used in bitonic merge sort: Suppose you have a bitonic list of length 2n. You can rearrange the list so that you get two halves with n elements where each element (i) of the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or equal, if the list descends first and then ascends) [...] happens by com[...] This procedure is called a bitonic merge.
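The rearrangement step can be sketched in Python as follows. The sketch assumes the standard formulation, in which element i is compared with element i+n and the smaller one is placed in the first half. The function name bitonic_split is made up for illustration.

```python
def bitonic_split(values):
    """Split a bitonic list of length 2n into a 'low' and a 'high' half."""
    n = len(values) // 2
    low = [min(values[i], values[i + n]) for i in range(n)]
    high = [max(values[i], values[i + n]) for i in range(n)]
    return low, high

# List AB from the example above.
low, high = bitonic_split([3, 4, 7, 8, 6, 5, 2, 1])
```

For list AB this yields (3, 4, 2, 1) and (6, 5, 7, 8): every element of the first half is no larger than its counterpart in the second half, and both halves are again bitonic, so the same step can be applied to each half. All n comparisons are independent of each other, which is what makes the procedure attractive for parallel hardware.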

Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until [...] Figure 25 illustrates the process:

Figure 25: The different stages of the algorithm

The sorting process [...] This results in bitonic merge sort having a complexity of $O(n \log^2(n) + \log(n))$, which is worse than quicksort, but the algorithm has no worst-case scenario (where quicksort reaches $O(n^2)$).
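For reference, a complete bitonic merge sort fits in a short Python function. This is the usual iterative CPU formulation for power-of-two lengths, written as a sketch. It is not the GPU implementation discussed in this section, and the pass counter is added only to make the number of stages visible.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two; returns (sorted, passes)."""
    data = list(values)
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two"
    passes = 0
    k = 2                      # size of the bitonic lists being merged
    while k <= n:
        j = k // 2             # distance between compared partners
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap if the pair is out of order for its sort direction.
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            passes += 1
            j //= 2
        k *= 2
    return data, passes

result, passes = bitonic_sort([3, 4, 7, 8, 6, 5, 2, 1])
```

For n = 8 the sort takes 6 passes, i.e. log2(n) * (log2(n) + 1) / 2. Each pass performs n/2 independent compare operations, which is where the n log^2(n) term comes from. The sequence of comparisons is the same for every input, so there is no worst case.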

[...] the operations can be performed in parallel and the length stays constant, [...] When implementing this algorithm on the GPU, we want to make use of as many resources as possible (both in parallel as well as vertically along the pipeline), especially considering that the GPU has shortcomings as well, such as the lack of support [...] For example, simply letting the fragment processor stage handle all the calculations might work, [...]ble solution looks like this:

In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while [...] We draw a vertex quad over two adjacent groups and set appropriate flags at each corner, [...] For example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs by simply looking at its sign (the interpolation process takes care of that).

Figure 26: Using vertex flags and the interpolator to determine compare operations
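The effect of the interpolated flag can be imitated on the CPU. The Python sketch below assumes a one-dimensional quad spanning two adjacent groups, fragments sampled at pixel centers (index + 0.5), and a partner that sits exactly one group size away. These conventions are assumptions made for illustration and are not taken from the shader.

```python
def interpolated_flag(x, quad_left, quad_right, left_flag=-1.0, right_flag=1.0):
    """Linear interpolation of the corner flags, as the rasterizer would do."""
    t = (x - quad_left) / (quad_right - quad_left)
    return left_flag + t * (right_flag - left_flag)

def partner_index(i, quad_left, group_size):
    """Index of the element that fragment i is compared with."""
    flag = interpolated_flag(i + 0.5, quad_left, quad_left + 2 * group_size)
    # Negative flag: left group, partner lies to the right, and vice versa.
    direction = 1 if flag < 0 else -1
    return i + direction * group_size

partners = [partner_index(i, quad_left=0, group_size=4) for i in range(8)]
```

For a quad over elements 0 to 7, the left group (0 to 3) is paired with the right group (4 to 7) and vice versa, without any per-fragment branching on the index. Only the sign of a value that the interpolator delivers for free is inspected.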

Next, we need to determine which compare operation to use and we need to locate the partner item to [...] can again be accomplished by using the flag. Setting the compare operation to less-than and multiplying with the flag value implicitly flips the operation to greater-equal halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies the distance between the items.

In order to sort elements of an array, we store them in a 2D texture. Each row is a [...] extend the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so the [...] pairs of rows become bitonic sequences again, which can be sorted in the same way we sorted the columns of the single rows, simply by transposing the quads.

As a final optimization we reduce texture fetching by packing two neighbouring key pairs into one fragment, since the shader operates on 4-vectors.

Performance comparison:

std::sort: 16-bit data, Pentium 4 3.0 GHz

N         Full Sorts/Sec   Sorted Keys/Sec
256^2     82.5             5.4 M
512^2     20.6             5.4 M
1024^2    4.7              5.0 M

Bitonic merge sort: 16-bit float data, NVIDIA Geforce 6800 Ultra

N         Passes   Full Sorts/Sec   Sorted Keys/Sec
256^2     120      90.07            6.1 M
512^2     153      18.3             4.8 M
1024^2    190      3.6              3.8 M

GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise sorting of the bitonic merge sort:

uniform sampler2D PackedData;

// contents of the texcoord data
#define OwnPos          gl_TexCoord[0].xy
#define SearchDir       gl_TexCoord[0].z
#define CompOp          gl_TexCoord[0].w
#define Distance        gl_TexCoord[1].x
#define Stride          gl_TexCoord[1].y
#define Height          gl_TexCoord[1].z
#define HalfStrideMHalf gl_TexCoord[1].w

void main(void)
{
    // get self
    vec4 self = texture2D(PackedData, OwnPos);

    // restore sign of search direction and assemble vector to partner
    vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);

    // get the partner
    vec4 partner = texture2D(PackedData, OwnPos + adr);

    // switch ascending/descending sort for every other row
    // by modifying comparison flag
    float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);

    // x and y are the keys of the two items
    // --> multiply with comparison flag
    vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);

    // compare the keys and store accordingly
    // z and w are the indices
    // --> just copy them accordingly
    // (the three truncated statements below are reconstructed from the
    //  surrounding comments; treat them as an assumption)
    vec4 result;
    result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
    result.yw = (keys.y < keys.w) ? self.yw : partner.yw;

    // do pass 0
    compare *= adr.x;
    gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
}

7 Current and future developments

Nvidia's current top-of-the-line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader count (from 128 to 240) [...] launch date was the 17th of June 2008.

ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon HD 2900 XT), [...] current top-of-the-line model is the Radeon HD 3870 X2 (which actually sports 2 GPUs on one card), which was released in January 2008.

ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for somewhere around July 2008).

With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU has [...], GPUs have made their way into non-graphics fields as varied as audio signal processing and weather forecasting.

Both ATI/AMD and Nvidia have made eff[...] released CTM (Close To Metal), [...]e writing the software, CTM's commercial successor AMD Stream SDK was released in December 2007, now providing additional high-level tools for general-purpose access to AMD graphics hardware.

Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, a C la[...]uary 2008, Nvidia bought Ageia and their PhysX engine (a proprietary real-time physics engine middleware SDK) and integrated it into their CUDA framework.

[...] actually scale well beyond Moore's law, [...] such a rapid development we can certainly expect to see quite some interesting things to come in this field of processing.

References

[1] Wikipedia entry on GPUs
/wiki/GPU
[2] Kees Huizing, Han-Wei Shen: "The Graphics Rendering Pipeline"
/~keesh/ow/2IV40/
[3] Cyril Zeller: "Introduction to the Hardware Graphics Pipeline", Presentation at ACM SIGGRAPH 2005
/developer/presentations/2005/I3D/I3D_05_
[4] ExtremeTech 3D Pipeline Tutorial
/article2/0,1697,9722,
[5] Ashu Rege: "Introduction to 3D Graphics for Games"
/docs/IO/11278/
[6] DirectX Developer Center: "The Direct3D Transformation Pipeline"
/en-us/library/bb206260(VS.85).aspx
[7] Mark Colbert: "GPU Architecture & CG"
/gpuseminar/
[8] GPU Gems 2, Chapter 30: "The GeForce 6 Series GPU Architecture"
/developer/GPU_Gems_2/GPU_Gems2_
[9] IEEE Micro, Volume 25, Issue 2 (March 2005): "The GeForce 6800"
/?id=1069760
[10] [...]: "NV40-Technik im Detail"
/artikel/nv40_pipeline/
[11] [...]: "NVIDIA GeForce 6800 Ultra (NV40)"
/articles2/gffx/
[12] Austin Robison, Abe Winter: "An Overview of Graphics Processing Hardware"
/~robison/src/gpu_
[13] John Montrym, Henry Moreton: "NVIDIA GeForce 6800", Hot Chips 16
/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_
[14] Ajit Datar, Apurva Padhye: "Graphics Processing Unit Architecture"
/~data0003/Talks/
[15] Sven Schenk: "Eine Einfuehrung in die Architektur moderner Graphikprozessoren"
/Lehre/Seminar0506/
[16] Thomas Scott Crow: "Evolution of the Graphical Processing Unit"
/~fredh/papers/thesis/023-crow/
[17] DirectX Developer Center: "Asm Shader Reference"
/en-us/library/bb219840(VS.85).aspx
[18] Erik Lindholm, Stuart Oberman: "NVIDIA GeForce 8800 GPU"
/archives/hc19/2_Mon/HC19.02/
[19] [...]: "Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)"
/articles2/video/
[20] Richard Hough, Richard Yu: "GPU Architecture"
/courses/ece685/slides/
[21] Technical Brief: "NVIDIA GeForce 8800 GPU Architecture Overview"
/object/IO_
[22] GPU Gems 2, Chapter 46: "Improved GPU Sorting"
[23] Tim Purcell: "Sorting and Searching", SIGGRAPH 2005 GPGPU Course
/s2005/slides/
[24] Peter Kipfer, Mark Segal, Ruediger Westermann: "UberFlow: A GPU-Based Particle Engine"
/previous/www_2004/Presentations/
[25] Wikipedia entry on Nvidia
/wiki/Nvidia_Corporation
[26] Wikipedia entry on ATI
/wiki/ATI_Technologies_Inc.
[27] Wikipedia entry on CUDA
/wiki/CUDA
[28] Wikipedia entry on CTM
/wiki/Close_to_Metal
[29] William Mark, Henry Moreton: "3D Graphics Architecture Tutorial"
/users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_

2024年7月22日发(作者:宜方仪)

GPUs-GraphicsProcessingUnits

MinhTriDoDinh

-Dinh@

VertiefungsseminarArchitekturvonProzessoren,SS2008

InstituteofComputerScience,UniversityofInnsbruck

July7,2008

Thisores

theirarchitectureandunderlyingdesignprinciples,usingchipsfromNvidia’s”Geforce”seriesas

examples.

1Introduction

BeforewediveintothearchitecturaldetailsofsomeexampleGPUs,we’llhavealookatsomebasicconcepts

ofgraphicsprocessingand3Dgraphics,whichwillmakeiteasierforustounderstandthefunctionalityof

GPUs

1.1WhatisaGPU?

AGPU(GraphicsProcessingUnit)isessentiallyadedicatedhardwaredevicethatisresponsiblefortrans-

paper,wewillfocusonthe3Dgraphics,sincethatis

whatmodernGPUsaremainlydesignedfor.

1.2Theanatomyofa3Dscene

Figure1:A3Dscene

3Dscene:Acollectionof3Dobjectsandlights.

1

Figure2:Object,triangleandvertices

3Dobjects:Arbitraryobjects,nsarecomposedof

vertices.

Vertex:APointwithspatialcoordinatesandotherinformationsuchascolorandtexturecoordinates.

Figure3:Acubewithacheckerboardtexture

Texture:Animagethatismappedontothesurfaceofa3Dobject,whichcreatestheillusionofanobject

ticesofanobjectstoretheso-calledtexturecoordinates

(2-dimensionalvectors)thatspecifyhowatextureismappedontoanygivensurface.

Figure4:Texturecoordinatesofatrianglewithabricktexture

2

Inordertotranslatesucha3Dscenetoa2Dimage,thedatahastogothroughseveralstagesofa”Graphics

Pipeline”

1.3TheGraphicsPipeline

Figure5:The3DGraphicsPipeline

First,amongsomeotheroperations,wehavetotranslatethedatathatisprovidedbytheapplicationfrom

3Dto2D.

1.3.1GeometryStage

Thisstageisalsoreferredtoasthe”TransformandLighting”rtotranslatethescenefrom

3Dto2D,alltheobjectsofasceneneedtobetransformedtovariousspaces-eachwithitsowncoordinate

ransformationsareappliedona

vertex-to-vertexbasis.

MathematicalPrinciples

Apointin3Dspaceusuallyhas3coordinates,epusing3-dimensional

vectorsforthetransformationcalculations,werunintotheproblemthatdifferenttransformationsrequire

diff:translatingavertexrequiresadditionwithavectorwhilerotatingavertexrequires

multiplicationwitha3x3matrix).Wecircumventthisproblemsimplybyextendingthe3-dimensionalvector

byanothercoordinate(thew-coordinate),y,

everytransformationcanbeappliedbymultiplyingthevectorwithaspecific4x4matrix,makingcalculations

mucheasier.

Figure6:Transformationmatricesfortranslation,rotationandscaling

3

Lighting,theothermajorpartofthispipelinestageiscalculatedusingthenormalvectorsofthesurfaces

inationwiththepositionofthecameraandthepositionofthelightsource,onecan

computethelightingpropertiesofagivenvertex.

Figure7:Calculatinglighting

Fortransformation,westartoutinthemodelspacewhereeachobject(model)hasitsowncoordinate

system,whichfacilitatesgeometrictransformationssuchastranslation,rotationandscaling.

Afterthat,wemoveontotheworldspace,whereallobjectswithinthescenehaveaunifiedcoordinate

system.

Figure8:Worldspacecoordinates

Thenextstepisthetransformationintoviewspace,whichlocatesacameraintheworldspaceandthen

transformsthescene,suchthatthecameraisattheoriginoftheviewspace,lookingstraightintothe

andefineaviewvolume,theso-calledviewfrustrum,whichwillbeusedto

decidewhatactuallyisvisibleandneedstoberendered.

4

Figure9:Thecamera/eye,theviewfrustrumanditsclippingplanes

Afterthat,theverticesaretransformedintoclipspaceandassembledintoprimitives(trianglesorlines),

bjectsthatareoutsideofthefrustrumdon’tneedto

berenderedandcanbediscarded,objectsthatarepartiallyinsidethefrustrumneedtobeclipped(hence

thename),andnewverticeswithpropertextureandcolorcoordinatesneedtobecreated.

A perspective divide is then performed, which transforms the frustum into a cube with normalized coordinates (x and y between -1 and 1, z between 0 and 1) while the objects inside the frustum are scaled accordingly. This normalized cube facilitates clipping operations and sets up the projection into 2D space (the cube simply needs to be "flattened").

Figure 10: Transforming into clip space
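The clip-space transformation and the perspective divide can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the choice of a camera looking along +z, and the field-of-view/near/far parameters are assumptions; the mapping of z to the range 0..1 follows the normalized cube described above.

```python
import math

def project(point, fov_y, aspect, near, far):
    """View-space point -> clip-space (x, y, z, w), camera looking along +z."""
    x, y, z = point
    f = 1.0 / math.tan(fov_y / 2.0)
    q = far / (far - near)
    return [x * f / aspect, y * f, z * q - near * q, z]

def inside_frustum(clip):
    """Clipping test in clip space, before the divide."""
    x, y, z, w = clip
    return -w <= x <= w and -w <= y <= w and 0.0 <= z <= w

def perspective_divide(clip):
    """Divide by w to obtain normalized coordinates."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]

fov, aspect, near, far = math.pi / 2, 1.0, 1.0, 100.0
near_ndc = perspective_divide(project((0.0, 0.0, 1.0), fov, aspect, near, far))
far_ndc = perspective_divide(project((0.0, 0.0, 100.0), fov, aspect, near, far))
visible = inside_frustum(project((0.0, 0.0, 50.0), fov, aspect, near, far))
too_close = inside_frustum(project((0.0, 0.0, 0.5), fov, aspect, near, far))
off_to_the_side = inside_frustum(project((60.0, 0.0, 50.0), fov, aspect, near, far))
```

A point on the near plane ends up at z = 0 and a point on the far plane at z = 1, which is the depth range used by the later depth test.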

Finally, we can move into screen space, where x and y coordinates are transformed for proper 2D display (in a given window). (Note that the z-coordinate of a vertex is retained for later depth operations.)

Figure 11: From view space to screen space
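The final mapping into screen space is a simple scale and offset. The sketch below assumes a window whose origin is in the top-left corner with y pointing down; both that convention and the name `to_screen` are assumptions for illustration.

```python
def to_screen(ndc, width, height):
    """Map normalized x, y in [-1, 1] to window pixels; z is passed through."""
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width
    sy = (1.0 - y) * 0.5 * height   # flip y: screen origin assumed top-left
    return sx, sy, z

center = to_screen((0.0, 0.0, 0.5), 640, 480)
top_left = to_screen((-1.0, 1.0, 0.0), 640, 480)
bottom_right = to_screen((1.0, -1.0, 1.0), 640, 480)
```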


Note that the texture coordinates need to be transformed as well. Besides clipping, surfaces that aren't visible (e.g. the backside of a cube) are removed as well (so-called back face culling).

The result is a 2D image of the 3D scene, and we can move on to the next stage.

1.3.2 Rasterization Stage

The rasterizer needs to traverse the 2D image and convert the data into a number of "pixel candidates", so-called fragments, which may later become pixels of the final image. A fragment is a data structure that contains attributes such as position, color, depth and texture coordinates. Fragments are generated by checking which part of any given primitive intersects with which pixel. If a fragment intersects with a primitive, but not any of its vertices, the attributes of that fragment have to be additionally calculated by interpolating the attributes between the vertices.

Figure 12: Rasterizing a triangle and interpolating its color values
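A minimal software sketch of this step is shown below. It tests every pixel center against the triangle with edge functions and interpolates the vertex colors with barycentric weights. Real rasterizers are far more elaborate; the names `edge` and `rasterize` and the brute-force loop over all pixels are assumptions made to keep the example short.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return the fragments covered by a 2D triangle with interpolated colors."""
    area = edge(v0, v1, v2)
    fragments = []
    if area == 0:
        return fragments                      # degenerate triangle
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)            # sample at the pixel center
            # Barycentric weights: one per vertex, summing to 1.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                fragments.append({"x": x, "y": y, "color": color})
    return fragments

red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
fragments = rasterize((0, 0), (4, 0), (0, 4), red, green, blue, 4, 4)
```

The fragment closest to the red vertex gets a color that is mostly red with a small share of green and blue, which is exactly the smooth gradient seen in Figure 12.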

After that, further steps can be made to obtain the final pixels. Colors are calculated by combining textures with other attributes such as color and lighting, or by combining a fragment with either another translucent fragment (so-called alpha blending) or optional fog (another graphical effect).

Visibility checks are performed, such as:

• Scissor test (checking visibility against a rectangular mask)
• Stencil test (similar to the scissor test, only against arbitrary pixel masks in a buffer)
• Depth test (comparing the z-coordinate of fragments, discarding those which are further away)
• Alpha test (discarding fragments whose alpha value fails a comparison against a reference value, e.g. fully transparent fragments)

Additional procedures like anti-aliasing can be applied before we achieve the final result: a number of pixels that can be written into memory for later display.
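The interplay of the depth test and alpha blending can be sketched as follows. This is a simplified illustration, not how any particular GPU implements it: the function name `write_fragment`, the "smaller depth is closer" convention, and the rule that only opaque fragments update the depth buffer are assumptions.

```python
def write_fragment(framebuffer, depthbuffer, x, y, color, alpha, depth):
    """Depth-test a fragment, then alpha-blend it over the stored pixel."""
    if depth >= depthbuffer[y][x]:
        return False                  # farther away than the stored pixel: discard
    dst = framebuffer[y][x]
    framebuffer[y][x] = tuple(alpha * s + (1.0 - alpha) * d
                              for s, d in zip(color, dst))
    if alpha == 1.0:
        depthbuffer[y][x] = depth     # assumption: only opaque fragments write depth
    return True

framebuffer = [[(0.0, 0.0, 0.0)]]     # a 1x1 image, initially black
depthbuffer = [[1.0]]                 # initialized to the far plane

wrote_red = write_fragment(framebuffer, depthbuffer, 0, 0, (1.0, 0.0, 0.0), 1.0, 0.5)
wrote_blue = write_fragment(framebuffer, depthbuffer, 0, 0, (0.0, 0.0, 1.0), 1.0, 0.8)
wrote_green = write_fragment(framebuffer, depthbuffer, 0, 0, (0.0, 1.0, 0.0), 0.5, 0.2)
```

The blue fragment lies behind the red one and is discarded, while the half-transparent green fragment in front is blended with the red pixel.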

This concludes our short tour through the graphics pipeline, which hopefully gives us a better idea of what kind of functionality will be required of a GPU.


2 Evolution of the GPU

Some historical key points in the development of the GPU:

• Efforts for real-time graphics have been made as early as 1944 (MIT's project "Whirlwind")
• In the 1980s, hardware similar to modern GPUs began to show up in the research community ("Pixel-Planes", a parallel system for rasterizing and texture-mapping 3D geometry)
• Graphics chips in the early 1980s were very limited in their functionality
• In the late 1980s and early 1990s, high-speed, general-purpose microprocessors became popular for implementing high-end GPUs (e.g. Texas Instruments' TMS340)
• 1985: The first mass-market graphics accelerator was included in the Commodore Amiga
• 1991: S3 introduced the first single-chip 2D accelerator, the S3 86C911
• 1995: Nvidia releases one of the first 3D accelerators, the NV1
• 1999: Nvidia's Geforce 256 is the first GPU to implement Transform and Lighting in hardware
• 2001: Nvidia implements the first programmable shader units with the Geforce 3
• 2005: ATI develops the first GPU with a unified shader architecture, the ATI Xenos for the Xbox 360
• 2006: Nvidia launches the first unified shader GPU for the PC, the Geforce 8800


3 From Theory to Practice - the Geforce 6800

3.1 Overview

Modern GPUs closely follow the layout of the graphics pipeline described in the first section. Using Nvidia's Geforce 6800 as an example, we will have a closer look at the architecture of modern day GPUs.

Since being founded in 1993, the company NVIDIA has become one of the biggest manufacturers of GPUs (besides ATI), having released important chips such as the Geforce 256 and the Geforce 3.

Launched in 2004, the Geforce 6800 belongs to the Geforce 6 series, Nvidia's sixth generation of graphics chipsets and the fourth generation that featured programmability (more on that later).

The following image shows a schematic view of the Geforce 6800 and its functional units.

Figure 13: Schematic view of the Geforce 6800

You can already see how each of the functional units corresponds to a stage of the graphics pipeline.

We start with six parallel vertex processors that receive data from the host (the CPU) and perform operations such as transformation and lighting.


Next, the output goes into the triangle setup stage, which takes care of primitive assembly, culling and clipping. The Geforce 6800 has an additional Z-cull unit, which performs an early fragment visibility check based on depth, further improving the efficiency.

We then move on to the sixteen fragment processors, which operate in 4 parallel units and compute the output colors of each fragment.

The fragment crossbar is a linking element that is basically responsible for directing output pixels to any available pixel engine (also called ROP, short for Raster Operator), thus avoiding pipeline stalls.

The 16 pixel engines are the final stage of processing, and perform operations such as alpha blending, depth tests, etc., before delivering the final pixel to the frame buffer.

3.2 In Detail

Figure 14: A more detailed view of the Geforce 6800

While most parts of the GPU are fixed function units, the vertex and fragment processors of the Geforce 6800 offer programmability, which was first introduced to the Geforce chipset line with the Geforce 3 (2001). We'll have a more detailed look at the units in the following sections.


3.2.1 Vertex Processor

Figure 15: A vertex processor

The vertex processors are the programmable units responsible for all the vertex transformations and attribute calculations. They operate with 4-dimensional data vectors corresponding to the aforementioned homogeneous coordinates of a vertex, using 32 bits per coordinate (hence the 128 bits of a register). Instructions are 123 bits long and are stored in the Instruction RAM.

The data path of a vertex processor consists of:

• A multiply-add unit for 4-dimensional vectors
• A scalar special function unit
• A texture unit

Instruction set:

Some notable instructions for the vertex processor include:

dp4 dst, src0, src1    Computes the four-component dot product of the source registers
exp dst, src           Provides exponential 2^x
dst dest, src0, src1   Calculates a distance vector
nrm dst, src           Normalizes a 3D vector
rsq dst, src           Computes the reciprocal square root (positive only) of the source scalar

Registers in the vertex processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Swizzling (copy any source register component to any temporary register component)
• Mask destination register components


Other technical details:

• Vertex processors are MIMD units (Multiple Instruction Multiple Data)
• They use VLIW (Very Long Instruction Words)
• They operate with 32-bit floating point precision
• Each vertex processor runs up to 3 threads to hide latency
• Each vertex processor can perform a four-wide MAD (Multiply-Add) and a special function in one cycle

3.2.2 Fragment Processor

Figure 16: A fragment processor

The 16 fragment processors are grouped into 4 bigger units which operate simultaneously on 4 fragments each (a so-called quad). They can take position, color, depth, fog as well as other arbitrary 4-dimensional attributes as input.

The data path consists of:

• An interpolation block for attributes
• 2 vector math (shader) units, each with slightly different functionality
• A fragment texture unit

Superscalarity:

A fragment processor works with 4-vectors (vector-oriented instruction set), where sometimes components of the vector need to be treated separately (e.g. color, alpha). Thus, the fragment processor supports co-issuing of the data, which means splitting the vector into 2 parts and executing different operations on them in the same clock. It supports 3-1 and 2-2 splitting (2-2 co-issue wasn't possible earlier).

Additionally, it also features dual issue, which means executing different operations on the 2 vector math units in the same clock.

Texture Unit:

The texture unit is a floating-point texture processor which fetches and filters the texture data. It is connected to a level 1 texture cache (which stores parts of the textures that are used).


Shader units 1 and 2:

Each shader unit is limited in its abilities, offering complete functionality when used together.

Figure 17: Block diagram of Shader Unit 1 and 2

Shader Unit 1:

• Green: A crossbar which distributes the input coming either from the rasterizer or from the loopback
• Red: Interpolators
• Yellow: A special function unit (for functions such as reciprocal, reciprocal square root, etc.)
• Cyan: MUL channels
• Orange: A unit for texture operations (not the fragment texture unit)

The shader unit can perform 2 operations per clock: a MUL on a 3-dimensional vector and a special function, a special function and a texture operation, or 2 MULs.

The output of the special function unit can go into the MUL channels.

The texture operation unit gets input from the MUL unit and does LOD (Level Of Detail) calculations before passing the data to the fragment texture unit. The fragment texture unit then performs the actual sampling and writes the data into registers for the second shader unit.

The shader unit can simply pass data as well.

Shader Unit 2:

• Red: A crossbar
• Cyan: 4 MUL channels
• Gray: 4 ADD channels
• Yellow: 1 special function unit

The crossbar splits the input onto 5 channels (4 components, 1 channel stays free).

The ADD units are additionally connected, allowing advanced operations such as a dot product in one clock.

Again, the shader unit [...] special function is used, the MAD unit can perform up to 2 operations from this list: MUL, ADD, MAD, DP, or any other instruction based on these operations.

Instruction set:

Some notable instructions for the fragment processor include:

cmp dst, src0, src1, src2             Choose src1 if src0 >= 0. Otherwise, choose src2. The comparison is done per channel
dsx dst, src                          Compute the rate of change in the render target's x-direction
dsy dst, src                          Compute the rate of change in the render target's y-direction
sincos dst.{x|y|xy}, src0.{x|y|z|w}   Computes sine and cosine, in radians
texld dst, src0, src1                 Sample a texture at a particular sampler, using provided texture coordinates

Registers in the fragment processor instructions can be modified (with few exceptions):

• Negate the register value
• Take the absolute value of the register
• Mask destination register components

Other technical details:

• The fragment processors can perform operations with 16-bit or 32-bit floating point precision ([...] unit uses only 16-bit precision for its calculations since that is sufficient)
• The quads operate as SIMD units
• They use VLIW
• They run up to 100s of threads to hide texture fetch latency (~256 per quad)
• A fragment processor can perform up to 8 operations per cycle, or 4 math operations if there's a texture fetch in shader 1

Figure 18: Possible operations per cycle

• The fragment processors have a 2-level texture cache
• The fog unit can perform fog blending on the final [...]. It is implemented with fixed-point precision since that's sufficient for fog and saves performance. The equation: out = FogColor * fogFraction + SrcColor * (1 - fogFraction)
• There's support for multiple render targets: the pixel processor can output to up to four separate buffers (4x4 values, color + depth)
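The fog equation given in the list above translates directly into code. The sketch below applies it per color channel; the helper `linear_fog_fraction`, which derives the fraction from a depth value between a start and an end distance, is an assumption added for illustration and is not described in the paper.

```python
def fog_blend(src_color, fog_color, fog_fraction):
    """out = FogColor * fogFraction + SrcColor * (1 - fogFraction), per channel."""
    return tuple(f * fog_fraction + s * (1.0 - fog_fraction)
                 for s, f in zip(src_color, fog_color))

def linear_fog_fraction(depth, fog_start, fog_end):
    """Hypothetical linear fog: 0 before fog_start, 1 beyond fog_end."""
    t = (depth - fog_start) / (fog_end - fog_start)
    return min(max(t, 0.0), 1.0)

fogged = fog_blend((1.0, 0.0, 0.0), (0.5, 0.5, 0.5), 0.25)
halfway = linear_fog_fraction(15.0, 10.0, 20.0)
```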

3.2.3 Pixel Engine

Figure 19: A pixel engine

Last in the pipeline are the 16 pixel engines (raster operators). Each pixel engine connects to a specific memory partition. The pixel engines use lossless color and depth compression; the depth and color units perform depth, color and stencil operations before writing the final pixel. When activated, the pixel engines also perform multisample antialiasing.

3.2.4 Memory

From “GPU Gems 2, Chapter 30: The GeForce 6 Series GPU Architecture”:

“The memory system is partitioned into up to four independent memory partitions, each with its own dynamic random-access memories (DRAMs). GPUs use standard DRAM modules rather than custom RAM technologies to take advantage of market economies and thereby reduce cost. Having smaller, independent memory partitions allows the memory subsystem to operate efficiently regardless of whether large or small blocks of data are transferred. All rendered surfaces are stored in the DRAMs, while textures and input data can be stored in the DRAMs or in system memory. The four independent memory partitions give the GPU a wide (256 bits), flexible memory subsystem, allowing for streaming of relatively small (32-byte) memory accesses at near the 35 GB/sec physical limit.”


3.3 Performance

• 425 MHz internal graphics clock
• 550 MHz memory clock
• 256 MB memory size
• 35.2 GByte/second memory bandwidth
• 600 million vertices/second
• 6.4 billion texels/second
• 12.8 billion pixels/second, rendering z/stencil-only (useful for shadow volumes and shadow buffers)
• 6 four-wide fp32 vector MADs per clock cycle in the vertex shader, plus one scalar multifunction operation (a complex math operation, such as a sine or reciprocal square root)
• 16 four-wide fp32 vector MADs per clock cycle in the fragment processor, plus 16 four-wide fp32 multiplies per clock cycle
• 64 pixels per clock cycle early z-cull (reject rate)
• 120+ Gflops peak (equal to six 5-GHz Pentium 4 processors)
• Up to 120 W energy consumption (the card has two additional power connectors; the power sources are recommended to be no less than 480 W)
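The quoted memory bandwidth can be reproduced from the 256-bit bus width and the 550 MHz memory clock listed above. The short calculation below assumes double-data-rate memory (two transfers per clock), which the list does not state explicitly; the variable names are illustrative.

```python
bus_width_bits = 256          # from the memory description (4 partitions x 64 bits)
memory_clock_hz = 550e6       # from the list above
transfers_per_clock = 2       # assumption: double data rate (DDR) memory

bytes_per_transfer = bus_width_bits / 8
bandwidth_bytes_per_s = bytes_per_transfer * memory_clock_hz * transfers_per_clock
bandwidth_gb_per_s = bandwidth_bytes_per_s / 1e9   # about 35.2
```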


4 Computational Principles

Stream Processing:

Typical CPUs (the von Neumann architecture) suffer from memory bottlenecks. GPUs are very sensitive to such bottlenecks and therefore need a different architecture; they are essentially special purpose stream processors.

A stream is a set of data. In stream processors, every kernel (a small program) takes one or more streams as input and outputs one or more streams, while it executes its operations on every single element of the input streams.

In stream processors you can achieve several levels of parallelism:

• Instruction level parallelism: kernels perform hundreds of instructions on every stream element; you achieve parallelism by performing independent instructions in parallel
• Data level parallelism: kernels perform the same instructions on each stream element; you achieve parallelism by performing one instruction on many stream elements at a time
• Task level parallelism: have multiple stream processors divide the work from one kernel
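The stream and kernel model can be illustrated with a few lines of Python. This is only a conceptual sketch: `run_kernel` and the two example kernels are hypothetical names, and the sequential list comprehension stands in for hardware that would process many elements at once.

```python
def run_kernel(kernel, *streams):
    """Apply a kernel to every element of the input streams; return the output stream."""
    # Each element is processed independently of all others, which is
    # what makes data level parallelism possible on real hardware.
    return [kernel(*elements) for elements in zip(*streams)]

def scale_and_add(x, y):
    return 2.0 * x + y

def square(v):
    return v * v

stream_a = [1.0, 2.0, 3.0]
stream_b = [10.0, 20.0, 30.0]

# Kernels can be chained: the output stream of one is the input of the next.
intermediate = run_kernel(scale_and_add, stream_a, stream_b)
result = run_kernel(square, intermediate)
```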

Stream processors do not use caching the same way traditional processors do, since the input data sets are usually much larger than most caches and the data is barely reused. With GPUs, for example, the data is usually rendered and then discarded.

We know GPUs have to work with large amounts of data; the computations are simpler but they need to be fast and parallel, so it becomes clear that the stream processor architecture is very well suited for GPUs.

Continuing these ideas, GPUs employ the following strategies to increase output:

Pipelining: Pipelining describes the idea of breaking down a job into multiple components that each perform a single task. GPUs are pipelined, which means that instead of performing complete processing of a pixel before moving on to the next, you fill the pipeline like an assembly line where each component performs its task on the data before passing it to the next stage. So while processing a pixel may take multiple clock cycles, you still achieve an output of one pixel per clock since you fill up the whole pipe.
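The throughput argument can be checked with a small calculation. The sketch assumes an idealized pipeline in which every stage takes exactly one clock cycle and never stalls; the function names and the example numbers are illustrative.

```python
def unpipelined_cycles(items, stages):
    """Each item passes through all stages before the next one starts."""
    return items * stages

def pipelined_cycles(items, stages):
    """The first item needs `stages` cycles; after that one item completes per cycle."""
    return stages + items - 1

pixels, stages = 1000, 10
slow = unpipelined_cycles(pixels, stages)
fast = pipelined_cycles(pixels, stages)
cycles_per_pixel = fast / pixels      # approaches 1 as the number of pixels grows
```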

Parallelism: Due to the nature of the data (parallelism can be applied on a per-vertex or per-pixel basis) and the type of processing (highly repetitive), GPUs are very suitable for parallelism. You could have an unlimited amount of pipelines next to each other, as long as the CPU is able to keep them busy.

Other GPU characteristics:

• GPUs can afford large amounts of floating point computational power since they have lower control overhead
• They use dedicated functional units for specialized tasks to increase speeds
• GPU memory struggles with bandwidth limitations, and therefore aims for maximum bandwidth usage, employing strategies like data compression, multiple threads to cope with latency, scheduling of DRAM cycles to minimize idle data-bus time, etc.
• Caches are designed to support effective streaming with local reuse of data. Rather than implementing a cache that achieves 99% hit rates (which isn't feasible), GPU cache designs assume a 90% hit rate with many misses in flight
• GPUs have many different performance regimes, all with different characteristics, and need to be designed accordingly


4.1 The Geforce 6800 as a general processor

You can see the Geforce 6800 as a general processor with a lot of floating-point horsepower and high memory bandwidth that can be used for other applications as well.

Figure 20: A general view of the Geforce 6800 architecture

Looking at the GPU that way, we get:

• 2 serially running programmable blocks with fp32 capability.
• The rasterizer can be seen as a unit that expands the data into interpolated values (from one data "point" to multiple "fragments").
• With MRT (Multiple Render Targets), the fragment processor can output up to 16 scalar floating-point values at a time.
• Several possibilities to control the data flow by using the visibility checks of the pixel engines or the Z-cull unit


5 The next step: the Geforce 8800

After the Geforce 7 series, which was a continuation of the Geforce 6800 architecture, Nvidia introduced the Geforce 8800 in 2006. Driven by the desire to increase performance, improve image quality and facilitate programming, the Geforce 8800 presented a significant evolution of past designs: a unified shader architecture (note that ATI already used this architecture in 2005 with the Xbox 360 GPU).

Figure 21: From dedicated to unified architecture

Figure 22: A schematic view of the Geforce 8800

The unified shader architecture of the Geforce 8800 essentially boils down to the fact that all the different shader stages become one single stage that can handle all the different shaders.

As you can see in Figure 22, instead of different dedicated units we now have a single streaming processor array. Familiar units such as the raster operators (blue, at the bottom) and the triangle setup are still present. Besides these units, we now have several managing units that prepare and manage the data as it flows in the loop (vertex, geometry and pixel thread issue, input assembler and thread processor).


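Figure 23: The streaming processor array

The streaming processor array consists of 8 texture processor clusters. Each texture processor cluster in turn consists of 2 streaming multiprocessors and 1 texture pipe.

A streaming multiprocessor consists of 8 streaming processors. The streaming processors work with 32-bit scalar data, based on the idea that shader programs are becoming more and more scalar, making a vector architecture more inefficient. They are driven by a high-speed clock that is separate from the core clock. Each multiprocessor can have 768 hardware scheduled threads, grouped together into 24 SIMD "warps" (a warp is a group of threads).

The texture pipe consists of 4 texture addressing and 8 texture filtering units. It performs texture prefetching and filtering without consuming ALU resources, further increasing efficiency.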

The unified design has clear advantages. For example, the old problem of a constantly changing workload, with one shader stage becoming a processing bottleneck, is solved since the units can adapt dynamically now that they are unified.

Figure 24: Workload balancing with both architectures

With a single instruction set and the support of fp32 throughout the whole pipeline, as well as the support of new data types (integer calculations), programming the GPU now becomes easier as well.


6 General Purpose Programming on the GPU - an example

We use the bitonic merge sort algorithm as an example for efficiently implementing algorithms on a GPU.

Bitonic merge sort:

Bitonic merge sort works by repeatedly building bitonic lists out of a set of elements and sorting them. A bitonic list is a concatenation of two monotonic lists, one ascending and one descending.

E.g.:

List A = (3, 4, 7, 8)
List B = (6, 5, 2, 1)
List AB = (3, 4, 7, 8, 6, 5, 2, 1) is a bitonic list
List BA = (6, 5, 2, 1, 3, 4, 7, 8) is also a bitonic list

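Bitonic lists have a special property that is used in bitonic merge sort: suppose you have a bitonic list of length 2n. You can rearrange the list so that you get two halves with n elements, where each element (i) of the first half is less than or equal to each corresponding element (i+n) in the second half (or greater than or equal, if the list descends first and then ascends). This happens by comparing each element (i) with its partner (i+n) and swapping the two if they are in the wrong order. This procedure is called a bitonic merge.

Bitonic merge sort works by recursively creating and merging bitonic lists that increase in their size until the whole list is sorted. Figure 25 illustrates the process: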

Figure 25: The different stages of the algorithm
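For reference, a sequential CPU version of the algorithm can be written in a few lines of Python. This is a plain recursive sketch for lists whose length is a power of two, not the GPU implementation discussed below; the function names are chosen for illustration.

```python
def bitonic_merge(items, ascending=True):
    """Sort a bitonic list by comparing each element i with its partner i + n."""
    n = len(items)
    if n <= 1:
        return list(items)
    half = n // 2
    items = list(items)
    for i in range(half):
        # Swap if the pair is in the wrong order for the requested direction.
        if (items[i] > items[i + half]) == ascending:
            items[i], items[i + half] = items[i + half], items[i]
    # Both halves are bitonic again and can be merged independently.
    return (bitonic_merge(items[:half], ascending)
            + bitonic_merge(items[half:], ascending))

def bitonic_sort(items, ascending=True):
    """Sort a list whose length is a power of two."""
    n = len(items)
    if n <= 1:
        return list(items)
    half = n // 2
    first = bitonic_sort(items[:half], True)     # ascending half
    second = bitonic_sort(items[half:], False)   # descending half
    return bitonic_merge(first + second, ascending)

merged = bitonic_merge([3, 4, 7, 8, 6, 5, 2, 1])   # the bitonic list AB from above
sorted_list = bitonic_sort([5, 1, 4, 8, 2, 7, 3, 6])
```

Note that all comparisons inside one loop of `bitonic_merge` are independent of each other, which is the property that makes the algorithm attractive for parallel hardware.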

The sorting process [...]. This results in bitonic merge sort having a complexity of $O(n \log^2(n) + \log(n))$, which is worse than quicksort, but the algorithm has no worst-case scenario (where quicksort reaches $O(n^2)$).

[...] the operations can be performed in parallel and the length stays constant [...]. When implementing this algorithm on the GPU, we want to make use of as many resources as possible (both in parallel as well as vertically along the pipeline), especially considering that the GPU has shortcomings as well, such as the lack of support [...]. For example, simply letting the fragment processor stage handle all the calculations might work, [...]. A possible solution looks like this:

In this algorithm, we have groups of elements (fragments) that have the same sorting conditions, while [...]. We draw a vertex quad over two adjacent groups and set appropriate flags at each corner [...]. For example, if we set the left corners to -1 and the right corners to +1, we can check where a fragment belongs by simply looking at its sign (the interpolation process takes care of that).

Figure 26: Using vertex flags and the interpolator to determine compare operations

Next, we need to determine which compare operation to use and we need to locate the partner item to compare with. Both can again be accomplished by using the flag. Setting the compare operation to less-than and multiplying with the flag value implicitly flips the operation to greater-equal halfway across the quad. Locating the partner item happens by multiplying the sign of the flag with an absolute value that specifies the distance between the items.
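The flag trick can be simulated on the CPU to see how one compare pass over a quad behaves. The sketch below is an illustration under stated assumptions: the flag is set to +1 at the left corners and -1 at the right corners (so that the left half searches to the right and vice versa), interpolation is evaluated at pixel centers, and the names `interpolated_flag` and `merge_pass` are hypothetical.

```python
def interpolated_flag(x, start, width, left, right):
    """Linearly interpolate a corner flag across the quad at pixel center x."""
    t = (x + 0.5 - start) / width
    return left + (right - left) * t

def merge_pass(data, start, width):
    """One compare pass over a quad covering data[start : start + width]."""
    distance = width // 2                 # absolute distance to the partner item
    out = list(data)
    for x in range(start, start + width):
        flag = interpolated_flag(x, start, width, +1.0, -1.0)
        sign = 1 if flag > 0.0 else -1
        partner = data[x + sign * distance]
        # Multiplying both keys by the sign turns "less-than" into
        # "greater-than" in the second half of the quad.
        keep_self = data[x] * sign < partner * sign
        out[x] = data[x] if keep_self else partner
    return out

after_pass = merge_pass([3, 4, 7, 8, 6, 5, 2, 1], 0, 8)
```

Every output element is computed from two reads of the input only, which mirrors how a fragment program reads its own texel and the partner texel and writes a single result.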

In order to sort elements of an array, we store them in a 2D texture. Each row is a [...]. If we extend the quads over the rows of the 2D texture and use the interpolation, we can modulate the comparison so the [...]. This way, pairs of rows become bitonic sequences again, which can be sorted in the same way we sorted the columns of the single rows, simply by transposing the quads.

As a final optimization, we reduce texture fetching by packing two neighbouring key pairs into one fragment, since the shader operates on 4-vectors.

Performance comparison:

std::sort: 16-Bit Data, Pentium 4 3.0 GHz

N         Full Sorts/Sec   Sorted Keys/Sec
256^2     82.5             5.4 M
512^2     20.6             5.4 M
1024^2    4.7              5.0 M

Bitonic Merge Sort: 16-Bit Float Data, NVIDIA Geforce 6800 Ultra

N         Passes   Full Sorts/Sec   Sorted Keys/Sec
256^2     120      90.07            6.1 M
512^2     153      18.3             4.8 M
1024^2    190      3.6              3.8 M


GLSL (OpenGL Shading Language) code sample, implementing the combined passes 0 and 1 for row-wise sorting of the bitonic merge sort:

uniform sampler2D PackedData;

// contents of the texcoord data
#define OwnPos          gl_TexCoord[0].xy
#define SearchDir       gl_TexCoord[0].z
#define CompOp          gl_TexCoord[0].w
#define Distance        gl_TexCoord[1].x
#define Stride          gl_TexCoord[1].y
#define Height          gl_TexCoord[1].z
#define HalfStrideMHalf gl_TexCoord[1].w

void main(void)
{
    // get self
    vec4 self = texture2D(PackedData, OwnPos);

    // restore sign of search direction and assemble vector to partner
    vec2 adr = vec2((SearchDir < 0.0) ? -Distance : Distance, 0.0);

    // get the partner
    vec4 partner = texture2D(PackedData, OwnPos + adr);

    // switch ascending/descending sort for every other row
    // by modifying comparison flag
    float compare = CompOp * -(mod(floor(gl_TexCoord[0].y * Height), Stride) - HalfStrideMHalf);

    // x and y are the keys of the two items
    // --> multiply with comparison flag
    vec4 keys = compare * vec4(self.x, self.y, partner.x, partner.y);

    // compare the keys and store accordingly
    // z and w are the indices
    // --> just copy them accordingly
    // NOTE: the right-hand sides of the next three assignments are truncated
    // in the source text; they are reconstructed here from the comments and
    // should be read as an assumption.
    vec4 result;
    result.xz = (keys.x < keys.z) ? self.xz : partner.xz;
    result.yw = (keys.y < keys.w) ? self.yw : partner.yw;

    // do pass 0
    compare *= adr.x;
    gl_FragColor = (result.x * compare < result.y * compare) ? result : result.yxwz;
}


7 Current and future developments

Nvidia's current top of the line model of graphics cards is the Geforce GTX 280 (GTX 200 series), an evolution of the unified shader architecture of the Geforce 8800, sporting almost double the shader count (from 128 to 240). Its launch date was the 17th of June 2008.

ATI (now merged into AMD) followed in 2007 with its first unified shader GPU for the PC (Radeon HD 2900 XT) [...]. Its current top of the line model is the Radeon HD 3870 X2 (which actually sports 2 GPUs on one card), which was released in January 2008.

ATI/AMD are soon to follow up with an answer to Nvidia's GTX 280: the Radeon HD 4870 (slated for somewhere around July 2008).

With the advent of the unified shader architecture, the topic of general-purpose computing on a GPU has become increasingly relevant. GPUs have made their way into non-graphics fields as varied as audio signal processing and weather forecasting.

Both ATI/AMD and Nvidia have made efforts [...]. ATI/AMD released CTM (Close To Metal), [...]. After rewriting the software, CTM's commercial successor, the AMD Stream SDK, was released in December 2007, now providing additional high level tools for general-purpose access to AMD graphics hardware.

Nvidia initially released the CUDA (Compute Unified Device Architecture) SDK in February 2007, providing a C-language programming environment for its GPUs. In February 2008, Nvidia bought Ageia and their PhysX engine (a proprietary realtime physics engine middleware SDK) and integrated it into their CUDA framework.

[...] actually scale well beyond Moore's law [...]. With such a rapid development, we can certainly expect to see quite some interesting things to come in this field of processing.

References

[1] Wikipedia entry on GPUs
/wiki/GPU

[2] Kees Huizing, Han-Wei Shen: “The Graphics Rendering Pipeline”
/~keesh/ow/2IV40/

[3] Cyril Zeller: “Introduction to the Hardware Graphics Pipeline”, Presentation at ACM SIGGRAPH 2005
/developer/presentations/2005/I3D/I3D_05_

[4] ExtremeTech 3D Pipeline Tutorial
/article2/0,1697,9722,

[5] Ashu Rege: “Introduction to 3D Graphics for Games”
/docs/IO/11278/

[6] DirectX Developer Center: “The Direct3D Transformation Pipeline”
/en-us/library/bb206260(VS.85).aspx

[7] Mark Colbert: “GPU Architecture & CG”
/gpuseminar/

[8] GPU Gems 2, Chapter 30: “The GeForce 6 Series GPU Architecture”
/developer/GPU_Gems_2/GPU_Gems2_

[9] IEEE Micro, Volume 25, Issue 2 (March 2005): “The GeForce 6800”
/?id=1069760

[10] : “NV40-Technik im Detail”
/artikel/nv40_pipeline/

[11] : “NVIDIA GeForce 6800 Ultra (NV40)”
/articles2/gffx/

[12] Austin Robison, Abe Winter: “An Overview of Graphics Processing Hardware”
/~robison/src/gpu_

[13] John Montrym, Henry Moreton: “NVIDIA GeForce 6800”, Hot Chips 16
/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_

[14] Ajit Datar, Apurva Padhye: “Graphics Processing Unit Architecture”
/~data0003/Talks/

[15] Sven Schenk: “Eine Einfuehrung in die Architektur moderner Graphikprozessoren”
/Lehre/Seminar0506/

[16] Thomas Scott Crow: “Evolution of the Graphical Processing Unit”
/~fredh/papers/thesis/023-crow/

[17] DirectX Developer Center: “Asm Shader Reference”
/en-us/library/bb219840(VS.85).aspx

[18] Erik Lindholm, Stuart Oberman: “NVIDIA GeForce 8800 GPU”
/archives/hc19/2_Mon/HC19.02/

[19] : “Say Hello To DirectX 10, Or 128 ALUs In Action: NVIDIA GeForce 8800 GTX (G80)”
/articles2/video/

[20] Richard Hough, Richard Yu: “GPU Architecture”
/courses/ece685/slides/

[21] Technical Brief: “NVIDIA GeForce 8800 GPU Architecture Overview”
/object/IO_

[22] GPU Gems 2, Chapter 46: “Improved GPU Sorting”

[23] Tim Purcell: “Sorting and Searching”, SIGGRAPH 2005 GPGPU Course
/s2005/slides/

[24] Peter Kipfer, Mark Segal, Ruediger Westermann: “UberFlow: A GPU-Based Particle Engine”
/previous/www_2004/Presentations/

[25] Wikipedia entry on Nvidia
/wiki/Nvidia_Corporation

[26] Wikipedia entry on ATI
/wiki/ATI_Technologies_Inc.

[27] Wikipedia entry on CUDA
/wiki/CUDA

[28] Wikipedia entry on CTM
/wiki/Close_to_Metal

[29] William Mark, Henry Moreton: “3D Graphics Architecture Tutorial”
/users/billmark/talks/Graphics_Arch_Tutorial_Micro2004_
