Superscalar Pipelines

Uploaded by: 卓***. Upload date: 2022-09-23. Format: PPT, 89 pages.


Superscalar Pipelines
(Slide 1 of 89; May 20, 2022, 17:37, Wednesday)

Pipelining to Superscalar
Forecast:
- Limits of pipelining
- The case for superscalar
- Instruction-level parallel machines
- Superscalar pipeline organization
- Superscalar pipeline design

Limits of Pipelining
- IBM RISC experience (p. 91; Tilak Agerwala and John Cocke, 1987) (a fundamental limit)
  - Control and data dependences add 15%
  - Best-case CPI of 1.15, IPC of 0.87
  - Deeper pipelines (higher frequency) magnify dependence penalties
- This analysis assumes 100% cache hit rates (a memory-system limit)
  - Hit rates approach 100% for some programs
  - Many important programs have much worse hit rates
  - Later!

Processor Performance (p. 17)
- In the 1980s (decade of pipelining): CPI: 5.0 => 1.15
- In the 1990s (decade of superscalar): CPI: 1.15 => 0.5 (best case)

Processor Performance = Time / Program
  = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
  = (code size) x (CPI) x (cycle time)
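The performance equation above can be worked through numerically. The following is a minimal sketch, not part of the original slides: the instruction count and cycle time are made-up values, and only the CPI figures (5.0, 1.15, 0.5) come from the slide.

```python
def execution_time(instructions: int, cpi: float, cycle_time_ns: float) -> float:
    """Time per program = instructions x CPI x cycle time (result in ns)."""
    return instructions * cpi * cycle_time_ns


# Hypothetical workload: these two values are assumptions for illustration.
N_INSTR = 1_000_000
CYCLE_NS = 10.0

t_unpipelined = execution_time(N_INSTR, 5.0, CYCLE_NS)   # CPI before pipelining
t_pipelined = execution_time(N_INSTR, 1.15, CYCLE_NS)    # pipelined scalar CPI
t_superscalar = execution_time(N_INSTR, 0.5, CYCLE_NS)   # best-case superscalar CPI

print(f"pipelining speedup:  {t_unpipelined / t_pipelined:.2f}x")
print(f"superscalar speedup: {t_pipelined / t_superscalar:.2f}x")
print(f"IPC at CPI 1.15:     {1 / 1.15:.2f}")
```

With code size and cycle time held fixed, the speedup is simply the ratio of the CPI values, which is why the two decades are summarized by their CPI change alone.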

Amdahl's Law (p. 18)
- h = fraction of time in serial code
- f = fraction that is vectorizable
- v = speedup for f
- Overall speedup: Speedup = 1 / ((1 - f) + f / v)
[Figure: number of processors (1 to N) versus time, with the fractions h, 1 - h, f, and 1 - f marked]

Revisit Amdahl's Law
- Sequential bottleneck
- Even if v is infinite, performance is limited by the nonvectorizable portion (1 - f): Speedup approaches 1 / (1 - f)
[Figure: same processors-versus-time diagram]

Pipelined Performance Model (Harold Stone, 1987; p. 19)
- g = fraction of time the pipeline is filled
- 1 - g = fraction of time the pipeline is not filled (stalled)
- Three phases:
  1. Fill: N instructions enter the pipeline.
  2. Full: the pipeline is full. Assuming no stalls caused by pipeline interference, this is the pipeline's best performance.
  3. Drain: no new instructions enter the pipeline, and the instructions already in it complete execution.

Pipelined Performance Model
- Tyranny of Amdahl's Law [Bob Colwell]
  - When g is even slightly below 100%, a big performance hit results
  - Stalled cycles are the key adversary and must be minimized as much as possible
[Figure: pipeline depth (1 to N) versus time, with the fractions 1 - g and g marked]
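To make the "tyranny" concrete, here is a small sketch of both models. It assumes the usual Amdahl form for the pipelined case, Speedup = 1 / ((1 - g) + g / N), with N the pipeline depth; the slide's own equation did not survive extraction, so treat that form and the function names as assumptions.

```python
def amdahl_speedup(f: float, v: float) -> float:
    """Amdahl's law: fraction f is sped up by a factor v, the rest is not."""
    return 1.0 / ((1.0 - f) + f / v)


def pipeline_speedup(g: float, depth: int) -> float:
    """Assumed pipelined form: the pipeline is full a fraction g of the time
    (speedup = depth) and stalled the remaining 1 - g (speedup = 1)."""
    return 1.0 / ((1.0 - g) + g / depth)


# Even with unbounded vector speedup, 20% serial code caps the gain near 5x.
print(f"f=0.8, v=inf: {amdahl_speedup(0.8, float('inf')):.2f}x")

# A 10-deep pipeline loses performance quickly as g drops below 100%.
for g in (1.00, 0.99, 0.95, 0.90):
    print(f"g={g:.2f}: {pipeline_speedup(g, 10):.2f}x")
```

In this model a 10-stage pipeline that is full only 90% of the time delivers roughly half of its ideal speedup, which is the point of the slide: stall cycles dominate.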

Motivation for Superscalar [Agerwala and Cocke] (p. 23)
- Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, but s = 2 instead of s = 1 (scalar)
[Figure: speedup versus f, with the typical range marked]

Superscalar Proposal
- Moderate the tyranny of Amdahl's Law
  - Ease the sequential bottleneck
  - More generally applicable
  - Robust (less sensitive to f)
- Revised Amdahl's Law: Speedup = 1 / ((1 - f) / s + f / N)

Limits on Instruction Level Parallelism (ILP)

Study                 Year   ILP
Weiss and Smith       1984   1.58
Sohi and Vajapeyam    1987   1.81
Tjaden and Flynn      1970   1.86 (Flynn's bottleneck)
Tjaden and Flynn      1973   1.96
Uht                   1986   2.00
Smith et al.          1989   2.00
Jouppi and Wall       1988   2.40
Johnson               1991   2.50
Acosta et al.         1986   2.79
Wedig                 1982   3.00
Butler et al.         1991   5.8
Melvin and Patt       1991   6
Wall                  1991   7 (Jouppi disagreed)
Kuck et al.           1972   8
Riseman and Foster    1972   51 (no control dependences)
Nicolau and Fisher    1984   90 (Fisher's optimism)

- Variance due to: benchmarks, machine models, cache latency and hit rate, compilers, religious bias, general-purpose vs. special-purpose/scientific workloads, C vs. Fortran
- Not monotonic with time

Superscalar Proposal
- Go beyond the single-instruction pipeline, achieve IPC > 1
- Dispatch multiple instructions per cycle
- Provide a more generally applicable form of concurrency (not just vectors)
- Geared for sequential code that is hard to parallelize otherwise
- Exploit fine-grained or instruction-level parallelism (ILP)
  - Not a degree of parallelism of 100 or 1000, but 2 to 4
  - Fine-grained vs. medium-grained (loop iterations) vs. coarse-grained (threads)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Definitions: IP = maximum instructions issued per cycle; OP (operation latency) = number of cycles until the result is available; issuing latency.

- Baseline scalar RISC
  - Issue parallelism = IP = 1
  - Operation latency = OP = 1
  - Peak IPC = 1
- Superpipelined: cycle time = 1/m of baseline
  - Issue parallelism = IP = 1 instruction / minor cycle
  - Operation latency = OP = m minor cycles
  - Peak IPC = m instructions / major cycle (m x speedup?)
- Superscalar
  - Issue parallelism = IP = n instructions / cycle
  - Operation latency = OP = 1 cycle
  - Peak IPC = n instructions / cycle (n x speedup?)
- VLIW: Very Long Instruction Word
  - Issue parallelism = IP = n instructions / cycle
  - Operation latency = OP = 1 cycle
  - Peak IPC = n instructions / cycle = 1 VLIW / cycle
  - Characteristics: parallelism packaged by the compiler, hazards managed by the compiler, no (or few) hardware interlocks, low code density (NOPs), clean and regular hardware
- Superpipelined-Superscalar
  - Issue parallelism = IP = n instructions / minor cycle
  - Operation latency = OP = m minor cycles
  - Peak IPC = n x m instructions / major cycle

Superscalar vs. Superpipelined
- Roughly equivalent performance
  - If n = m then both have about the same IPC
  - Parallelism exposed in space vs. time

Superpipelining: Result Latency
[Figure]
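Two of the calculations in the preceding slides are easy to check by hand. The sketch below assumes the revised Amdahl form Speedup = 1 / ((1 - f) / s + f / N), which reproduces the quoted jump from 3 to 4.3, and reduces Jouppi's classification to peak IPC = n x m. The function names and the sample values of n and m are illustrative assumptions.

```python
def revised_amdahl_speedup(f: float, n: int, s: float) -> float:
    """Assumed revised form: the non-parallel part runs at speedup s
    (s = 1 for a scalar pipeline), the parallel fraction f at speedup n."""
    return 1.0 / ((1.0 - f) / s + f / n)


def peak_ipc(n: int = 1, m: int = 1) -> int:
    """Peak instructions per major (baseline) cycle for issue width n
    and superpipelining degree m."""
    return n * m


# The slide's example: N = 6, f = 0.8, s = 1 versus s = 2.
print(round(revised_amdahl_speedup(0.8, 6, 1), 1))  # scalar baseline
print(round(revised_amdahl_speedup(0.8, 6, 2), 1))  # 2-wide on serial code

# Hypothetical machine parameters for the four classes.
machines = {
    "baseline scalar": peak_ipc(),
    "superpipelined (m=3)": peak_ipc(m=3),
    "superscalar (n=3)": peak_ipc(n=3),
    "superpipelined-superscalar (n=3, m=3)": peak_ipc(n=3, m=3),
}
for name, ipc in machines.items():
    print(f"{name}: peak IPC = {ipc}")
```

These are peak figures only; the "(n x speedup?)" question marks on the slides are a reminder that dependences and stalls keep sustained IPC well below the peak.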

Superscalar Challenges
[Figure]

Limitations of Scalar Pipelines
- Scalar upper bound on throughput
  - IPC <= 1
- Inefficient unified pipeline
  - Long latency for each instruction
- Rigid pipeline stall policy
  - One stalled instruction stalls all newer instructions

Parallel Pipelines
- Temporal vs. spatial vs. both

Intel Pentium Parallel Pipeline
- 486 pipeline on the left: 2 decode stages due to the complex ISA
- Pentium parallel pipeline: the U pipe is universal (can handle any op); the V pipe can't handle the most complex ops
- Stages: fetch and align; decode and generate control word; decode control word and generate memory address; ALU or D$
- Used branch prediction

Diversified Pipelines
- Unified pipelines are inefficient and unnecessary. In a scalar organization they make sense.
- With multiple issue, specialized pipelines make much more sense.
- Note that all instructions are treated identically in IF, ID (also RD, more or less), and WB. Why? Because they behave very much the same way.

Power4 Diversified Pipelines
[Figure: block diagram with PC, I-Cache, BR Scan, BR Predict, Fetch Q, Decode, Reorder Buffer; BR/CR Issue Q feeding the CR Unit and BR Unit; FX/LD 1 Issue Q feeding the FX1 Unit and LD1 Unit; FX/LD 2 Issue Q feeding the LD2 Unit and FX2 Unit; FP Issue Q feeding the FP1 Unit and FP2 Unit; StQ; D-Cache]
- IBM Power4, introduced in 2001. Leadership performance at 1.3 GHz.

Rigid Pipeline Stall Policy
- Backward propagation of stalling
- Bypassing of the stalled instruction is not allowed
[Figure: a stalled instruction blocking the instructions behind it]

Dynamic Pipelines
- In-order front end, dynamic execution in a micro-dataflow machine, in-order back end
- Interlock hardware (later) maintains dependences
- Reorder buffer tracks completion and exceptions, and provides precise interrupts: drain pipeline, restart
- In-order machine state follows the sequential execution model inherited from nonpipelined/pipelined machines

Interstage Buffers
- Scalar pipe: just pipeline latches or flip-flops
- In-order superscalar pipe: just wider ones
- Out-of-order: start to look more like register files, with random access necessary, or shift registers
  - May require an effective crossbar between slots before/after the buffer
  - May need to be a multiported CAM

Superscalar Pipeline Stages
[Figure: front-end stages in program order, execution out of order, back-end stages in program order]

Limitations of Scalar Pipelines
- Scalar upper bound on throughput
  - IPC <= 1
  - Solution: wide (superscalar) pipeline
- Inefficient unified pipeline
  - Long latency for each instruction
  - Solution: diversified, specialized pipelines
- Rigid pipeline stall policy
  - One stalled instruction stalls all newer instructions
  - Solution: out-of-order execution, distributed execution pipelines

Several Typical Superscalar Processors
- In the early 1990s, superscalar processors started out as dual-issue designs. They fetch, decode, issue, execute, and write back multiple instructions in the same clock cycle.
- The first commercially successful superscalar microprocessor, the Intel i960 RISC processor, came to market in 1990.
- First-generation dual-issue superscalar processors include the RISC designs Motorola 88110, Alpha 21064, and HP PA-7100, as well as the Intel Pentium.

Several Typical Superscalar Processors
Mid-1990s, 4-issue and 6-issue designs:
- IBM POWER2 RISC System/6000 processor; PowerPC 601, 603, 604, 750 (G3), 620; IBM POWER
- DEC Alpha 21164, Alpha 21264
- Sun UltraSPARC, UltraSPARC-II, IIi, III
- HP PA-8000, PA-8500
- MIPS R10000, MIPS R12000

Several Typical Superscalar Processors
- Intel, the dominant superscalar microprocessor vendor, produces the x86 ISA family:
  - the dual-issue Pentium of 1993
  - Pentium Pro, Pentium II, and their successors Celeron, Pentium III, and Pentium 4
- Intel microprocessors are considered CISC microprocessors because of their ISA.
- Other companies designed Intel-compatible processors, such as AMD's K5, K6, K6-2, and K6-3, and Cyrix's 6x86, M II, and MXi.
- CISC microprocessors have additional pipeline stages that generate so-called RISC86 operations, or micro-operations, from x86 instructions, so their pipelines are more complex than those of superscalar RISC processors.

Components of a Superscalar Processor
[Figure: block diagram with I-cache and D-cache (each with an MMU), Bus Interface Unit with 32 (64)-bit data bus, 32 (64)-bit address bus, and control bus; Instruction Fetch Unit with Branch Unit, BTAC, and BHT; Instruction Buffer; Instruction Decode and Register Rename Unit; Instruction Issue Unit; Reorder Buffer; Retire Unit; Load/Store Unit, Integer Unit(s), Floating-Point Unit(s); Rename Registers, General Purpose Registers, Floating-Point Registers]

Components of a Superscalar Processor
A superscalar RISC microprocessor is usually a load/store architecture with 32-bit fixed-length instructions. The processor contains the following units:
- Instruction fetch unit (including the branch unit)
- Decode unit
- Register rename unit
- Issue unit
- Several independent execution functional units (FUs)
- Retire unit
- 32 general-purpose registers, 32 floating-point registers, and additional physical rename registers
- Bus interface; the external memory bus connects to the second-level cache
- Instruction cache
- Data cache
- Additional internal buffers (such as the instruction buffer and the reorder buffer)

Functional Units
- Load/store unit
- Floating-point unit
- Integer unit
- Multimedia unit
- Branch unit
The types and number of functional units depend on the specific processor.

Superscalar Pipeline Design
- Instruction fetching issues
- Instruction decoding issues
- Instruction dispatching issues
- Instruction execution issues
- Instruction completion and retiring issues

Instruction Flow
- Objective: fetch multiple instructions per cycle
  - Don't starve the pipeline: n/cycle
  - Must fetch n/cycle from IF
- Challenges:
  - Branches: control dependences
  - Branch target misalignment
  - Instruction cache misses
- Solutions:
  - Code alignment (static vs. dynamic)
  - Prediction/speculation
[Figure: the PC indexes instruction memory; 3 instructions fetched]

I-Cache Access and Instruction Fetch
- Harvard architecture: separate instruction and data memory and access paths
- The I-cache is less complicated to control than the D-cache: because it is read-only, it is not subject to cache coherence in the way the D-cache is
  - Of the MESI protocol states, only Shared and Invalid are needed
- Sometimes instructions are predecoded on their way from the memory interface to the I-cache to simplify the decode stage (PowerPC 620)

Instruction Fetch (1)

- The main problem for the instruction fetch unit is handling control-transfer instructions such as jump, branch, call, return, and interrupt
  - They interrupt the sequential fetch stream
  - The interruption can fall in the middle of a fetch block or right at its end; every instruction after the break point must be discarded
- Wallace and Bagherzadeh showed that in an 8-issue superscalar machine, simple fetch hardware delivers no more than 4 valid instructions per cycle (SPECint95)
- If the PC does not point to the start of a cache line, fewer instructions than the fetch width are returned to the decode unit
- If the fetch group contains a (taken) branch, the instructions after the branch are automatically invalidated

Instruction Fetch (2)
- Fetching multiple cache lines from different locations may be needed in future (how wide should fetch be?):
  - very wide-issue processors, where a single contiguous fetch block will often contain more than one branch
  - eager execution of both sides of a branch
  - multithreaded processors
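The two fetch losses described under Instruction Fetch (1), a misaligned start address and a taken branch inside the fetch group, can be illustrated with a small counting model. This is a sketch under simplifying assumptions that are not from the slides: addresses are instruction indices, one fetch reads at most one cache line, and at most one taken branch is considered.

```python
from typing import Optional


def valid_instructions(pc: int, fetch_width: int, line_size: int,
                       taken_branch_at: Optional[int] = None) -> int:
    """Number of useful instructions delivered by one simple fetch.

    pc and taken_branch_at are instruction indices (not byte addresses);
    line_size is the number of instructions per cache line.
    """
    next_line = (pc // line_size + 1) * line_size   # first index of the next line
    end = min(pc + fetch_width, next_line)          # exclusive upper bound
    if taken_branch_at is not None and pc <= taken_branch_at < end:
        end = taken_branch_at + 1                   # the branch itself is still valid
    return end - pc


# 8-wide fetch, 8 instructions per line (hypothetical parameters).
print(valid_instructions(pc=0, fetch_width=8, line_size=8))                     # aligned
print(valid_instructions(pc=5, fetch_width=8, line_size=8))                     # misaligned
print(valid_instructions(pc=0, fetch_width=8, line_size=8, taken_branch_at=2))  # taken branch
```

Averaging this count over a realistic mix of branch targets and branch positions is what pushes the delivered width of a simple 8-wide fetch unit well below 8.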

Instruction Fetch (3)
- Another problem: the address of the target instruction may not be aligned with a cache line boundary (where to fetch from?)
- Hardware solution: a self-aligned instruction cache
  - Reads two adjacent cache lines in one cycle
  - Ensures the fetch bandwidth can be met
- Implementation: either a dual-ported I-cache, two separate cache accesses in a single cycle, or a two-banked I-cache (preferred)
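A two-banked, self-aligned I-cache can be sketched as follows. The layout is an assumption for illustration: even-numbered lines live in bank 0 and odd-numbered lines in bank 1, so any two adjacent lines can be read in the same cycle. The line size and the instruction names are made up.

```python
LINE_SIZE = 4  # instructions per cache line (hypothetical)


def build_banks(program):
    """Split a program into cache lines and interleave them over two banks."""
    lines = [program[i:i + LINE_SIZE] for i in range(0, len(program), LINE_SIZE)]
    return lines[0::2], lines[1::2]   # bank 0: even lines, bank 1: odd lines


def read_line(banks, line_no):
    """Read one line from the bank that holds it (empty past the end)."""
    bank = banks[line_no % 2]
    index = line_no // 2
    return bank[index] if index < len(bank) else []


def self_aligned_fetch(banks, pc, width=LINE_SIZE):
    """Fetch `width` instructions starting at pc, wherever pc falls in a line.

    The line holding pc and its successor sit in different banks, so both
    are read together and the wanted window is selected from the pair.
    """
    line_no, offset = divmod(pc, LINE_SIZE)
    pair = read_line(banks, line_no) + read_line(banks, line_no + 1)
    return pair[offset:offset + width]


program = [f"i{k}" for k in range(16)]
banks = build_banks(program)
print(self_aligned_fetch(banks, 0))   # aligned start
print(self_aligned_fetch(banks, 6))   # starts mid-line, spans two lines
```

As long as the fetch width does not exceed the line size, the window never needs more than two adjacent lines, which is why two banks are enough.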

Prefetching and Instruction Fetch Prediction
- Prefetching improves instruction fetch performance, but fetching is still limited because instructions after a control transfer must be invalidated
- Instruction fetch prediction helps to determine the next instructions to be fetched from the memory subsystem
- Instruction fetch prediction is applied in conjunction with branch prediction
- Open questions: a new prediction-based I-cache replacement algorithm? An analysis of the main-memory address stream produced by I-cache accesses?

I-Cache Organization
[Figure: two organizations, each with an address feeding a row decoder, tag arrays, and cache lines: 1 cache line = 1 physical row, and 1 cache line = 2 physical rows]
Two factors that prevent fetching the maximum number of instructions per cycle:
- Fetch alignment
- The presence of control-flow-changing instructions in the fetch group

Fetch Alignment
- Fetch size n = 4: fetch bandwidth is lost if the fetch group is not aligned
[Figure]

Solution for Fetch Misalignment Problem
- Static/compiler: align branch targets at 00 (may need to pad with NOPs); implementation specific
- Using hardware at run time

RIOS-I Fetch Hardware (1)
- 1989 design used in the first IBM RS/6000 (POWER or RIOS-I)
- 4-wide machine with Int, FP, BR, CR (typically 2 or fewer issue)
- 2-way set-associative; the 64 B line spans 4 physical rows, with each instruction word interleaved
- Say fetch i is B10, i+1 is B11, i+2 is B12, i+3 is B13
- T-logic detects misalignment and chooses the appropriate index
[Figure]
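The row selection performed by the T-logic can be sketched in a few lines. This is a simplified model, not the actual RIOS-I circuit: it assumes four interleaved subarrays, with instruction address a stored in subarray a mod 4 at row a div 4, and it assumes that a fetch cannot cross into the next cache line. All names are illustrative.

```python
SUBARRAYS = 4       # one instruction word per subarray per physical row
ROWS_PER_LINE = 4   # a 64-byte line = 16 instructions = 4 physical rows


def t_logic_rows(start: int) -> list:
    """Row index each subarray reads for a 4-wide fetch starting at `start`.

    Subarrays to the left of the starting column hold the wrapped-around
    instructions, so they read the next row.
    """
    row, col = divmod(start, SUBARRAYS)
    return [row + 1 if sub < col else row for sub in range(SUBARRAYS)]


def fetch_group(start: int) -> list:
    """Instruction addresses delivered by one access, in program order."""
    row, col = divmod(start, SUBARRAYS)
    line = row // ROWS_PER_LINE
    rows = t_logic_rows(start)
    group = []
    for k in range(SUBARRAYS):
        sub = (col + k) % SUBARRAYS           # subarray holding the k-th instruction
        if rows[sub] // ROWS_PER_LINE != line:
            break                             # would cross into the next cache line
        group.append(rows[sub] * SUBARRAYS + sub)
    return group


print(t_logic_rows(6), fetch_group(6))     # misaligned within a line: full group
print(t_logic_rows(14), fetch_group(14))   # last row of the line: short group
```

In this model a misaligned start costs bandwidth only when it falls in the last row of a line, since every other misalignment is absorbed by reading the next row in the left-hand subarrays.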

RIOS-I Fetch Hardware (2)
- The I-buffer network rotates instructions so they leave in program order
- "Interleaved sequential" improves on this by interleaving the tag array; it allows combining ops from two cache lines. If both hit, 4 instructions can be fetched every cycle.

Issues in Decoding
- Primary tasks
  - Identify individual instructions (!)
  - Determine instruction types
  - Determine dependences between instructions
- Two important factors
  - Instruction set architecture
  - Pipeline width
- RISC vs. CISC
  - RISC: fixed length, regular format, easier
  - CISC: can take multiple stages (lots of work); P6: I$ => decode is 5 cycles; often translates into internal RISC-like uops or ROPs

Decode Stage
- Superscalar processor: an in-order issue front end, an out-of-order core, and an in-order retirement unit
- Instruction delivery: the fetch and decode stages have higher bandwidth than the execute stage
- Delivery task: keep the instruction window full at all times
  - The deeper the prefetch, the more instructions can be issued to the functional units
  - About 1.4 to 2 times as many instructions are fetched and decoded as are finally committed, because of mispredicted branch paths
- The fetch width usually equals the decode width

Decoding variable-length instructions
- Processors with fixed-length instructions generally support fetching and decoding multiple instructions per cycle
- Variable instruction length: CISC instruction sets such as the Intel x86 ISA need a multistage decode
  - First stage, boundary determination: find the instruction boundaries in the instruction stream and pass instructions of known length to the second stage
  - Second stage, micro-op decoding: decode each instruction into one or more micro-ops
- AMD K series: a complex CISC instruction set architecture
  - Complex CISC instructions are split into micro-ops that resemble ordinary RISC instructions
  - The micro-ops for one instruction can be a few simple operations or a whole stream of simple operations
- CISC compared with RISC instruction sets:
  - Advantage: higher code density
  - Disadvantage: more complex instruction decode
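The two decode stages for variable-length instructions can be illustrated with a toy model. Everything about the encoding below is hypothetical (it is not x86): the first byte of an instruction is its opcode, and a table gives each opcode's length in bytes and the micro-ops it expands into.

```python
# Hypothetical toy ISA: opcode -> (instruction length in bytes, micro-ops).
TOY_ISA = {
    0x01: (1, ["nop"]),
    0x10: (2, ["add"]),                     # register-register add
    0x20: (3, ["load", "add", "store"]),    # memory-to-memory add
}


def find_boundaries(code: bytes) -> list:
    """Stage 1: locate instruction boundaries in the byte stream."""
    instructions = []
    pos = 0
    while pos < len(code):
        length, _ = TOY_ISA[code[pos]]      # length is known only after reading the opcode
        instructions.append(code[pos:pos + length])
        pos += length
    return instructions


def decode_to_uops(instruction: bytes) -> list:
    """Stage 2: expand one instruction into its RISC-like micro-ops."""
    _, uops = TOY_ISA[instruction[0]]
    return list(uops)


stream = bytes([0x10, 0xAB, 0x20, 0x01, 0x02, 0x01])
instructions = find_boundaries(stream)
uops = [u for ins in instructions for u in decode_to_uops(ins)]
print([ins.hex() for ins in instructions])
print(uops)
```

Stage 1 is inherently sequential here: the start of each instruction is known only once the length of the previous one has been determined, which is the reason wide CISC decoders need extra stages or predecode information.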

Pentium Pro Fetch/Decode
- 16 B/cycle delivered from the I$ into a FIFO instruction buffer
- Decoder 0 is fully general; decoders 1 and 2 can handle only simple uops
- Uops enter a centralized window (reservation station) and wait there until operands are ready and structural hazards are resolved
- Why is this bad? Branch penalty; a good branch predictor is needed
- Other option: predecode bits

Pre-decoding
- If the opcode encoding allows, the fetch stage can analyze part of the operation and use it for prediction
- Pre-decode: done as instructions are transferred from memory to the I-cache; the decode stage becomes simpler
- MIPS R10000:
  - Predecodes each 32-bit instruction into a 36-bit format stored in the instruction cache
  - The 4 extra bits indicate which functional unit will execute the instruction
  - The operand-select and destination-register-select fields of each instruction are rearranged so they sit in the same position, and the opcode is modified to simplify decoding of integer or floating-point destination registers
  - The decoder handles this expanded format much faster than the original instruction format

Predecoding in the AMD K5
- K5: notoriously late and slow, but still interesting (AMD's first non-clone x86 processor)
- 50% larger I$; predecode bits are generated as instructions are fetched from memory on a cache miss
  - A powerful principle in architecture: memoization!
- Predecode records the start and end of x86 ops, the number of ROPs, and the location of opcodes and prefixes
- Up to 4 ROPs per cycle
- Also useful in RISCs:
  - PPC 620 used 7 bits/instruction
  - PA-8000 and MIPS R10000 used 4/5 bits/instruction
  - These are used to identify branches early and reduce the branch penalty
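The idea of widening a 32-bit instruction to 36 bits at cache-fill time can be sketched as below. This is not the R10000's real predecode format: the 4-bit unit tags are invented for illustration, and the classification covers only a handful of MIPS opcodes (lw 0x23, sw 0x2B, beq 0x04, bne 0x05, COP1 0x11), treating everything else as an integer ALU operation.

```python
# Hypothetical one-hot unit tags stored in the 4 extra bits.
UNIT_TAGS = {"alu": 0b0001, "load_store": 0b0010, "fp": 0b0100, "branch": 0b1000}
TAG_TO_UNIT = {tag: unit for unit, tag in UNIT_TAGS.items()}


def classify(instr: int) -> str:
    """Pick a functional unit from the 6-bit major opcode (partial table)."""
    opcode = (instr >> 26) & 0x3F
    if opcode in (0x23, 0x2B):      # lw, sw
        return "load_store"
    if opcode in (0x04, 0x05):      # beq, bne
        return "branch"
    if opcode == 0x11:              # COP1 (floating point)
        return "fp"
    return "alu"


def predecode(instr: int) -> int:
    """Done once, when the line is filled: prepend the unit tag (36 bits total)."""
    return (UNIT_TAGS[classify(instr)] << 32) | (instr & 0xFFFFFFFF)


def unit_of(predecoded: int) -> str:
    """Done on every fetch: read the tag instead of re-decoding the opcode."""
    return TAG_TO_UNIT[predecoded >> 32]


icache_line = [predecode(word) for word in (0x8FA80000, 0x01095020, 0x11090003)]
print([unit_of(entry) for entry in icache_line])
```

This is the memoization principle the K5 slide points to: the classification work is paid once per cache fill and reused on every later fetch of the same line.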

Instruction Dispatch and Issue
- Parallel pipeline
  - Centralized instruction fetch
  - Centralized instruction decode
- Diversified pipeline
  - Distributed instruction execution

Issue and Dispatch
- The instruction window: made up of all the waiting stations between the decode stage and the execute stage
  - It decouples the execute stage from the decode stage, but it is not an additional pipeline stage
- Instruction issue: the process of initiating instruction execution in the processor's functional units
  - Issue: to a functional unit or a reservation station
  - Dispatch: used when a second issue stage exists, to denote when an instruction starts executing in the functional unit
- The instruction issue policy is the protocol used to issue instructions
- The processor's lookahead capability is its ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute, so that later independent instructions can be sent to execution
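The lookahead capability described above can be sketched as a selection loop over the instruction window. This is an illustrative model, not any particular processor's issue logic: it assumes register renaming has removed WAR and WAW hazards, so only true (read-after-write) dependences matter, and the class and function names are made up.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple


def select_issue(window, ready_regs, width, lookahead=True):
    """Choose up to `width` instructions to issue this cycle.

    With lookahead, a stalled instruction is skipped and younger independent
    instructions may issue; without it, the first stall blocks everything.
    """
    issued = []
    pending = set()                 # results of older instructions not yet available
    for ins in window:              # window is ordered oldest first
        if len(issued) == width:
            break
        ready = all(s in ready_regs and s not in pending for s in ins.srcs)
        if ready:
            issued.append(ins.name)
        elif not lookahead:
            break
        pending.add(ins.dest)
    return issued


window = [
    Instr("load", "r1", ("r9",)),       # r9 not ready: stalled
    Instr("add", "r4", ("r1", "r2")),   # depends on the load
    Instr("mul", "r7", ("r5", "r6")),   # independent
    Instr("sub", "r8", ("r3", "r7")),   # depends on the mul
]
ready = {"r2", "r3", "r5", "r6"}
print(select_issue(window, ready, width=2, lookahead=True))
print(select_issue(window, ready, width=2, lookahead=False))
```

The second call models the rigid stall policy of a scalar pipeline: the stalled load blocks the independent multiply behind it.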

Necessity of Instruction Dispatch
- Must have complex interstage buffers to hold instructions to avoid a rigid pipeline

Instruction Window Organizations (3-1)
- A central instruction window
  - Corresponds to single-stage issue
  - All instructions bound for all functional units are placed in one common instruction window buffer
- Drawback: issuing from one large central window limits the achievable clock frequency
  - It stresses the ability to update the window and to check the associated resources (functional-unit selection, reorder-buffer selection)
  - The larger the window, the faster the complexity of update and selection grows

Instruction Window Organizations (3-2)
Solutions:
- Multi-stage issue: operand availability and resource availability checking is split into two separate stages
  - Resource-dependent issue first places instructions into reservation stations (one per functional unit or per group of functional units)
  - When the operands are ready and execution is allowed, the instruction enters the second stage and can be dispatched to a functional unit
- Decoupling of instruction windows: provide a set of instruction windows or reservation stations
  - Each instruction window is shared by a group of (usually related) functional units
  - Most common: a separate floating-point window and integer window

Instruction Window Organizations (3-3)
- Combination of multi-stage issue and decoupling of instruction windows
- Instructions can be issued from the window in order or out of order
- In a two-stage issue scheme, with resource-dependent issue preceding the data-dependent dispatch:
  - the first stage is done in order
  - the second stage is performed out of order

The common issue schemes
- Single-level, central issue: single-level issue out of a central window, as in the Pentium II processor
[Figure: Decode and Rename, then Issue and Dispatch, then Functional Units]

Single-level, two-window issue
- Single-level issue with instruction window decoupling using two separate windows
- Most common: separate floating-point and integer windows, as in the HP PA-8000 processor
[Figure: Decode and Rename, then two Issue and Dispatch windows, each feeding its own Functional Units]

Two-level issue with multiple windows
- A centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors)
[Figure: Decode and Rename, then Dispatch into Reservation Stations, then Issue into four Functional Units]

Centralized Reservation Station
- Dispatch: based on type
- Issue: when instruction enters functio...
