Industrial Software Engineering

The five-stage pipeline
 Each clock cycle becomes one pipeline stage
 Stages can be executed in parallel
 Although executing a single instruction still takes 5 clock cycles, the CPI drops from 5 to 1
 Is it really that simple?
Pipeline execution
[Figure: overlapped pipeline execution over clock cycles CC1–CC8 (time axis). Five consecutive instructions each pass through IM (instruction memory), REG (register read), DM (data memory) and REG (write-back), one stage per clock cycle, so a new instruction starts every cycle.]
Three observations
It is not so simple!
What can happen on every clock cycle?
 For example, a single ALU cannot be asked to compute an effective address and perform an arithmetic operation at the same time.
 Happily, the major functional units are used in different cycles, so overlapping the execution of multiple instructions introduces relatively few conflicts. There are three observations on which this fact rests:
 We use separate instruction and data memories (typically implemented as caches)
 The register file is used in two stages: for reading in ID and for writing in WB
 To start a new instruction every clock cycle, we must increment and store the PC every cycle, and this must be done during the IF stage in preparation for the next instruction
The pipeline data paths – skeleton used
[Figure: the pipelined datapath skeleton for five overlapped instructions. Each instruction flows through im → reg → alu → dm → reg, with pipeline registers (e.g. EX/MEM, MEM/WB) separating the stages.]
Basic Performance Issues
Pipelining increases CPU instruction throughput, but it does not reduce the execution time of an individual instruction; a program nevertheless runs faster because throughput improves, even though the latency of each instruction stays the same (or grows slightly because of pipeline overhead).
An example: consider an unpipelined processor with a 1 ns clock cycle that uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20% and 40%, respectively, and that the total clock overhead of pipelining is 0.2 ns. The arithmetic is reproduced in the sketch after this list.
 The average instruction execution time on the unpipelined processor is clock cycle * average CPI = 1 ns * ((40% + 20%) * 4 + 40% * 5) = 4.4 ns
 On the pipelined processor the average instruction execution time is 1.2 ns (the clock must run at the speed of the slowest stage plus overhead)
 Speedup from pipelining = 4.4 ns / 1.2 ns ≈ 3.7 times
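The arithmetic above can be reproduced with a short sketch. The 1 ns base cycle, the 0.2 ns pipelining overhead and the 40/20/40 instruction mix are taken from the example; everything else is illustrative.

#include <stdio.h>

int main(void) {
    /* Instruction mix and cycle counts from the example above */
    double f_alu = 0.40, f_branch = 0.20, f_mem = 0.40;
    double cyc_alu = 4.0, cyc_branch = 4.0, cyc_mem = 5.0;

    double clk = 1.0;        /* unpipelined clock cycle, ns                 */
    double overhead = 0.2;   /* latch/skew overhead added by pipelining, ns */

    /* Unpipelined: average CPI times the clock cycle */
    double avg_cpi = f_alu * cyc_alu + f_branch * cyc_branch + f_mem * cyc_mem;
    double t_unpipelined = clk * avg_cpi;          /* 4.4 ns */

    /* Pipelined: one instruction per cycle; the cycle is the slowest
       stage plus the overhead */
    double t_pipelined = clk + overhead;           /* 1.2 ns */

    printf("unpipelined: %.1f ns, pipelined: %.1f ns, speedup: %.2fx\n",
           t_unpipelined, t_pipelined, t_unpipelined / t_pipelined);
    return 0;
}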
Hazards in pipelined processing
 Structural hazard (resource conflict) – occurs when several instructions request the same resource
 Data hazard – occurs when an operand of an instruction is produced by an immediately preceding instruction
 Control hazard – occurs when executing control-flow instructions, e.g. a conditional branch
Performance of Pipes with Stalls I
A stall causes the pipeline performance to degrade from the ideal performance.
 Let's start from the general speedup formula:
 Speedup = average instruction time unpipelined / average instruction time pipelined
         = (CPI unpipelined * clock cycle unpipelined) / (CPI pipelined * clock cycle pipelined)
 The ideal CPI on a pipelined processor is almost 1. Hence, we can compute the pipelined CPI as:
 CPI pipelined = ideal CPI + pipeline stall clock cycles per instruction
               = 1 + pipeline stall clock cycles per instruction
Performance of Pipes with Stalls II
 When we ignore the cycle time overheads, the clock cycles of the two processors are equal and the speedup can be expressed as:
 Speedup = CPI unpipelined / (1 + pipeline stall cycles per instruction)
 If all instructions take the same number of cycles, which must also equal the number of pipeline stages (the depth of the pipeline), then:
 Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)
 This leads to the result that pipelining can improve performance by the depth of the pipeline (if there are no pipeline stalls)
Performance of Pipes with Stalls III
 Now assume that the CPI of the unpipelined processor, as well as that of the pipelined one, is 1 (pipelining instead decreases the clock cycle time). Then:
 Speedup = (1 / (1 + pipeline stall cycles per instruction)) * (clock cycle unpipelined / clock cycle pipelined)
 When the pipe stages are perfectly balanced and there are no overheads, the clock cycle of the pipelined processor is smaller than the clock cycle of the unpipelined processor by a factor equal to the pipeline depth, so the speedup is expressed by:
 Speedup from pipelining = (1 / (1 + pipeline stall cycles per instruction)) * pipeline depth
 This leads to the conclusion that, if there are no stalls, the speedup is equal to the number of pipeline stages (checked numerically in the sketch below).
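As a quick numerical check of this last formula, here is a minimal sketch that evaluates speedup = pipeline depth / (1 + stall cycles per instruction); the depth of 5 and the stall counts used in main() are only illustrative.

#include <stdio.h>

/* Speedup of a perfectly balanced pipeline with stalls, per the formula above. */
static double pipeline_speedup(double depth, double stalls_per_instr) {
    return depth / (1.0 + stalls_per_instr);
}

int main(void) {
    printf("no stalls:              %.2fx\n", pipeline_speedup(5.0, 0.0)); /* 5.00 */
    printf("0.5 stalls/instruction: %.2fx\n", pipeline_speedup(5.0, 0.5)); /* 3.33 */
    return 0;
}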
Structural Hazards
 The overlapped execution of instructions requires pipelining of the functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.
 If some combination of instructions cannot be accommodated because of resource conflicts, the processor is said to have a structural hazard (for example, when some functional unit is not fully pipelined).
 It can happen when we need simultaneous access to:
 Memory
 Registers
 Processor
 To resolve this, we stall one of the instructions until the required resource is available. A stall is commonly called a pipeline bubble.
Example of a structural hazard
[Figure: Load followed by Instr. 1–4 in the pipeline. With a single memory, the Load's data-memory access (dm) falls in the same clock cycle as Instr. 3's instruction fetch (im), so both need the memory at once – a structural hazard.]
Structural hazard – resolution
Instruction    Clock cycle number
               1    2    3    4      5    6    7    8    9    10
Load           IF   ID   EX   MEM    WB
Instr. 1            IF   ID   EX     MEM  WB
Instr. 2                 IF   ID     EX   MEM  WB
Instr. 3                      stall  IF   ID   EX   MEM  WB
Instr. 4                                  IF   ID   EX   MEM  WB
Instr. 5                                       IF   ID   EX   MEM
Structural hazard cost
Let’s assume:
– Data references constitute 40% of the mix,
– The ideal CPI is equal to 1,
– The clock rate of the processor with the structural hazard is 1.05 times higher than that of the processor without the hazard.
Is the pipeline without the structural hazard faster, and if so, by how much?
Average instr. time = CPI * clock cycle time
Average instr. time (with hazard) = (1 + 0.4 * 1) * (clock cycle time_ideal / 1.05) ≈ 1.3 * clock cycle time_ideal
The processor without the structural hazard is therefore about 1.3 times faster. The short sketch below reproduces this calculation.
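The factor of 1.3 can be checked directly; the 40% data-reference frequency and the 1.05× clock-rate advantage come from the assumptions above, and the unit clock cycle is arbitrary.

#include <stdio.h>

int main(void) {
    double c_ideal = 1.0;   /* clock cycle of the hazard-free design (arbitrary units) */

    /* Hazard-free design: CPI = 1, no structural stalls */
    double t_ideal = 1.0 * c_ideal;

    /* Design with the structural hazard: one stall per data reference (40%),
       but a clock that is 1.05 times faster */
    double t_hazard = (1.0 + 0.4 * 1.0) * (c_ideal / 1.05);

    printf("the hazard-free design is %.2fx faster\n", t_hazard / t_ideal);  /* ~1.33 */
    return 0;
}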
Data hazards
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing the instructions on an unpipelined processor.
Let's consider the execution of the following instructions:
DADD  R1,R2,R3
DSUB  R4,R1,R5
AND   R6,R1,R7
OR    R8,R1,R9
XOR   R10,R1,R11
Example of a data hazard
[Figure: DADD R1,R2,R3 followed by DSUB R4,R1,R5, AND R6,R1,R7, OR R8,R1,R9 and XOR R10,R1,R11 in the pipeline. All four later instructions use R1, but DSUB, AND and OR read the register file before DADD has written R1 back in its final stage.]
Data hazard – solution
To solve the problem we use a technique called forwarding (also known as bypassing or short-circuiting). Forwarding works as follows (see the sketch below):
 The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs,
 If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, the control logic selects the forwarded result as the ALU input rather than the value read from the register file.
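A minimal sketch of that selection logic, assuming made-up names (ExMem, MemWb, forward_operand) for the pipeline-register fields rather than any particular processor's implementation:

#include <stdbool.h>
#include <stdint.h>

/* Simplified contents of the two pipeline registers that can feed the ALU.
   All field names are illustrative. */
typedef struct { bool reg_write; uint8_t rd; int64_t alu_result; } ExMem;
typedef struct { bool reg_write; uint8_t rd; int64_t value;      } MemWb;

/* Pick the ALU input for source register rs: prefer the newest result
   (EX/MEM), then MEM/WB, otherwise the value read from the register file.
   Register 0 is hard-wired to zero and is never forwarded. */
int64_t forward_operand(uint8_t rs, int64_t regfile_value,
                        const ExMem *exmem, const MemWb *memwb) {
    if (exmem->reg_write && exmem->rd != 0 && exmem->rd == rs)
        return exmem->alu_result;   /* forwarded from EX/MEM */
    if (memwb->reg_write && memwb->rd != 0 && memwb->rd == rs)
        return memwb->value;        /* forwarded from MEM/WB */
    return regfile_value;           /* no hazard */
}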
Data hazard – resolution with forwarding
[Figure: the same DADD/DSUB/AND/OR/XOR sequence, now with bypass paths drawn from the ALU output of DADD directly to the ALU inputs of DSUB, AND and OR.]
The problem is solved by adding extra bypass connections.
Data hazard – next example
To prevent a stall in this sequence, we would need to forward the values of the ALU output and the memory unit output from the pipeline registers to the ALU and data-memory inputs:
DADD  R1,R2,R3
LD    R4,0(R1)
SD    R4,12(R1)
[Figure: the three instructions overlapped in the pipeline; R1 is forwarded from DADD to LD and SD, and R4 is forwarded from LD to the data-memory input of SD.]
Data Hazards Requiring Stalls
LD    R1,0(R2)
DSUB  R4,R1,R5
AND   R6,R1,R7
OR    R8,R1,R9
[Figure: the four instructions overlapped in the pipeline; the value loaded into R1 is not available until after the LD's data-memory access.]
The load instruction can bypass its result to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in "negative time".
Pipeline interlock
Instruction        Clock cycle number
                   1    2    3    4      5    6    7    8    9
LD   R1,0(R2)      IF   ID   EX   MEM    WB
DSUB R4,R1,R5           IF   ID   stall  EX   MEM  WB
AND  R6,R1,R7                IF   stall  ID   EX   MEM  WB
OR   R8,R1,R9                     stall  IF   ID   EX   MEM  WB
The load instruction has a delay (latency) that cannot be eliminated by forwarding alone; the pipeline needs an interlock that stalls the dependent instruction, as in the check sketched below.
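The interlock amounts to a simple check in the decode stage; the sketch below uses hypothetical pipeline-register structures (IdEx, IfId) to show the condition, not a specific design.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical views of two pipeline registers. */
typedef struct { bool mem_read; uint8_t rt; } IdEx;  /* load currently in EX  */
typedef struct { uint8_t rs, rt; }            IfId;  /* instruction now in ID */

/* Load-use hazard: insert a one-cycle bubble when the load's destination
   register is a source of the instruction immediately following it. */
bool must_stall(const IdEx *idex, const IfId *ifid) {
    return idex->mem_read &&
           (idex->rt == ifid->rs || idex->rt == ifid->rt);
}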
Branch Hazards
The instruction after the branch is fetched, but the instruction is
ignored, and the fetch is restarted once the branch target is known
Branch instruction      IF   ID   EX   MEM  WB
Branch successor             IF   IF   ID   EX   MEM  WB     ← stall!
Branch successor +1               IF   ID   EX   MEM  WB
Branch successor +2                    IF   ID   EX   MEM
Reducing pipeline branch penalties
Simple compile-time schemes (static):
 Freeze (flush) the pipeline – hold or delete any instruction after the branch until the branch destination is known (previous slide).
 Predicted-not-taken (predicted-untaken) – treat every branch as not taken, simply allowing the hardware to continue as if the branch were not executed; care must be taken not to change the processor state until the branch outcome is definitely known (a sketch of this decision follows below).
 An alternative scheme is to treat every branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
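A rough sketch of the predicted-not-taken decision: fetching simply continues at PC + 4, and the already-fetched instruction is squashed only if the branch later resolves as taken. The types and names (Resolved, next_fetch_pc) are illustrative only.

#include <stdbool.h>
#include <stdint.h>

/* Outcome of a branch once it has been resolved (illustrative). */
typedef struct { bool is_branch; bool taken; uint64_t target; } Resolved;

/* Predicted-not-taken: keep fetching sequentially; on a taken branch,
   redirect the fetch to the target and squash the wrongly fetched
   instruction before it changes processor state. */
uint64_t next_fetch_pc(uint64_t pc_plus_4, const Resolved *br,
                       bool *squash_fetched) {
    if (br->is_branch && br->taken) {
        *squash_fetched = true;      /* misprediction: discard the fetch */
        return br->target;
    }
    *squash_fetched = false;         /* prediction correct: keep going   */
    return pc_plus_4;
}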
Predicted-not-taken scheme
Untaken branch instr.   IF   ID   EX   MEM  WB
Instruction +1               IF   ID   EX   MEM  WB
Instruction +2                    IF   ID   EX   MEM  WB
Instruction +3                         IF   ID   EX   MEM  WB
Instruction +4                              IF   ID   EX   MEM  WB

Taken branch instr.     IF   ID    EX    MEM   WB
Instruction +1               IF    idle  idle  idle  idle
Branch target                      IF    ID    EX    MEM   WB
Branch target +1                         IF    ID    EX    MEM   WB
Branch target +2                               IF    ID    EX    MEM   WB
Delayed Branch
 In a delayed branch, the execution cycle with a branch delay of one is:
 Branch instruction
 Sequential successor
 Branch target (if taken)
 The sequential successor is in the branch delay slot. This instruction is executed whether or not the branch is taken.
 It is possible to have a branch delay longer than one; however, in practice almost all processors with delayed branches use a single-instruction delay.
The behavior of a delayed branch
Untaken branch instr.        IF   ID   EX   MEM  WB
Branch delay instr. (i+1)         IF   ID   EX   MEM  WB
Instruction i+2                        IF   ID   EX   MEM  WB
Instruction i+3                             IF   ID   EX   MEM  WB
Instruction i+4                                  IF   ID   EX   MEM  WB

Taken branch instr.          IF   ID   EX   MEM  WB
Branch delay instr. (i+1)         IF   ID   EX   MEM  WB
Branch target                          IF   ID   EX   MEM  WB
Branch target +1                            IF   ID   EX   MEM  WB
Branch target +2                                 IF   ID   EX   MEM  WB
Pipeline processing example
Consider a loop that adds a scalar to every element of a vector:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
In assembly language this loop can take the following form:
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
Loop execution
Without scheduling:
Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      stall
      BNE    R1,R2,Loop
      stall

With scheduling:
Loop: L.D    F0,0(R1)
      DADDUI R1,R1,#-8
      ADD.D  F4,F0,F2
      stall
      BNE    R1,R2,Loop
      S.D    F4,8(R1)      ; filled branch delay slot

Assume the following latencies (stall cycles between a producing and a consuming instruction): FP ALU op followed by another FP ALU op – 3; FP ALU op followed by a store double – 2; load double followed by an FP ALU op – 1; load double followed by a store double – 0.
Loop unrolling and its execution
Unrolled four times (not yet scheduled):
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    F16,-24(R1)
      DADDUI R1,R1,#-32
      BNE    R1,R2,Loop

Unrolled and scheduled:
Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDUI R1,R1,#-32
      S.D    F12,16(R1)
      BNE    R1,R2,Loop
      S.D    F16,8(R1)     ; filled branch delay slot

The execution time of the unrolled, scheduled loop is now 14 clock cycles (3.5 cycles per element).
Optimizing the loop execution
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8     ; BNE removed
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8     ; BNE removed
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,#-8
      BNE    R1,R2,Loop
The data dependence through R1 (each DADDUI and the following L.D/S.D depend on the previous DADDUI) can be removed by computing the intermediate values of R1, changing the offsets in the L.D and S.D instructions, and decrementing R1 by 24 in a single instruction.
Optimizing the loop execution
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; DADDUI & BNE removed
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-8(R1)     ; DADDUI & BNE removed
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    F4,-16(R1)
      DADDUI R1,R1,#-24
      BNE    R1,R2,Loop
The reuse of F0 and F4 creates name dependences between the unrolled copies; if we rename the registers, only the true data dependences remain.
Optimizing the loop execution
Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)      ; DADDUI & BNE removed
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    F8,-8(R1)     ; DADDUI & BNE removed
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    F12,-16(R1)
      DADDUI R1,R1,#-24
      BNE    R1,R2,Loop
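At the source level, the transformation on the last three slides corresponds roughly to unrolling the original C loop by a factor of three. The sketch below is only an illustration: 1000 is not a multiple of 3, so a cleanup loop handles the leftover iterations, a detail the slides omit.

/* Source-level view of unrolling x[i] = x[i] + s by a factor of three. */
void add_scalar_unrolled(double *x, double s) {
    int i;
    for (i = 1000; i > 3; i -= 3) {   /* main unrolled body: three copies */
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
    }
    for (; i > 0; i -= 1)             /* cleanup for the remaining elements */
        x[i] = x[i] + s;
}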
