As a recap, the drawbacks of using a single cycle datapath are:
- The clock period is set by the slowest instruction, so every instruction takes that long.
- Datapath elements can't be reused within an instruction, so hardware gets duplicated (separate instruction and data memories, multiple adders).
In a multi-cycle execution model, each instruction takes multiple cycles to execute. Cycle time is reduced, slower instructions take more cycles, and you can reuse datapath elements each cycle. To make this work:
- Add registers that hold values between cycles (instruction register, ALU result, memory data register).
- Add a finite state machine controller that steps each instruction through its cycles.
This is what the multicycle datapath looks like:
Note how there is only one memory: the data memory and instruction memory are combined now. Also, there is only one adder (ALU) in the entire datapath.
There is also an instruction register, which holds the fetched instruction, plus a state register for the controller. The state register is allocated 4 bits, and hence there are 16 possible states for the finite state machine. In each cycle, the controller takes the opcode as input, produces the control signals (12 output bits), and computes the next state via the transition function.
The first state is the fetch cycle: read the instruction from memory into the instruction register (say it turns out to be an `add` rather than a `nand` — we don't know yet), and increment the PC with the PC_write signal enabled. We still have to figure out which instruction it is: the next state is decode, which reads the opcode and register operands. Then do the `add` (or `nand`) operation and write the result back.
In the current LC2K ISA, there must be 17 states with `jalr` included. How can we reduce the number of states for `add` and `nand` so that the states fit in 4 bits instead of 5?
There is a separate dispatch ROM that tells you which state to go to next: it takes the opcode as input and returns the next state based on that opcode.
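As a sketch, the dispatch ROM is just a small opcode-indexed table. The LC2K opcode encodings below are standard, but the state numbers are made up for illustration; note how `add`/`nand` share a state and `lw`/`sw` share an address-calculation state, which is exactly how you shrink the state count:

```python
# Dispatch ROM sketch: after decode, the opcode indexes this table to pick
# the first execution state. State numbers here are hypothetical.
DISPATCH_ROM = {
    0b000: 2,  # add  -> R-type execute state
    0b001: 2,  # nand -> shares the R-type execute state
    0b010: 4,  # lw   -> address-calculation state
    0b011: 4,  # sw   -> shares the address-calculation state
    0b100: 7,  # beq  -> branch-compare state
    0b101: 9,  # jalr -> first jalr state
}

def next_state_after_decode(opcode):
    """Return the FSM state to enter for the decoded opcode."""
    return DISPATCH_ROM[opcode]
```

Sharing states this way is what lets the whole machine fit in a 4-bit state register.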
`lw` example

First, calculate the address for the memory reference.
The address (regA + offset) is now in the ALU result register.
Then, you have to read that memory location and write the loaded value into regB. Finally, go back to fetch the next instruction.
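The `lw` walkthrough above can be sketched as a cycle-by-cycle trace. This is a toy model: the register/memory contents and the exact five-cycle breakdown are illustrative, not the real datapath.

```python
def run_lw(regs, mem, pc, regA, regB, offset):
    """Trace lw regB, regA, offset — one datapath step per cycle."""
    pc = pc + 1                       # cycle 1: fetch, PC <- PC + 1
    #                                   cycle 2: decode, read registers
    alu_result = regs[regA] + offset  # cycle 3: address into ALU result
    mdr = mem[alu_result]             # cycle 4: read the memory location
    regs[regB] = mdr                  # cycle 5: write-back, return to fetch
    return pc

regs = [0] * 8
regs[1] = 10                 # hypothetical base address in regA
mem = {12: 42}               # hypothetical memory: mem[10 + 2] = 42
pc = run_lw(regs, mem, pc=0, regA=1, regB=2, offset=2)
```

After this runs, regB (here `regs[2]`) holds the loaded value and the PC points at the next instruction.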
`sw` example

The same as `lw`, except you enable memory write instead of memory read (and there is no register write-back).
`beq` example

First, calculate the target address for the branch: PC + 1 + offset. The PC was already incremented during fetch, so the PC already contains PC + 1; the ALU adds the PC and the offset and stores the target in ALU result. Then write the target address into the PC, iff (data in regA) == (data in regB).
This assumes that the EQ comparison is fast enough to propagate through the circuit before the clock cycle ends. It's not a bad assumption: compared to a 32-bit add, an equality comparison is a lot quicker. If this isn't true, then you have a few options.
Now you go back to fetch.
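The `beq` steps above can be sketched the same way; the `if` models the EQ output gating the PC_write signal (register values and offsets here are made up):

```python
def run_beq(regs, pc, regA, regB, offset):
    """beq regA, regB, offset: branch iff regs[regA] == regs[regB]."""
    pc = pc + 1                   # fetch already left PC + 1 in the PC
    alu_result = pc + offset      # ALU computes the target, PC + 1 + offset
    if regs[regA] == regs[regB]:  # EQ output gates the PC_write signal
        pc = alu_result           # write the target into the PC
    return pc                     # then go back to fetch

regs = [0, 5, 5, 7, 0, 0, 0, 0]
taken     = run_beq(regs, pc=0, regA=1, regB=2, offset=9)  # 5 == 5
not_taken = run_beq(regs, pc=0, regA=1, regB=3, offset=9)  # 5 != 7
```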
What about `jalr`?

`jalr` example

`jalr` needs to implement:
```
jalr regA, regB
===============
regB = PC + 1
PC = regA
```
You need to get PC + 1 and store it in regB. However, since there is no direct path from the PC to regB, you need a first cycle dedicated to moving PC + 1 toward regB. You can do this by routing the PC through the top MUX into the ALU, and selecting zero with the bottom MUX (adding 0 to the PC). Store this in ALU result.
In the next cycle, take the ALU result and bring it through the bottom MUX into the register file's write-data input, with the destination register regB coming from the instruction register into the top MUX, and write it into the register file.
Now, you need to do the same process to bring regA into the ALU result, and then into the PC. This takes two cycles again, four in total.
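The four `jalr` execution cycles can be sketched like this (a toy trace under made-up register contents; note that the fetch stage already left PC + 1 in the PC):

```python
def run_jalr(regs, pc, regA, regB):
    """jalr regA, regB: regB <- PC + 1, then PC <- regA."""
    pc = pc + 1                  # fetch: PC now holds PC + 1
    alu_result = pc + 0          # cycle 1: PC through top MUX, ALU adds 0
    regs[regB] = alu_result      # cycle 2: ALU result into regB
    alu_result = regs[regA] + 0  # cycle 3: regA through the ALU, add 0
    pc = alu_result              # cycle 4: ALU result into the PC
    return pc

regs = [0] * 8
regs[3] = 20                     # hypothetical jump target in regA
pc = run_jalr(regs, pc=7, regA=3, regB=4)
```

Writing regB before reading regA also gets the `regA == regB` corner case right: the jump target is then the freshly written PC + 1.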
Well... if you only use one half of the datapath each cycle in `jalr`, couldn't you overlap and do two things at the same time in one cycle? You can! WITH PIPELINING! [cue fanfare]. This means you could do `jalr` in 3 cycles instead of 4. Fancy, right?
Without pipelining, multicycle datapaths are slower than single cycle datapaths. Wait, what? Shouldn't it be faster, because each instruction's time is independent of the slowest instruction? The catch is that every cycle pays register setup time and clock overhead, and the work rarely divides evenly across cycles, so the cycles add up to more than the single-cycle delay.
With pipelining, you can make it faster. You reuse the idle resources right away.
Some things that we care about: response time (latency) and throughput.
Response time and throughput often trade off: making one better can make the other worse, and vice versa. (Little's law relates them: requests in flight = throughput × response time.)
The execution time is the response time in the case of processors. Execution time is given by the "iron law" of performance:
$$\text{Execution time} = \text{Instruction count} \times \text{CPI} \times \text{Clock period}$$
The CPI is the average number of clock cycles per instruction for an application. Your goal is to improve the CPI without impacting the clock period. Single cycle gives the best CPI (one) but a long clock period; multicycle shortens the clock period but makes the CPI worse; with pipelining you can make both better.
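Plugging made-up but plausible numbers into the iron law shows the tradeoff. Assume a 5 ns slowest instruction (single cycle), 1 ns stages plus 0.2 ns per-cycle register/clock overhead, and an average multicycle CPI of 4.4:

```python
# Iron law: execution time = instructions * CPI * clock period.
def exec_time_ns(insts, cpi, clock_ns):
    return insts * cpi * clock_ns

insts = 1_000_000
single_cycle = exec_time_ns(insts, cpi=1.0, clock_ns=5.0)  # long clock
multicycle   = exec_time_ns(insts, cpi=4.4, clock_ns=1.2)  # worse CPI
pipelined    = exec_time_ns(insts, cpi=1.1, clock_ns=1.2)  # both improved
```

With these numbers the multicycle machine is actually slower than single cycle (the per-cycle overhead adds up), while pipelining wins on both CPI and clock period.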
Instead of waiting for instructions to finish, you start instructions as soon as possible, as long as they don't compete for resources.
Why can't we just put in 1000 pipeline stages and have awesome execution time for the whole program? Why don't people do that?
There's a small delay between pipeline stages (latching values into the pipeline registers). When this overhead becomes large relative to the useful work per stage, there is a problem.
There are pipeline registers, where the values produced by the previous stage are stored for the next stage. It's just a bunch of flip-flops.
This makes processor design more complicated, because you want to reduce the number of times two instructions share state (hazards).
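As a sketch, a pipeline register can be modeled as state latched once per clock edge. This toy two-stage pipeline is illustrative only — no hazards, and the stage names are made up:

```python
def pipeline(instrs):
    """Two-stage toy pipeline. A single pipeline register (a bank of
    flip-flops) carries the fetched instruction into the execute stage,
    so two instructions can be in flight during the same cycle."""
    fetch_ex = None                   # the fetch/execute pipeline register
    done = []
    for incoming in instrs + [None]:  # one extra cycle to drain the pipe
        if fetch_ex is not None:
            done.append(fetch_ex)     # stage 2 consumes the latched value
        fetch_ex = incoming           # stage 1 latches the next instruction
    return done
```

Every instruction still takes two cycles of latency, but one instruction completes per cycle once the pipeline is full — that is the throughput win.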