Introduction to Computer Organization

Multicycle Datapaths and Intro to Pipelining

As a recap, the drawbacks of using a single cycle datapath are:

  • All instructions run at the rate of the slowest instruction.
  • Adding a slow instruction to your ISA can really slow things down.
  • Datapath elements cannot be reused within a single instruction, so hardware gets duplicated (e.g., separate adders and separate instruction/data memories).

In a multi-cycle execution model, each instruction takes multiple cycles to execute. Cycle time is reduced, slower instructions take more cycles, and you can reuse datapath elements each cycle. To make this work:

  • You need more and/or wider MUXes.
  • You may need extra registers if an output must be remembered for one or more cycles.
  • Control is more complicated, since you need to send more signals each cycle.

This is what the multicycle datapath looks like:

Note how there is only one memory: the data memory and instruction memory are combined now. Also, there is only one adder (ALU) in the entire datapath.

There is also an instruction register, which holds the fetched instruction. Control is a finite state machine whose state register is allocated 4 bits, and hence there are 16 possible states. In each cycle, the FSM takes the opcode as input, produces about 12 control output bits, and uses a transition function to choose the next state.

Fetch Example

The first state is the fetch cycle.

  • Read the instruction and store it in the instruction register.
    • Read the memory contents at the address in the PC, and store them in the instruction register.
    • Enable a memory operation: set memory read mode and instruction register write.
  • Increment PC.
    • In parallel with the memory read, we can compute PC + 1. We do it now because the ALU would otherwise sit idle this cycle.
    • Send a 1 to the ALU through the MUX ALU2.
    • Set the ALU to compute add rather than nand.
    • We don't, however, enable the PC_write signal yet. That happens later, in decode.

The next state is decode (both fetch and decode are sketched in code after this list).

  • Update PC.
    • The ALU result (PC + 1) drives the PC input.
    • PC write is enabled.
  • Use the opcode to determine the next state.
    • The instruction register drives the opcode into the control logic.
    • The control ROM output selects the next state.
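
Putting the two states together, here is a minimal Python sketch of the control FSM. The signal and state names are hypothetical (the real control is a ROM plus a 4-bit state register in hardware, not software):

```python
# Minimal sketch of the multicycle control FSM (hypothetical names).
# A 4-bit state register allows 2**4 = 16 states.
FETCH, DECODE = 0b0000, 0b0001   # two of the sixteen possible states

def control_step(state: int, opcode: str):
    """One cycle of control: emit this state's output bits (about 12 in
    the real datapath) and compute the next state."""
    if state == FETCH:
        # Read memory at PC into the instruction register, and use the
        # otherwise-idle ALU to compute PC + 1 (not yet written to PC).
        signals = {"mem_read": 1, "ir_write": 1,
                   "alu_op": "add", "alu_src2": "one"}
        return signals, DECODE            # fetch always goes to decode
    if state == DECODE:
        # Latch PC + 1 into the PC, then dispatch on the opcode.
        signals = {"pc_write": 1, "pc_src": "alu_result"}
        return signals, dispatch(opcode)  # dispatch ROM, sketched below
    raise ValueError("only fetch and decode are sketched here")

def dispatch(opcode: str) -> int:
    raise NotImplementedError  # see the dispatch-ROM sketch below
```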

Then the FSM proceeds to the execution states for the instruction itself; for add, that means computing regA + regB in the ALU and writing the result into the destination register.

A problem

In the current LC2K ISA, a naive design needs 17 states once jalr is included, which is one too many for a 4-bit state register. How can we reduce the number of states (for example, for add and nand) so that the state fits in 4 bits instead of 5?

There is a separate dispatch ROM that tells you which state to go to next: it takes the opcode as input and outputs the next state for that opcode.
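
A dispatch ROM is just a lookup table indexed by opcode. A sketch, with made-up state numbers (not the actual LC2K encoding):

```python
# Hypothetical dispatch ROM: opcode -> first execution state.
# State numbers are invented for illustration.
DISPATCH_ROM = {
    "add":  0b0010,
    "nand": 0b0010,   # add and nand can share states (only the ALU op differs)
    "lw":   0b0011,
    "sw":   0b0011,   # lw and sw share the address-computation state
    "beq":  0b0101,
    "jalr": 0b0111,
}

def dispatch(opcode: str) -> int:
    return DISPATCH_ROM[opcode]
```

Sharing states between similar instructions like this is one way to get the count under 16.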

lw example

First, calculate the address for the memory reference.

  • The instruction register drives the offset field.
  • Sign extend is enabled.
  • The contents of register A drive the ALU input.
  • Route the offset and regA through the MUXes to the ALU.

The address (regA + offset) is now in the ALU result register.

Then, you have to read the memory location:

  • The ALU result drives the memory address.
  • The memory MUX takes in the ALU result.
  • Read is enabled on memory.
  • The memory output drives the write data.
  • The instruction register drives the destination register field through the register file MUX.
  • Register file write is enabled.

Then go back to the fetch state for the next instruction.
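
The whole lw execution can be summarized as per-cycle control settings. A sketch with hypothetical signal names, following the bullets above:

```python
# Hypothetical per-cycle control settings for lw, after fetch and decode.
LW_STATES = [
    # Cycle 1: ALU result <- regA + sign-extended offset (the address)
    {"alu_src1": "regA", "alu_src2": "sext_offset", "alu_op": "add"},
    # Cycle 2: read mem[ALU result] and write it into the destination
    # register named by the instruction register
    {"mem_addr_src": "alu_result", "mem_read": 1,
     "regfile_data_src": "mem_data", "regfile_write": 1},
    # Then: back to the fetch state.
]
```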

sw example

The same as lw, except you enable memory write instead of memory read (and nothing is written to the register file).

beq example

First, calculate the target address for the branch: PC + 1 + offset.

  • Route PC to first ALU MUX
    • You already incremented PC in fetch, so PC already contains PC + 1.
  • Route the sign-extended offset from the instruction register through the second MUX
  • The ALU computes the sum and stores it in the ALU result register

Write target address into PC, iff (data in regA) == (data in regB)

  • Do the equality comparison in the ALU: XNOR each pair of bits and check that all result bits are 1 (see the sketch after this list)
  • If equal:
    • The ALU result drives the PC input
    • PC write is enabled
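
The XNOR trick works because two bits are equal exactly when their XNOR is 1, so two words are equal when every XNOR'd bit is 1; equivalently, when the XOR of the two words is all zeros. A quick sketch:

```python
def words_equal(a: int, b: int, bits: int = 32) -> bool:
    """Equality via bitwise XNOR: every result bit must be 1,
    which is the same as the XOR of the two words being 0."""
    mask = (1 << bits) - 1
    xnor = ~(a ^ b) & mask
    return xnor == mask        # equivalent test: (a ^ b) & mask == 0

assert words_equal(7, 7) and not words_equal(7, 5)
```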

This assumes that the EQ comparison settles and propagates through the circuit before the clock cycle ends. It's not a bad assumption: next to a 32-bit adder, an equality comparison is a lot quicker. If this isn't true, then you have a few options:

  • Extend the clock period (but you would slow down every instruction)
  • Make this instruction take a full additional clock cycle to finish.
    • This would require that we have an ALU "equals" register, or...
    • We could add an additional bit to the ALU result register, which is 1 if the two numbers are equal and zero otherwise.

Now you go back to fetch.

What about jalr?

jalr example

jalr needs to implement:

jalr regA, regB
===============

regB = PC + 1
PC = regA

You need to get PC + 1 (which the PC already holds after fetch and decode) into regB. However, since there is no direct path from the PC to the register file, you need a first cycle dedicated to moving the PC through the ALU on its way to regB. You can do this by putting the PC through the top MUX on the ALU and a zero through the bottom MUX (adding 0 to PC). Store this in the ALU result register.

In the next state, drive the ALU result through the register file's bottom MUX as the write data, select regB (from the instruction register) through the top MUX as the destination, and write the register file.

Now, you do the same two-cycle process to bring regA through the ALU into the ALU result register, and from there into the PC.
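
So jalr takes four execution states in this naive scheme. A sketch of the sequence (hypothetical signal names again):

```python
# Hypothetical per-cycle control for jalr, after fetch and decode.
# Note: PC already holds PC + 1 by this point (incremented during decode).
JALR_STATES = [
    # Cycle 1: ALU result <- PC + 0 (the only path from PC toward the regfile)
    {"alu_src1": "pc", "alu_src2": "zero", "alu_op": "add"},
    # Cycle 2: regB <- ALU result (the link value, PC + 1)
    {"regfile_dest": "regB_field", "regfile_data_src": "alu_result",
     "regfile_write": 1},
    # Cycle 3: ALU result <- regA + 0
    {"alu_src1": "regA", "alu_src2": "zero", "alu_op": "add"},
    # Cycle 4: PC <- ALU result
    {"pc_src": "alu_result", "pc_write": 1},
]
```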

Well... if you only use one half of the datapath each cycle of jalr, couldn't you overlap and do two things at the same time in one cycle? You can! WITH PIPELINING! [cue fanfare] This means you could do jalr in 3 cycles instead of 4. Fancy, right?

How does it perform?

Without pipelining, multicycle datapaths are slower than single cycle datapaths. Wait, what? Shouldn't it be faster, because each instruction's speed is independent of the slowest instruction? The catch is per-cycle overhead: every cycle pays to latch values into registers, and the cycle time is set by the slowest step, so an instruction's cycles typically add up to more than its single-cycle time.

With pipelining, you can make it faster, because you reuse the otherwise-idle resources right away.
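
A back-of-the-envelope comparison, with invented numbers, shows the shape of the tradeoff:

```python
# Invented numbers, purely to illustrate the tradeoff.
work_ns     = 8.0   # total logic delay of the slowest instruction
overhead_ns = 0.5   # per-cycle register/latch overhead
avg_cycles  = 4     # average cycles per instruction in the multicycle design

# Single cycle: one long cycle per instruction.
single_cycle_time = work_ns + overhead_ns                  # 8.5 ns / instr

# Multicycle: shorter cycles, but the overhead is paid every cycle
# (assuming the work splits evenly across the cycles).
cycle = work_ns / avg_cycles + overhead_ns                 # 2.5 ns
multicycle_time = avg_cycles * cycle                       # 10.0 ns / instr

# Pipelined: same short cycle, but once the pipeline is full an
# instruction (ideally) completes every cycle.
pipelined_time_per_instr = cycle                           # 2.5 ns / instr

print(single_cycle_time, multicycle_time, pipelined_time_per_instr)
```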

Pipelining

Some things that we care about:

  • Response time: How long does it take for a job to finish?
  • Throughput: How much work can you get done within a specified time?

The two are in tension: pipelining, for example, improves throughput without shortening (and possibly slightly lengthening) the response time of any single instruction. (Little's law relates the two quantities: the average number of jobs in flight equals throughput times response time.)

For processors, the response time is the execution time, and the execution time is given by the "iron law" of performance:

$$\text{Execution time} = \text{Instructions executed} \times \text{CPI} \times \text{Clock period}$$

The CPI is the average number of clock cycles per instruction for an application. Your goal is to improve the CPI without hurting the clock period. Single cycle gives you the best CPI (one cycle per instruction) but a bad clock period; multicycle gives you a better clock period but a worse CPI; pipelining can make both better.
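
A worked example of the iron law (numbers invented):

```python
# Iron law: execution time = instruction count * CPI * clock period.
instructions = 1_000_000
cpi          = 4       # e.g., a multicycle design averaging 4 cycles
period_ns    = 2.5

exec_time_ms = instructions * cpi * period_ns / 1e6   # ns -> ms
print(exec_time_ms, "ms")   # 1,000,000 * 4 * 2.5 ns = 10.0 ms
```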

Instead of waiting for instructions to finish, you start the next instruction as soon as possible, as long as the instructions don't compete for resources.

More pipeline stages == better?

Why can't we just use 1000 pipeline stages and get awesome execution time for the whole program? Why doesn't anyone do that?

There's a small delay added between stages by each pipeline register. When the number of stages becomes large, this per-stage overhead starts to dominate, as the sketch below shows.
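
A simple model makes this concrete: if an instruction's logic takes T nanoseconds in total and each pipeline register adds a fixed overhead, then with N stages the cycle time is roughly T/N plus the overhead, and the overhead term never shrinks. A sketch with invented numbers:

```python
# Cycle time vs. pipeline depth (invented numbers).
T_ns        = 8.0    # total logic delay, assumed to split evenly across stages
overhead_ns = 0.5    # delay added by each pipeline register

for n_stages in (1, 2, 4, 8, 1000):
    cycle = T_ns / n_stages + overhead_ns
    print(f"{n_stages:4d} stages: cycle = {cycle:.3f} ns")

# At 1000 stages the cycle is ~0.508 ns: almost pure register overhead,
# so the extra stages buy essentially nothing.
```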

How it works

There are pipeline registers between stages, which store the values a stage produces so the next stage can use them in the following cycle. Each is just a bunch of flip-flops.

Pipelining makes processor design more complicated, because you want to minimize the cases where two in-flight instructions need the same piece of state at the same time.
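
A toy illustration of pipeline registers (plain Python, not a processor model): two "stages" do work every cycle, and the register between them carries a value across a clock edge, so both stages can work on different items at once.

```python
# Toy two-stage pipeline with one pipeline register between the stages.
def stage1(x):
    return x + 1            # stand-in for, say, fetch/decode work

def stage2(x):
    return x * 2            # stand-in for, say, execute work

reg_mid = None              # the pipeline register (a bunch of flip-flops)
results = []
stream = [10, 20, 30, None, None]   # three "instructions", then drain

for item in stream:
    # Both stages fire in the same cycle, reading the OLD register value.
    out2 = stage2(reg_mid) if reg_mid is not None else None
    out1 = stage1(item) if item is not None else None
    reg_mid = out1          # latch the new value at the clock edge
    if out2 is not None:
        results.append(out2)

print(results)   # [22, 42, 62] == [(10+1)*2, (20+1)*2, (30+1)*2]
```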