Introduction to Computer Organization

Pipeline Control Hazards

Previously, we covered data hazards. If an instruction writes its result late in the pipeline, and a later instruction needs that value early in the pipeline, the later instruction could read the wrong register contents. The simplest fix is to insert noops. That doesn't slow down the clock, but it requires the programmer to know the pipeline's details, breaks backward compatibility across implementations, and bloats the code. Instead, you can add detect-and-stall hardware so the code doesn't bloat, though you might have to slow down the clock a little bit.

A more advanced feature is detect and forward. You take the output of the ALU and route it straight back to the ALU's input for the next cycle. With forwarding, add instructions never have to stall. Load instructions, however, still need a single noop: the loaded value is not ready early enough to get through without a stall, because you have to wait until the memory stage completes.
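The stall decision above can be sketched in a few lines (the instruction representation and field names here are made up for illustration, not from any real simulator):

```python
def needs_stall(prev, curr):
    """With full forwarding, only a load followed immediately by an
    instruction that reads the loaded register forces a one-cycle stall:
    the loaded value is not available until after the memory stage."""
    return prev["op"] == "lw" and prev["dest"] in curr["srcs"]

# add's result can be forwarded straight from the ALU output, so no stall:
print(needs_stall({"op": "add", "dest": 3, "srcs": [1, 2]},
                  {"op": "add", "dest": 4, "srcs": [3, 2]}))  # False

# a load's value arrives too late for the next instruction's execute stage:
print(needs_stall({"op": "lw", "dest": 3, "srcs": [1]},
                  {"op": "add", "dest": 4, "srcs": [3, 2]}))  # True
```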

Control Hazards

These come up because of branches. If you branch, you may not know whether you're going to jump, or where you're going to jump, until you evaluate the branch condition.

Pipeline function for BEQ:

  • Fetch: Read instruction from memory.
  • Decode: Read source operands from registers.
  • Execute: Calculate target address and test for equality.
  • Memory: Send target to PC (if the branch is taken).
  • Writeback: Nothing to do. Maybe start fetching next instruction.

Three ways to handle:

  • Avoid (make sure there are no hazards).
    • Potential way: Get rid of them by getting rid of all control! Pretty hard to use for most people for most applications.
  • Detect and stall: Insert noops when a branch is detected.
    • For branches, the hardware sticks in noops until the branch resolves. Simple, and it works.
    • Alternatively (delayed branches): find instructions that you want to execute either way, and stick them in after the branch instead of waiting.
  • Speculate and squash-if-wrong: Go ahead and fetch more instructions in case the prediction is correct, but squash them if they shouldn't have executed.

Problems with delayed branches

  • Old programs may not run correctly on new implementations.
  • Programs get larger as noops are included.
  • Program execution is slower.

Detect and stall

Advantage: no software bloat. Disadvantage: have to add hardware.

The CPI increases every time a branch is detected! Is that necessary? Not always! Sometimes the branch is not taken. You can keep fetching, assuming the branch is not taken. If you are wrong, then that is okay as long as you do not COMPLETE any instructions you mistakenly executed. As the instruction has not modified any globally visible state, you're fine.

Speculate and squash

Simply turn the instructions that need to be squashed into noops as they move down the pipeline.

For example, if you have the following program:

beq 1 2 1
sub 3 4 5
add 6 7 8

You need to consider the time filling the pipeline, the time getting the instructions through, and the time squashing.

If the branch is not taken, you don't lose any time. The time filling the pipeline is 4 cycles, and the time to execute each additional instruction is 1, so 7 cycles total.

If you speculate that the branch is not taken, and it really is taken, then it takes 4 cycles to fill the pipeline, plus 2 to get the completed instructions through, plus 3 for the squashed instructions. This increases the CPI each time the branch is taken!
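The cycle counts above can be captured in a tiny model (the fill and squash-penalty constants are taken from the example, assuming the branch resolves in the memory stage):

```python
FILL = 4            # cycles to fill the pipeline before the first completion
SQUASH_PENALTY = 3  # cycles lost to wrongly fetched instructions per misprediction

def total_cycles(completed, mispredictions):
    """Each completed instruction costs one cycle after the fill; every
    misprediction adds the squash penalty on top."""
    return FILL + completed + mispredictions * SQUASH_PENALTY

print(total_cycles(3, 0))  # 7: branch falls through, all 3 instructions complete
print(total_cycles(2, 1))  # 9: branch taken, sub is squashed
```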

The second we know the address of the instruction we want to fetch, we want to know: is it a branch instruction? If it is, where is the target address? Is it taken or not?

For the LC2K, you can reliably know the target address of any branch instruction, since it is just PC + 1 + offset — all of that information is encoded in the instruction itself! You can't reliably know whether the branch will be taken, since that depends on the contents of registers. You can predict with some accuracy, but not 100% accuracy.
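As a sketch, the target computation needs nothing but the instruction's own fields (a simplified view; the real LC2K encodes the offset as a 16-bit two's-complement field in the instruction word):

```python
def branch_target(pc, offset):
    """The target of a beq is fully determined by the instruction itself:
    PC + 1 + offset. No register contents are needed, so it can be
    computed as soon as the instruction is decoded."""
    return pc + 1 + offset

print(branch_target(10, 5))   # 16: a forward branch
print(branch_target(10, -3))  # 8: a backward branch, typical of a loop
```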

Branch Prediction

Predicts the next fetch address (to be used in the next cycle). It requires three things to be predicted at the fetch stage:

  • Whether the fetched instruction is a branch
  • Branch direction (if conditional)
  • Branch target address (if direction is taken)

Observation: The target address remains the same for a conditional direct branch across dynamic instances.

  • Store the target address from previous instance and access it with the PC
  • Called the Branch Target Buffer (BTB) or Branch Target Address Cache
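Here is a minimal BTB sketch (direct-mapped, with an invented size and interface) showing the idea of storing the previous instance's target and looking it up by PC:

```python
class BTB:
    """A tiny direct-mapped branch target buffer, indexed by the
    low bits of the PC (the table size here is arbitrary)."""

    def __init__(self, size=16):
        self.size = size
        self.entries = {}  # index -> (tag, target)

    def predict(self, pc):
        entry = self.entries.get(pc % self.size)
        if entry and entry[0] == pc:
            return entry[1]   # hit: reuse the last instance's target
        return pc + 1         # miss: just guess fall-through

    def update(self, pc, target):
        self.entries[pc % self.size] = (pc, target)

btb = BTB()
print(btb.predict(100))  # 101: never seen this branch, guess fall-through
btb.update(100, 42)      # the branch at PC 100 went to address 42
print(btb.predict(100))  # 42: next time, the stored target is predicted
```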

The first time you encounter an instruction, you just guess. No problem, since the instruction will likely be run hundreds of thousands of times.

  • Compile time (static)

    • Always not taken
    • Always taken
    • BTFN (Backwards taken, forward not taken)
      • Tends to branch backward most of the time (loops)
    • Program analysis based (likely direction)
  • Run time (dynamic)

    • Last time prediction (single bit)
    • Two-bit counter based prediction
    • Two-level prediction (global vs. local)
    • Hybrid

Static Branch Prediction

Always not taken:

  • Simple to implement, no BTB, no direction prediction -> hardware simpler
  • Accuracy is bad: ~30-40%.
  • Compiler can layout code such that the likely path is the "not taken" path

Always taken:

  • No direction prediction
  • Better accuracy: ~60-70% (loops)

Backward taken, forward not taken:

  • If the branch goes backward, predict taken. Else, predict not taken.
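As a sketch, BTFN is a single comparison (assuming the target address is already known, e.g. computed from the instruction's offset):

```python
def btfn_predict(pc, target):
    """Backward taken, forward not taken: a backward branch (target at or
    before the current PC, typical of loop back-edges) is predicted taken."""
    return target <= pc

print(btfn_predict(pc=20, target=12))  # True: loop back-edge, predict taken
print(btfn_predict(pc=20, target=30))  # False: forward branch, predict not taken
```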

Dynamic Branch Prediction

  • Last time predictor: single bit per branch (stored in BTB)
  • Example: TTTTTTTTTNNNNNNNNN: 90% accuracy. However, you could get TNTNTNT... 0% accuracy.

You can use more states to improve accuracy: a 2-bit saturating counter per branch. If the counter's value is 2 or 3, predict taken; if it is 0 or 1, predict not taken. Increment the counter when the branch is taken and decrement it when it is not, saturating at both ends, so a single anomalous outcome doesn't flip the prediction.
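A sketch of the two-bit saturating counter (a standard scheme; the state encoding and starting state here are one common choice):

```python
class TwoBitCounter:
    """States 0-3; 2 and 3 predict taken, 0 and 1 predict not taken."""

    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop taken 9 times, exited once, then re-entered: the single not-taken
# outcome only drops the counter to 2, so the re-entry is still predicted taken.
c = TwoBitCounter()
hits = 0
for taken in [True] * 9 + [False] + [True] * 9:
    hits += (c.predict() == taken)
    c.update(taken)
print(hits)  # 17 of the 19 outcomes predicted correctly
```

Compare that with a last-time (1-bit) predictor on the same pattern, which would mispredict twice per loop exit: once on the exit and once on re-entry.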

Exceptions

Sometimes exceptions occur, such as divide by zero or overflow. When an exception is raised, the hardware ensures that all instructions before the excepting one complete, and that all instructions after it appear never to have occurred (they never changed globally visible state). Then it jumps (much like a jalr) to the memory address of the handler for that exception.

Superscalar pipelining

If you get lucky, you can get close to a CPI of 1 (the ideal case with no stalls). In reality, it won't be quite that good. If you want to improve performance further, then you can use multiprocessors. You could share the cache, but not the register files.

In superscalar pipelining, you build two (or more) pipelines that execute in parallel. Keep in mind that multithreaded programs are much harder for most programmers to debug than single-threaded ones.