Introduction to Computer Organization

EECS 370 Review Session

Single Cycle Datapaths

First of all, the clock period is how long each cycle takes in a single cycle datapath. lw instructions are usually what determine the clock period, since they use the most stuff along the way:

  • Instruction fetch
  • Decode
  • ALU access
  • Memory access
  • Register write

What is beautiful about a single cycle datapath? CPI. CS people get hung up on two terms all the time: clock period and CPI. In a single cycle datapath, you have a CPI of 1. This is super cool. And here's why.

Let's say you have the following processor statistics:

  • Register file read/write: 5ns
  • ALU access: 10ns
  • Instruction memory read: 15ns
  • Data memory read: 20ns
  • Data memory write: 40ns
  • Adder: 2ns
  • Other: 0ns

First, what the heck is the adder doing compared to ALU?

The adder is for incrementing PC. We don't want the PC to be using the ALU, because it's too far away. We instead put in an adder which is hooked up to the PC.

This number 2ns for the adder time really is not relevant, since the instruction memory read takes 15ns, and the adder can be done in parallel. We don't need the adder component for any other instruction.

Note that this means that a single cycle datapath has more complex hardware than a multicycle datapath. In a multicycle datapath, you can reuse hardware components.

Now, we're going to test jalr and see if that determines the clock period. jalr takes instruction fetch, which takes 15ns, then reads the register file, which takes 5ns, then writes back to the register file, which takes 5ns. This means that this is 25ns total. Does this determine the clock period? Maybe not. Let's try lw.

In lw instructions, you have 15ns for the instruction memory read, 5ns for the register file read, 10ns for the ALU access to add the offset to the base register, then 20ns for the data memory read, and 5ns to write back to the register file. This is a total of 55ns. Much larger than for a jalr instruction.

Is 55ns the clock period, then? Not so fast. Notice how we haven't used the 40ns number yet: the data memory write. This is used in sw instructions.

sw takes a 15ns instruction memory read, then a 5ns register file read, then a 10ns ALU access to compute the address, then 40ns to do a data memory write, and that's it. That's a total of 70ns. Damn. So sw is the slowest instruction, and the single cycle clock period has to be 70ns.

What each stage does

  • IF: Reads the instruction from instruction memory.
  • ID: Breaks apart the instruction and reads the register file.
  • EX: Adds any numbers together with ALU access.
  • MEM: Either reads or writes to memory with a data memory access.
  • WB: Writeback, which writes the result to the destination register using a register file write.

Could we go through the things you have to do for sw, lw, jalr?

Yes.

For add, you have instruction fetch, then register file read, then ALU access, then register file write.

beq is just instruction fetch, register file read, and ALU access.

nand is the same thing as add.

noop is just instruction fetch. Note that you also have to do instruction decode, but since this is a single cycle datapath there is no "decode" delay (that falls under the 0ns "other" time).

halt is the same thing as noop. Instruction fetch, and then stops.

  • add = IF + RF read + ALU + RF write
  • nand = IF + RF read + ALU + RF write
  • beq = IF + RF read + ALU
  • noop = IF
  • halt = IF
  • lw = IF + RF read + ALU + MEM read + RF write
  • sw = IF + RF read + ALU + MEM write
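Plugging the component times from the list above into each of these, the per-instruction latencies on the single cycle datapath work out to:

$$\begin{aligned}
\texttt{add/nand} &= 15 + 5 + 10 + 5 = 35\,\text{ns} \\
\texttt{beq} &= 15 + 5 + 10 = 30\,\text{ns} \\
\texttt{jalr} &= 15 + 5 + 5 = 25\,\text{ns} \\
\texttt{noop/halt} &= 15\,\text{ns} \\
\texttt{lw} &= 15 + 5 + 10 + 20 + 5 = 55\,\text{ns} \\
\texttt{sw} &= 15 + 5 + 10 + 40 = 70\,\text{ns}
\end{aligned}$$

The clock period has to accommodate the slowest one, which is why sw's 70ns wins.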

Note: you should always say "register file" read and write, and not "register" read and write. Technically, PC is a register. Always say register file read and write. 370 is about detail, so you have to be careful.

That's just all you have to know about single cycle datapaths.

Multicycle Datapaths

Justification

First off, here's the justification for using multicycle datapaths. In single cycle, we do one whole instruction in each cycle. Let's say the test program you're going to run is 100% noops. We're just idiots running noops all the time. The problem is that the single cycle datapath is designed to run all instructions, so the clock period is 70ns. If you have 1000 noops, this would take 70,000ns, and you've gotten barely any work done with all those noops.

If we were to graph work vs. time, you would spend the first 15ns fetching, and then wait the remaining 55ns until the next clock edge. This is beyond inefficient. Imagine a single cycle processor that was built just for noops: the clock period would be 15ns, so you'd be doing work all the time.

Determining Clock Cycle

Someone looked at single cycle, and thought "wow, this is dumb. Let's partition the instruction up into parts, and then run one part per cycle." This makes sense: you have five parts to each instruction:

  • Instruction fetch
  • Instruction decode
  • Execute
  • Memory
  • Writeback

If you're a noop, then you just do fetch and decode, and then instead of waiting, you immediately go to the next instruction. That sounds nice, right?

  • In fetching, all that happens is instruction memory. This is 15ns always.

  • In decode, this is just register read so that's 5ns.

  • In execute, that's just ALU access, which is 10ns.

  • With memory, you could read or write, so account for the worst case. This is the time it takes to do a data memory write, or 40ns.

  • In writeback, this is register file write, or 5ns.

Now, the clock period is forced to be 40ns. With a noop, this means you have to do instruction fetch and instruction decode: two cycles of 40ns, or 80ns total. Wait, this is worse than the 70ns it took on single cycle! What the heck? We can do better, and that's where pipelining comes in.

In the multicycle datapath, lw is definitely the longest: it always takes five cycles. Pretty much everything else takes 4 cycles (nobody ever mentions jalr).

Exam-Style Question

Pretty much every exam from the past 10 years or so has a question like this.

Some information:

  • 40% add/nand
  • 25% beq
  • 30% lw
  • 5% sw

Let's say you have 2000 instructions. You know the clock period to be 40ns. How long does this take to execute on a multicycle datapath?

This means 800 add and nand instructions, 500 beq instructions, 600 lw instructions, and 100 sw instructions. On a multicycle datapath without pipelining, this takes:

  • 800 instructions × 4 cycles × 40ns = 128 microseconds for add/nand
  • 500 instructions × 4 cycles × 40ns = 80 microseconds for beq
  • 600 instructions × 5 cycles × 40ns = 120 microseconds for load word
  • 100 instructions × 4 cycles × 40ns = 16 microseconds for store word

This is a total of 344 microseconds.

A fast way of calculating CPI (assuming there are no noops): everything takes 4 cycles except for load word, which takes five. 30% of the instructions take one extra cycle, so the CPI is 4 + 0.3 = 4.3. Your number of instructions is 2000, so:

$$2000 \text{ instructions} \times 4.3 \text{ CPI} \times 40\,\text{ns per cycle} = 344{,}000\,\text{ns} = 344\,\mu\text{s}$$
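If you ever want to sanity-check this arithmetic, here's a minimal C sketch of the same calculation. The instruction mix, cycle counts, and clock period are taken straight from the example above; nothing else is assumed.

#include <stdio.h>

int main(void) {
    double num_insts = 2000;
    double clock_ns = 40;
    /* instruction mix and multicycle cycle counts: add/nand, beq, lw, sw */
    double frac[]   = {0.40, 0.25, 0.30, 0.05};
    double cycles[] = {4, 4, 5, 4};

    double cpi = 0;
    for (int i = 0; i < 4; i++)
        cpi += frac[i] * cycles[i];

    printf("CPI = %.2f\n", cpi);                              /* 4.30 */
    printf("time = %.0f ns\n", num_insts * cpi * clock_ns);   /* 344000 ns */
    return 0;
}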

Pipelining

Justification

In the multicycle datapath, there is an instruction memory, a register file, an ALU, a data memory, and a writeback path. When we did multicycle, we said that it came out worse than single cycle, but it's a stepping stone: it sets up the idea that makes pipelining possible.

First we do fetch. In fetch, what do we use? Instruction memory. Throughout the fetch cycle, the rest of the components are unused. While you're decoding, you use the register file and nothing else. When you're executing, you use the ALU only and nothing else. This is horribly inefficient. What pipelining says is: while you're executing the first instruction, why not fetch the next instruction?

Advantages

Every cycle, an instruction is completed, so CPI is around 1. Even though every instruction has to go through five different stages, at the end of every cycle an instruction finishes. This is smart, because you don't have to wait for the drastically long clock period of single cycle, but you also don't have to wait five cycles for a lw to complete.

The problem with this is when you have dependencies. Let's say you have:

add  1   2   3
add  3   6   5

Let's say your register values are:

  • R[1] = 100
  • R[2] = 100
  • R[3] = 0
  • R[6] = 0

In the ideal case, the first add should put 100 + 100 = 200 into register 3, and the second add should then compute 200 + 0 = 200 into register 5.

Data Hazards

However, when the EX of the first add happens, the ID of the second add is happening at the same time. The first instruction hasn't written back yet, so register 3 still appears to be 0. This is a data hazard.
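Laid out cycle by cycle on the five-stage pipeline, the problem looks like this:

$$\begin{array}{l|cccccc}
 & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline
\texttt{add 1 2 3} & \text{IF} & \text{ID} & \text{EX} & \text{MEM} & \text{WB} & \\
\texttt{add 3 6 5} & & \text{IF} & \text{ID} & \text{EX} & \text{MEM} & \text{WB}
\end{array}$$

The second add reads register 3 in ID during cycle 3, but the first add doesn't write register 3 until WB in cycle 5.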

There are several options to deal with this.

Avoiding

You just avoid the case where this happens, in software: the compiler or programmer spaces out dependent instructions. This is super lazy for hardware and super annoying for software.

Detect and Stall

Well, just insert noops so you can wait! This means you have to wait until the writeback of the first instruction. This is sort of nice, since whoever is writing the code does not have to worry about the pipeline. That's basically the whole reason C was invented: you shouldn't have to think about the hardware underneath.

This destroys our "CPI of 1" picture. Every time there is a dependency, you add two noops and wait until the WB of the first instruction finishes.

Even though we hadn't written the answer back, we had it all along: it's sitting there at the end of EX. It's like bureaucracy: everyone knows what the answer is, but you just have to push it through anyway.

Detect and Forward

To solve this, we install a forwarding path from EX back to ID. This only kicks in if there is a dependency; if there isn't one, you just read from the register file like normal. How do we know that there's a dependency? We'll get into that a little bit later.

There is only one case in which you would need a noop here.

A lw's result isn't known at EX, though. EX computes regA + offset, but lw needs to READ from the memory location at regA + offset. That value only exists after MEM, since the EX result is just an address. So we have to wait one cycle (by putting in a noop) for the lw to get through MEM, and we add a forwarding path from MEM to ID as well.

Hence, in detect and forward, you only add a noop if there is a lw instruction that writes to a register which is used by the very next instruction.

How do we detect if we have a data hazard?

Every time an instruction goes through, you ask two questions:

  • Does this instruction write something back to the register file? If the instruction is add, nand, or lw, this is yes.
  • What register does this write back to? This has to be a register 0 - 7.

The pipeline always keeps track of the instructions in EX, MEM, and WB, and which registers they are writing to. Every time the pipeline moves forward, this bookkeeping slides along with it, because the instructions move.
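Here's a minimal C sketch of that bookkeeping, assuming the instruction sitting in ID reads registers regA and regB. The struct and function names are made up for illustration; this isn't the project's actual code.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical record of an older instruction still in the pipeline. */
struct in_flight {
    bool writes_reg;   /* true for add, nand, and lw */
    int  dest_reg;     /* the register 0-7 it will write back to */
};

/* Does the instruction currently in ID (reading regA and regB) depend on
 * any of the instructions in EX, MEM, or WB? */
bool has_data_hazard(struct in_flight ex, struct in_flight mem,
                     struct in_flight wb, int regA, int regB)
{
    struct in_flight older[3] = { ex, mem, wb };
    for (int i = 0; i < 3; i++) {
        if (older[i].writes_reg &&
            (older[i].dest_reg == regA || older[i].dest_reg == regB)) {
            return true;   /* forward from the closest stage that matches */
        }
    }
    return false;
}

int main(void) {
    /* add 1 2 3 is in EX (it will write register 3);
     * add 3 6 5 is in ID (it reads registers 3 and 6) */
    struct in_flight ex  = { true, 3 };
    struct in_flight mem = { false, 0 };
    struct in_flight wb  = { false, 0 };
    printf("hazard? %d\n", has_data_hazard(ex, mem, wb, 3, 6));   /* prints 1 */
    return 0;
}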

Note: you do not implement jalr for pipelining. jalr is a huge control hazard.

So, back to detecting data hazards. When you do ID, you check if there is a hazard to begin with, and where you have to forward from.

Control Hazards

Let's say that you have a beq instruction. Later instructions would have to start flowing through the pipeline while you're still deciding whether this beq is taken or not. Because of this branch behavior, you don't know which instruction to fetch next until the branch resolves. This is a control hazard.

Avoiding

One option is to just avoid it! Why would you use branches anyway!? Again, this is a supremely lazy choice for hardware and really annoying for software.

Detect and Stall

First, you detect in decode that there is a branch, and you insert noops until you know whether it is taken or not. This is the same idea as detect and stall for data hazards.

Speculate and Squash (a.k.a., "Gamble and Lose")

You know what? I don't need to stall. If I can just predict whether the branch will be taken, I can fetch from the predicted path. No big deal. Of course, you can't predict perfectly, because you're not an oracle. If you're right, yay! You lost nothing. If you're wrong, you have to go into the pipeline registers and overwrite whatever speculative instructions are in there with noops (squash them).

In the worst case, this is as bad as detect and stall. In practice you won't be wrong every time, but you won't be right every time either; it depends entirely on how well you can predict branches. This is a hot topic of research.

Branch Prediction

1-bit Predictor

There are a few ways to do this. Let's create a 1-bit predictor: 0 means not taken and 1 means taken. Whenever a branch is taken, you set the prediction bit to 1; whenever it is not taken, you reset it to 0.

If the register is 1, predict taken. Else, predict not taken.

2-bit Predictor

You can also use two bits. You start from some initial value, say 00. When a branch is taken, you increment this counter (saturating at 11). When it is not taken, you decrement it (saturating at 00). When the counter is 00 or 01, predict not taken. When it is 10 or 11, predict taken.
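As a concrete sketch, here is what a 2-bit saturating counter predictor could look like in C. (The 1-bit predictor is the same idea with the counter capped at 1.) The function names are made up for illustration.

#include <stdbool.h>
#include <stdio.h>

/* 2-bit saturating counter: 00/01 predict not taken, 10/11 predict taken. */
static int counter = 0;   /* start at 00 */

bool predict_taken(void) {
    return counter >= 2;
}

void update_predictor(bool actually_taken) {
    if (actually_taken) {
        if (counter < 3) counter++;   /* saturate at 11 */
    } else {
        if (counter > 0) counter--;   /* saturate at 00 */
    }
}

int main(void) {
    /* a branch that goes taken, taken, not taken, taken */
    bool outcomes[] = { true, true, false, true };
    for (int i = 0; i < 4; i++) {
        printf("predicted %s, actually %s\n",
               predict_taken() ? "taken" : "not taken",
               outcomes[i] ? "taken" : "not taken");
        update_predictor(outcomes[i]);
    }
    return 0;
}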

Three C's

The three C's are the three kinds of cache misses: compulsory, capacity, and conflict. The way this is taught has changed a lot over the years in 370. Once you understand it, it kind of becomes mechanical, so professors take shortcuts, and then on the homework you get confused.

Three C's Example

We have a cache with the following configuration: 2-way set associative, write-allocate, total size of 32 bytes, and block size of 8 bytes. The replacement policy is LRU. The cache is empty at the start. All loads and stores are 1-byte operations.

The most important piece of information when you're doing the 3 C's is the block size. We only care about the block size in the beginning, because the first step is to simulate a fully associative cache (same total size, same block size). With 8-byte blocks, that means you can ignore the last log2(8) = 3 bits as the block offset.
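In code, dropping those block offset bits is just a shift and a mask. A quick sketch:

#include <stdio.h>

int main(void) {
    unsigned int addr = 0x10;                /* the first access in the list below */
    unsigned int block_offset = addr & 0x7;  /* low log2(8) = 3 bits */
    unsigned int block_number = addr >> 3;   /* everything above the offset */
    printf("addr 0x%02x -> block 0x%02x, offset %u\n",
           addr, block_number, block_offset);
    return 0;
}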

Accesses:

  • 0x10, load
  • 0x18, load
  • 0x20, store
  • 0x30, store
  • 0x18, load
  • 0x10, load
  • 0x30, store
  • 0x28, load
  • 0x20, store
  • 0x18, store
  • 0x10, load

Let's say your first access is 0x10 = 00010000. This means your tag is 00010 and block offset is 000.

Your second access is 0x18 = 00011000. This means your tag is 00011 and block offset is 000.

Your third access is 0x20 = 00100000. This means your tag is 00100 and block offset is 000.

Your fourth access is 0x30 = 00110000. This means your tag is 00110 and block offset is 000.

Now, on your fifth access, you access 0x18 = 00011000. You have already accessed the block 00011.

On the sixth access, you access 0x10 = 00010000. You have already accessed the block 00010.

On the seventh access, you access 0x30 = 00110000. You have already accessed the block 00110.

On the eighth access, you access 0x28 = 00101000. This is the tag 00101, which you have not accessed before.

On the ninth access, you access 0x20 = 00100000, which is a tag of 00100. You have already accessed this block, so this cannot be a compulsory miss.

On the tenth access, you access 0x18 = 00011000, which is a tag of 00011. Already accessed.

On the eleventh access, you access 0x10 = 00010000, which is a tag of 00010. Already accessed.

Now. There are four blocks in the cache, since the total size is 32 bytes and the block size is 8 bytes. We have a LRU replacement policy.

Only misses that newly appear in this finite-size simulation count as capacity misses. All the old, first-time misses remain compulsory misses.

For the first four accesses, your cache looks like:

$$(10, 11, 100, 110)$$

LRU is 10.

Since the next reference is 11, this is a hit.

Since the sixth reference is 10, this is a hit and LRU is now 100.

Since the seventh reference is 110, this is a hit and LRU is still 100.

Since the eighth reference 101 is a miss, you evict the LRU block of 100 and replace it with 101. LRU is now 11.

Since the ninth reference was 100, and you just replaced it, this is a capacity miss.

For the last two references: the ninth miss evicted 11 and the tenth miss evicts 10, so neither block is in the cache when it's referenced. Both are capacity misses as well.
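If you want to double-check a walkthrough like this one, the fully associative step is easy to simulate. Here is a minimal C sketch; the cache size, block size, and access list are the ones from this example, and the compulsory/capacity labels follow the rules above.

#include <stdio.h>

#define NUM_BLOCKS 4   /* 32-byte cache / 8-byte blocks */

int main(void) {
    /* the access list from this example (loads and stores behave the same
     * here, since the cache is write-allocate) */
    unsigned int accesses[] = {0x10, 0x18, 0x20, 0x30, 0x18, 0x10,
                               0x30, 0x28, 0x20, 0x18, 0x10};
    int n = sizeof(accesses) / sizeof(accesses[0]);

    unsigned int cache[NUM_BLOCKS];  /* block numbers, most- to least-recently used */
    int filled = 0;
    unsigned int seen[32];           /* every block number touched so far */
    int num_seen = 0;

    for (int a = 0; a < n; a++) {
        unsigned int block = accesses[a] >> 3;   /* drop the 3 offset bits */

        int hit = -1;
        for (int i = 0; i < filled; i++)
            if (cache[i] == block) { hit = i; break; }

        int first_time = 1;
        for (int i = 0; i < num_seen; i++)
            if (seen[i] == block) first_time = 0;
        if (first_time) seen[num_seen++] = block;

        if (hit >= 0) {
            /* hit: slide the block up to the most-recently-used slot */
            for (int i = hit; i > 0; i--) cache[i] = cache[i - 1];
            cache[0] = block;
            printf("access %2d (0x%02x): hit\n", a + 1, accesses[a]);
        } else {
            /* miss: evict the LRU block if the cache is full, insert as MRU */
            int shift = (filled < NUM_BLOCKS) ? filled : NUM_BLOCKS - 1;
            for (int i = shift; i > 0; i--) cache[i] = cache[i - 1];
            cache[0] = block;
            if (filled < NUM_BLOCKS) filled++;
            printf("access %2d (0x%02x): %s miss\n", a + 1, accesses[a],
                   first_time ? "compulsory" : "capacity");
        }
    }
    return 0;
}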

If your cache is fully associative, you cannot have conflict misses. The whole idea of conflict misses is that you can increase the associativity to increase the likelihood of a hit. In a fully associative cache, you can't increase the associativity.

In the set associative cache in our example, there are 4 blocks and 2 ways per set, so there are 2 sets and you need 1 set index bit. It sits just above the 3 block offset bits, so the fourth-rightmost bit is the set index bit. Now, the cache access pattern is:

  • 1: Tag: 1, Set: 0, Miss
  • 2: Tag: 1, Set: 1, Miss
  • 3: Tag: 2, Set: 0, Miss
  • 4: Tag: 3, Set: 0, Miss
  • 5: Tag: 1, Set: 1, Hit
  • 6: Tag: 1, Set: 0, CONFLICT MISS. This is because this was a hit in a fully associative cache, and a miss in a set associative cache.
  • 7: Tag: 3, Set: 0, Hit. Tag 3 went into set 0 on access 4, and the miss on access 6 evicted set 0's LRU block (tag 2), not tag 3, so tag 3 is still there.
  • 8: Tag: 2, Set: 1, Miss
  • 9: Tag: 2, Set: 0, Capacity miss
  • 10: Tag: 1, Set: 1, Hit. This one was a capacity miss in the fully associative simulation, but set 1 only ever holds tags 1 and 2, so in the actual cache it hits. You don't have your final tally of hits until you do this step.
  • 11: Tag: 1, Set: 0, Capacity miss

Another Example

A friend that leads a robotics team approaches you to help optimize some image processing code. The code is designed to run on a simple low-power microcontroller where an "int" is 4 bytes. The processor uses a fully-associative data cache with 16-byte blocks that can hold 8 blocks.

When the following code compiles, everything but "pixels" is assigned to a register (i.e., references to "pixels" are the only ones that may use the cache). Assume the pixels array starts at address 0.

Compute the number of cache misses given the following code:

#include <stdio.h>

struct pixel {
    int x;
    int y;
    int color_intensity;
    int filter_response;
};

struct pixel pixels[256];

void compute_average(void) {
    int i;
    int sum;

    sum = 0;
    for (i = 0; i < 256; i++)
        sum += pixels[i].x;
    printf("avg_x=%d\n", sum/256);
    sum = 0;
    for (i = 0; i < 256; i++)
        sum += pixels[i].y;
    printf("avg_y = %d\n", sum/256);
}

This just iterates through all pixels, and finds the average of x and average of y.

Our cache is fully associative, which means that there is some sort of LRU thing going on. There are 16 byte blocks, which is very convenient since ints are 4 bytes. Therefore, every block can hold one pixel. There are 8 blocks, so you can hold 8 structs at a time.

In the first for loop, i goes from 0 to 255, so there are 256 loads of pixels[i].x. Every one of them misses: each pixel sits in its own block, you only bring in one block per miss, and the next struct is always a new block. The second loop does the same thing with pixels[i].y, and by the time it runs, the early pixels have long been evicted, so that's another 256 misses: 512 misses total. Now, if our blocks were 32-byte blocks, each block would hold two pixels, so the first access would be a miss and the second one would be a hit each time, for all 256 loads. Hence, the hit rate would be 50%.

Here's one optimization we could make:

int sum_1 = 0, sum_2 = 0;
for (int i = 0; i < 256; i++) {
    sum_1 += pixels[i].x;
    sum_2 += pixels[i].y;
}

This combines the two loops into one. Now, each pixels[i].y call is a cache hit, because the miss on pixels[i].x already brought the whole 16-byte pixel into the cache.

Something else that would be really helpful would be to split up the pixel struct into:

struct pixel_coord {
    int x;
    int y;
};

struct pixel_vals {
    // ...
};

Now, each pixel_coord struct only takes up 8 bytes of memory, so one 16-byte block holds two of them. Over two iterations of the combined loop (four accesses), you miss once and hit three times. You just halved the miss rate again.
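To make that concrete, here's roughly what the combined loop could look like after the split, using the pixel_coord struct from above (the array and variable names are made up):

#include <stdio.h>

/* coordinates now live in their own tightly packed array */
struct pixel_coord coords[256];

void compute_averages(void) {
    int sum_x = 0;
    int sum_y = 0;
    for (int i = 0; i < 256; i++) {
        sum_x += coords[i].x;   /* misses only on every other i: one 16-byte
                                   block holds coords[i] and coords[i+1] */
        sum_y += coords[i].y;   /* always a hit: same block as coords[i].x */
    }
    printf("avg_x=%d avg_y=%d\n", sum_x / 256, sum_y / 256);
}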

Virtual Memory

Justification

Earlier, we learned from pipelining that programmers don't want to have to deal with avoidance - they don't want to have to write programs to fit the hardware. CS people will make CE people do their bidding. Virtual memory is another thing that CE people did for CS people.

What happens is that my program needs to deal with addresses all the time. I will have some type of address in here:

lw 0 1 0x8000

Let's say I love working with this address. At the same time, my friend is playing with the same memory address on a program that is running at the same time:

lw 0 2 0x8000

What happens when I run the two at the same time? You can end up getting undefined behavior. It would be really cool if you could use some level of indirection so you could split up the memory into separate "chunks", which are independent of each other.

When these addresses go into a magic black box for virtual memory, they are translated from a virtual address space to a physical address space. If virtual addresses are 16 bits wide, you can reference 2^16 possible addresses.

Sometimes it's really hard for a computer to know exactly which virtual address maps to which physical address. So you create a page table. A page table lives in physical memory, and each program has its own page table. A page table is a mapping from virtual addresses to physical addresses. If you translated every single address, then with 16-bit addresses you would need 2^16 entries in each page table, for each program. This can get really big, really fast. Each of those addresses has to be translated somehow.

You do not want to have 2^16 entries, one per virtual address. So main memory is partitioned into pages; these are called physical pages. Let's start there. If your page size is 16 bytes, then the zeroth page holds addresses 0-15, the next page holds addresses 16-31, and so on, and the page table only needs one entry per page.
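To put numbers on that: with 16-byte pages, the low log2(16) = 4 bits of an address are the page offset and the remaining 12 bits pick the virtual page, so a 16-bit address space needs

$$\frac{2^{16} \text{ addresses}}{16 \text{ bytes per page}} = 2^{12} = 4096 \text{ page table entries}$$

per program instead of 65,536.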

Why does a page table need to live in memory? Why can't we just design hardware to do page table stuff for us? Imagine you have hardware doing this. What happens? You would need more hardware. If you have hardware designed to hold four page tables, your computer can only be doing four processes at once.

What kind of data structure is a single-level page table?

A single-level page table is implemented with an array, indexed by virtual page number. A multi-level page table is implemented as a tree: each level is an array whose entries point to the next level's tables.
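As a sketch of the single-level case, here's what that array lookup might look like in C, using the 16-bit addresses and 16-byte pages from the example above (the page table contents are made up):

#include <stdio.h>

#define PAGE_SIZE  16                          /* bytes per page */
#define NUM_PAGES  ((1 << 16) / PAGE_SIZE)     /* 4096 virtual pages */

/* One entry per virtual page number (VPN): which physical page it maps to.
 * A real entry would also carry a valid bit, protection bits, and so on. */
unsigned int page_table[NUM_PAGES];

unsigned int translate(unsigned int vaddr) {
    unsigned int vpn    = vaddr / PAGE_SIZE;   /* which virtual page */
    unsigned int offset = vaddr % PAGE_SIZE;   /* where inside the page */
    unsigned int ppn    = page_table[vpn];     /* the array lookup */
    return ppn * PAGE_SIZE + offset;
}

int main(void) {
    page_table[0x8000 / PAGE_SIZE] = 42;       /* pretend 0x8000's page maps to physical page 42 */
    printf("virtual 0x8000 -> physical 0x%x\n", translate(0x8000));
    return 0;
}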