Remember that the memory in a modern system has many levels of cache. Typically, the more data a level of the memory hierarchy can store, the slower it is. It turns out that roughly a third of a processor core's area is cache.
For each cache line, you have:

- a valid bit,
- the tag, and
- the data itself (one byte, since the block size is 1).

With the small example we will use below (16 memory addresses, so a 4-bit tag, plus the valid bit), that is 5 bits of overhead for every 8 bits of data! This is a problem. How do we reduce it? What if, instead of storing one byte per tag, we stored multiple bytes per tag?
Advantages:

- Each tag now covers multiple bytes of data, so the per-byte overhead drops, and the tag itself gets smaller (shown next).
- A miss brings in neighboring bytes as well, which pays off when accesses have spatial locality (more on this below).

Let's say you have 16 memory addresses and originally a block size of 1. If you double the block size, the number of blocks is halved, and hence you can actually shave a bit off of the tag field!
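To make this concrete: with 16 addresses and a block size of 1, the tag must distinguish 16 blocks, so it needs $$\log_2(16) = 4$$ bits. Doubling the block size to 2 leaves only $$16 / 2 = 8$$ blocks, so the tag needs just $$\log_2(8) = 3$$ bits.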
ld r1 M[1]
This loads memory locations 0 and 1. Originally the cache is empty, so this is a miss. The cache then looks like:

| Tag | Data       |
|-----|------------|
| 0   | M[0], M[1] |

Notice how both M[0] and M[1] are loaded. Note that the tag is 0 - this is the block number of the first block of size 2 in memory.
Next, let's say you do:
ld r2 M[5]
This loads memory locations 4 and 5, and this is also a miss. The cache now looks like:

| Tag | Data       |
|-----|------------|
| 0   | M[0], M[1] |
| 2   | M[4], M[5] |

Again, the tag is the address divided by the block size (integer division): $$5 / 2 = 2$$.
Now, if you do:
ld r3 M[1]
This is a hit, since you have loaded the value before.
Here's where it gets interesting. Say you load M[4]. You haven't seen this address before, but it's in your cache since you loaded the block with M[5] in it already.
ld r4 M[4]
This is a hit.
Notice how when we accessed address 5, we also brought in address 4. This turned out to be a good thing, since we later referenced address 4 and found it in the cache.
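As a minimal sketch of this behavior (my own illustration in Python, not part of the lecture), the cache below is just a dictionary keyed by block number; capacity limits and replacement are ignored so that only the block-granularity lookup is visible:

```python
BLOCK_SIZE = 2
cache = {}  # block number (tag) -> the bytes of that block

def load(addr, memory):
    """Return the byte at addr, printing whether the access hit in the cache."""
    block = addr // BLOCK_SIZE            # which block the address falls into
    offset = addr % BLOCK_SIZE            # position of the byte inside the block
    if block in cache:
        print(f"ld M[{addr}]: hit  (block {block})")
    else:
        # Miss: bring in the whole block, not just the requested byte.
        start = block * BLOCK_SIZE
        cache[block] = memory[start:start + BLOCK_SIZE]
        print(f"ld M[{addr}]: miss (block {block} loaded)")
    return cache[block][offset]

memory = list(range(16))         # toy memory with 16 byte addresses
for addr in (1, 5, 1, 4):        # the access sequence from the example above
    load(addr, memory)
# Prints: miss, miss, hit, hit - the same pattern as the trace.
```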
Due to spatial locality, if you reference a memory location (e.g., 1000), you are likely to soon reference a location near it (e.g., 1001). This is in contrast to temporal locality, which says that you are likely to reference again something you referenced recently. These two forms of locality are the foundation of how caches work.
You have to store the tag and the block offset together. Let's say that you have 32 bits to do so. Let's look at a couple of scenarios to determine how big the tag needs to be, and how big the block offset needs to be.
If your blocks have a size of one byte, then you do not need to store a block offset whatsoever. The tag size is 32 bits, and block offset size is 0 bits.
If your blocks have a size of two bytes, then you have to store a single bit for the block offset, and you have to use the remaining bits for the tag. The tag size is then 31 bits, and block offset size is 1 bit.
In general, the block offset size is always going to be $$\log_2 (b)$$, where $$b$$ is your block size in bytes. The tag size will always be the number of bits for the address (tag + offset), minus the block offset size.
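Here is a quick Python sketch of that arithmetic (it assumes the block size is a power of two, so that $$\log_2(b)$$ is a whole number):

```python
import math

ADDRESS_BITS = 32  # tag + block offset together, as above

def field_sizes(block_size_bytes):
    """Return (tag_bits, offset_bits) for a given block size in bytes."""
    offset_bits = int(math.log2(block_size_bytes))
    tag_bits = ADDRESS_BITS - offset_bits
    return tag_bits, offset_bits

for b in (1, 2, 64):
    tag, off = field_sizes(b)
    print(f"block size {b:>2} B -> tag {tag} bits, offset {off} bits")
# block size  1 B -> tag 32 bits, offset 0 bits
# block size  2 B -> tag 31 bits, offset 1 bits
# block size 64 B -> tag 26 bits, offset 6 bits
```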
Where does the best equilibrium lie? How big should the block size be to maximize locality?
Let's say that you have a super huge block size. If the block size is on the scale of your cache, then you only have one cache line, so any access outside that single block misses. This is really inefficient. On the other hand, if your block size is one byte, you have to do a lot of repeated loads, since even neighboring bytes each require their own trip to memory. How do we decide the best block size?
It turns out, computer engineers just take a bunch of applications that they believe are representative of what their processor will run in real life, and they simulate the performance of different block sizes (and other cache parameters).
Most systems use a block size between 32 bytes and 128 bytes. These larger blocks reduce the overhead, since a single tag now covers much more data, and they take advantage of spatial locality on every miss.
When you load something from memory into a full cache, you have to evict a block to make room; with LRU replacement, you evict the least recently used one. To keep track of this, the hardware keeps a list of the cache lines ordered by how recently they were used. To store this ordering, you need some bits per cache line to record its position in the LRU list.
Since you need to store a position number for each cache line, this amounts to:

$$\log_2 (l)$$

bits per cache line, where $$l$$ is the number of cache lines you have.
Let's say that you have a cache with a total size of 8 bytes. The block size is 2 bytes, and it uses LRU replacement. The memory address size is 16 bits, and it is byte addressable.
You know that there are 16 bits for the memory address. Since the number of bits for the block offset is $$\log_2 (b) = \log_2 (2) = 1$$, there are 15 bits remaining for the tag.
Since there are 8 bytes for the cache's total size, and 2 bytes per block, there are 4 lines total.
Since you had three hits out of nine loads in total, this is a hit rate of about 33%.
You always need a valid bit (1 bit). You also need 15 bits for the tag, and since there are 4 lines, you need $$\log_2 (4) = 2$$ bits for LRU. This is a total of 18 bits of overhead per cache line.
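To see how the LRU ordering decides evictions, here is a small Python sketch of a fully associative cache with these parameters (4 lines, 2-byte blocks). The access trace at the bottom is a made-up one for illustration, not the nine-load trace from the example above.

```python
from collections import OrderedDict

NUM_LINES = 4      # 8-byte cache / 2-byte blocks
BLOCK_SIZE = 2

# OrderedDict remembers insertion order; moving a line to the end on every
# access keeps the least recently used line at the front.
lines = OrderedDict()   # tag -> block data (the data itself is not modeled)
hits = misses = 0

def access(addr):
    global hits, misses
    tag = addr // BLOCK_SIZE
    if tag in lines:
        hits += 1
        lines.move_to_end(tag)          # mark as most recently used
    else:
        misses += 1
        if len(lines) == NUM_LINES:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = None               # allocate the line for the new block

for addr in (0, 2, 4, 6, 0, 8, 2):      # hypothetical trace
    access(addr)

print(f"{hits} hits, {misses} misses")  # 1 hit, 6 misses
```

Note how the hit on M[0] right before the load of M[8] moves block 0 to the most-recently-used end, so when M[8] misses, LRU evicts block 1 (addresses 2-3) instead; that is why the final load of M[2] misses.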
Where should you write the result of a store? One of two things could happen.
Address is in cache: Let's say that your address is in the cache. Should you update memory as well as the cache? If you do, this is known as a write-through policy. There is also a write-back policy, in which memory is only updated later, when the block is evicted from the cache.
Address is not in cache: One option is to just write to memory and not bring it into the cache (no allocate-on-write policy). Another option is to put the line into the cache (allocate-on-write policy).
It turns out that programs often write to the same address over and over again. With a write-through policy, every time you update the line in the cache you have to propagate that change to memory, and that's going to take forever.
Can we also design the cache to NOT write all stores to memory immediately? We can keep the most recent copy in the cache and update the memory only when the data is evicted from the cache.
To keep track of this, you have to keep a dirty bit, which is reset when the line is allocated and set when the block is stored to. If a block is dirty when it is evicted, you write its data back to memory.
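Here is a rough Python sketch (my own illustration, with made-up helper names) of how the dirty bit drives a write-back, allocate-on-write cache: stores only mark the line dirty, and memory is updated when a dirty line is evicted.

```python
BLOCK_SIZE = 2

class Line:
    def __init__(self, tag, data):
        self.tag = tag
        self.data = data
        self.dirty = False               # reset when the line is allocated

cache = {}                               # tag -> Line (capacity/LRU omitted)
memory = bytearray(16)                   # toy 16-byte memory

def fill(tag):
    """Bring a block into the cache on a miss."""
    start = tag * BLOCK_SIZE
    cache[tag] = Line(tag, bytearray(memory[start:start + BLOCK_SIZE]))
    return cache[tag]

def store(addr, value):
    tag, offset = addr // BLOCK_SIZE, addr % BLOCK_SIZE
    line = cache.get(tag) or fill(tag)   # allocate-on-write on a miss
    line.data[offset] = value
    line.dirty = True                    # set when the block is stored to
    # Note: memory is NOT touched here (write-back, not write-through).

def evict(tag):
    line = cache.pop(tag)
    if line.dirty:                       # only dirty lines go back to memory
        start = tag * BLOCK_SIZE
        memory[start:start + BLOCK_SIZE] = line.data

store(5, 42)      # repeated stores to the same address touch only the cache...
store(5, 43)
evict(2)          # ...memory sees just the final value, and only on eviction
print(memory[5])  # 43
```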
Let's say you have a 32-bit memory address, a 64KB cache (useful data), a 64 byte cache block size, write-allocate, and write-back.
This cache will need 512 kilobits for the data area ($$64 \times 1024 = 65536$$ bytes, which is $$524288$$ bits, or 512 kilobits). Note that 1 kilobyte = 1024 bytes. Besides the actual cached data, this cache will need other storage.
You will need $$\log_2(64) = 6$$ bits for the block offset, and $$32 - 6 = 26$$ bits for the tag.
Since the cache holds 64 KB of data and each block is 64 bytes, there are $$65536 / 64 = 1024$$ lines.
We will also need a valid bit, a dirty bit, 26 tag bits, and 10 bits for LRU ($$\log_2(1024) = 10$$). This gives a total of 38 bits of overhead per cache line.
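A quick back-of-the-envelope script (just the same arithmetic in Python, under the assumptions above) to check these numbers:

```python
import math

ADDRESS_BITS = 32
CACHE_BYTES = 64 * 1024                          # 64 KB of useful data
BLOCK_BYTES = 64

offset_bits = int(math.log2(BLOCK_BYTES))        # 6
tag_bits = ADDRESS_BITS - offset_bits            # 26
num_lines = CACHE_BYTES // BLOCK_BYTES           # 1024
lru_bits = int(math.log2(num_lines))             # 10
overhead_per_line = 1 + 1 + tag_bits + lru_bits  # valid + dirty + tag + LRU = 38

data_kilobits = CACHE_BYTES * 8 // 1024          # 512 kilobits of data
overhead_kilobits = num_lines * overhead_per_line / 1024

print(overhead_per_line, data_kilobits, overhead_kilobits)  # 38 512 38.0
```

So the bookkeeping adds about 38 kilobits on top of the 512 kilobits of data, a bit over 7% extra storage.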
Suppose that accessing the cache takes 10ns, that going to main memory on a cache miss costs an additional 100ns (the miss penalty), and that the hit rate is 97%.
You need the equation for average memory access time (AMAT):
$$\text{AMAT} = \text{cache hit latency} + \text{cache miss rate} \times \text{miss penalty}$$
In this case, we have:
$$10\text{ns} + (1 - 0.97)(100\text{ns}) = 13\text{ns}$$
Now suppose you use a larger cache whose hit latency is 2ns higher, but whose hit rate rises to 98%:

$$\text{AMAT} = (10\text{ns} + 2\text{ns}) + (1 - 0.98)(100\text{ns}) = 14\text{ns}$$
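The same calculation as a tiny Python sketch (the latencies and hit rates are the ones used above):

```python
def amat(hit_latency_ns, hit_rate, miss_penalty_ns=100):
    """Average memory access time = hit latency + miss rate * miss penalty."""
    return hit_latency_ns + (1 - hit_rate) * miss_penalty_ns

print(f"{amat(10, 0.97):.1f} ns")  # 13.0 ns: smaller, faster cache
print(f"{amat(12, 0.98):.1f} ns")  # 14.0 ns: larger cache, slower hit, higher hit rate
```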
Note that this brings the entire lecture back to the original tradeoff: a larger cache gives you a higher hit rate, but at the cost of a longer access time on every hit. As this example shows, that can make the average access time worse overall.