Remember that the memory in a modern system has many levels of cache. Typically, the more data a level of the memory hierarchy can store, the slower it is. It turns out that roughly a third of a processor core's area is cache.
For each cache line, you have:

- a valid bit,
- the tag, and
- the data itself (one byte, since the block size is 1).

With the small example we will use below (16 memory addresses, so a 4-bit tag, plus the valid bit), that is 5 bits of overhead for every 8 bits of data! This is a problem. How do we reduce it? What if, instead of storing one byte per tag, we stored multiple bytes per tag?
Advantages:

- Each tag now covers multiple bytes of data, so the per-byte overhead drops, and the tag itself gets smaller (shown next).
- A miss brings in neighboring bytes as well, which pays off when accesses have spatial locality (more on this below).

Let's say you have 16 memory addresses and originally a block size of 1. If you double the block size, the number of blocks is halved, and hence you can actually shave a bit off of the tag field!
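To make this concrete: with 16 addresses and a block size of 1, the tag must distinguish 16 blocks, so it needs $$\log_2(16) = 4$$ bits. Doubling the block size to 2 leaves only $$16 / 2 = 8$$ blocks, so the tag needs just $$\log_2(8) = 3$$ bits.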
ld r1 M[1]
This loads memory locations 0 and 1. Originally the cache is empty, so this is a miss. The cache then looks like:

| Tag | Data       |
|-----|------------|
| 0   | M[0], M[1] |

Notice how both M[0] and M[1] are loaded. Note that the tag is 0 - this is the block number of the first block of size 2 in memory.
Next, let's say you do:
ld r2 M[5]
This loads memory locations 4 and 5, and this is also a miss. The cache now looks like:

| Tag | Data       |
|-----|------------|
| 0   | M[0], M[1] |
| 2   | M[4], M[5] |

Again, the tag is the address divided by the block size (integer division): $$5 / 2 = 2$$.
Now, if you do:
ld r3 M[1]
This is a hit, since you have loaded the value before.
Here's where it gets interesting. Say you load M[4]. You haven't seen this address before, but it's in your cache since you loaded the block with M[5] in it already.
ld r4 M[4]
This is a hit.
Notice how when we accessed address 5, we also brought in address 4. This turned out to be a good thing, since we later referenced address 4 and found it in the cache.
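As a minimal sketch of this behavior (my own illustration in Python, not part of the lecture), the cache below is just a dictionary keyed by block number; capacity limits and replacement are ignored so that only the block-granularity lookup is visible:

```python
BLOCK_SIZE = 2
cache = {}  # block number (tag) -> the bytes of that block

def load(addr, memory):
    """Return the byte at addr, printing whether the access hit in the cache."""
    block = addr // BLOCK_SIZE            # which block the address falls into
    offset = addr % BLOCK_SIZE            # position of the byte inside the block
    if block in cache:
        print(f"ld M[{addr}]: hit  (block {block})")
    else:
        # Miss: bring in the whole block, not just the requested byte.
        start = block * BLOCK_SIZE
        cache[block] = memory[start:start + BLOCK_SIZE]
        print(f"ld M[{addr}]: miss (block {block} loaded)")
    return cache[block][offset]

memory = list(range(16))         # toy memory with 16 byte addresses
for addr in (1, 5, 1, 4):        # the access sequence from the example above
    load(addr, memory)
# Prints: miss, miss, hit, hit - the same pattern as the trace.
```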
Due to spatial locality, if you reference a memory location (e.g., 1000), you are likely to soon reference a location near it (e.g., 1001). This is in contrast to temporal locality, which says that you are likely to reference again something you referenced recently. These two forms of locality are the foundation of how caches work.
You have to store the tag and the block offset together. Let's say that you have 32 bits to do so. Let's look at a couple of scenarios to determine how big the tag needs to be, and how big the block offset needs to be.
If your blocks have a size of one byte, then you do not need to store a block offset whatsoever. The tag size is 32 bits, and block offset size is 0 bits.
If your blocks have a size of two bytes, then you have to store a single bit for the block offset, and you have to use the remaining bits for the tag. The tag size is then 31 bits, and block offset size is 1 bit.
In general, the block offset size is always going to be $$\log_2 (b)$$, where $$b$$ is your block size in bytes. The tag size will always be the number of bits for the address (tag + offset), minus the block offset size.
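Here is a quick Python sketch of that arithmetic (it assumes the block size is a power of two, so that $$\log_2(b)$$ is a whole number):

```python
import math

ADDRESS_BITS = 32  # tag + block offset together, as above

def field_sizes(block_size_bytes):
    """Return (tag_bits, offset_bits) for a given block size in bytes."""
    offset_bits = int(math.log2(block_size_bytes))
    tag_bits = ADDRESS_BITS - offset_bits
    return tag_bits, offset_bits

for b in (1, 2, 64):
    tag, off = field_sizes(b)
    print(f"block size {b:>2} B -> tag {tag} bits, offset {off} bits")
# block size  1 B -> tag 32 bits, offset 0 bits
# block size  2 B -> tag 31 bits, offset 1 bits
# block size 64 B -> tag 26 bits, offset 6 bits
```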
Where does the best equilibrium lie? How big should the block size be to maximize locality?
Let's say that you have a super huge block size. If the block size is on the scale of your cache, then you only have one cache line, so any access outside that single block misses. This is really inefficient. On the other hand, if your block size is one byte, you have to do a lot of repeated loads, since even neighboring bytes each require their own trip to memory. How do we decide the best block size?
It turns out, computer engineers just take a bunch of applications that they believe are representative of what their processor will run in real life, and they simulate the performance of different block sizes (and other cache parameters).
Most systems use a block size between 32 bytes and 128 bytes. These larger blocks reduce the overhead, since a single tag now covers much more data, and they take advantage of spatial locality on every miss.
When you load something from memory into a full cache, you have to evict a block to make room; with LRU replacement, you evict the least recently used one. To keep track of this, the hardware keeps a list of the cache lines ordered by how recently they were used. To store this ordering, you need some bits per cache line to record its position in the LRU list.
Since you need to store a position number for each cache line, this amounts to:

$$\log_2 (l)$$

bits per cache line, where $$l$$ is the number of cache lines you have.
Let's say that you have a cache with a total size of 8 bytes. The block size is 2 bytes, and it uses LRU replacement. The memory address size is 16 bits, and it is byte addressable.
You know that there are 16 bits for the memory address. Since the number of bits for the block offset is $$\log_2 (b) = \log_2 (2) = 1$$, there are 15 bits remaining for the tag.
Since there are 8 bytes for the cache's total size, and 2 bytes per block, there are 4 lines total.
Since you had three hits out of nine loads in total, this is a hit rate of about 33%.
You always need a valid bit (1 bit). You also need 15 bits for the tag, and since there are 4 lines, you need $$\log_2 (4) = 2$$ bits for LRU. This is a total of 18 bits of overhead per cache line.
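To see how the LRU ordering decides evictions, here is a small Python sketch of a fully associative cache with these parameters (4 lines, 2-byte blocks). The access trace at the bottom is a made-up one for illustration, not the nine-load trace from the example above.

```python
from collections import OrderedDict

NUM_LINES = 4      # 8-byte cache / 2-byte blocks
BLOCK_SIZE = 2

# OrderedDict remembers insertion order; moving a line to the end on every
# access keeps the least recently used line at the front.
lines = OrderedDict()   # tag -> block data (the data itself is not modeled)
hits = misses = 0

def access(addr):
    global hits, misses
    tag = addr // BLOCK_SIZE
    if tag in lines:
        hits += 1
        lines.move_to_end(tag)          # mark as most recently used
    else:
        misses += 1
        if len(lines) == NUM_LINES:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = None               # allocate the line for the new block

for addr in (0, 2, 4, 6, 0, 8, 2):      # hypothetical trace
    access(addr)

print(f"{hits} hits, {misses} misses")  # 1 hit, 6 misses
```

Note how the hit on M[0] right before the load of M[8] moves block 0 to the most-recently-used end, so when M[8] misses, LRU evicts block 1 (addresses 2-3) instead; that is why the final load of M[2] misses.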
Where should you write the result of a store? One of two things could happen.
Address is in cache: Let's say that your address is in the cache. Should you update memory as well as the cache? If you do, this is known as a write-through policy. There is also a write-back policy, in which memory is only updated later, when the block is evicted from the cache.
Address is not in cache: One option is to just write to memory and not bring it into the cache (no allocate-on-write policy). Another option is to put the line into the cache (allocate-on-write policy).
It turns out that programs often write to the same address over and over again. With a write-through policy, every time you update the line in the cache you have to propagate that change to memory, and that's going to take forever.
Can we also design the cache to NOT write all stores to memory immediately? We can keep the most recent copy in the cache and update the memory only when the data is evicted from the cache.
To keep track of this, you have to keep a dirty bit, which is reset when the line is allocated and set when the block is stored to. If a block is dirty when it is evicted, you write its data back to memory.
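Here is a rough Python sketch (my own illustration, with made-up helper names) of how the dirty bit drives a write-back, allocate-on-write cache: stores only mark the line dirty, and memory is updated when a dirty line is evicted.

```python
BLOCK_SIZE = 2

class Line:
    def __init__(self, tag, data):
        self.tag = tag
        self.data = data
        self.dirty = False               # reset when the line is allocated

cache = {}                               # tag -> Line (capacity/LRU omitted)
memory = bytearray(16)                   # toy 16-byte memory

def fill(tag):
    """Bring a block into the cache on a miss."""
    start = tag * BLOCK_SIZE
    cache[tag] = Line(tag, bytearray(memory[start:start + BLOCK_SIZE]))
    return cache[tag]

def store(addr, value):
    tag, offset = addr // BLOCK_SIZE, addr % BLOCK_SIZE
    line = cache.get(tag) or fill(tag)   # allocate-on-write on a miss
    line.data[offset] = value
    line.dirty = True                    # set when the block is stored to
    # Note: memory is NOT touched here (write-back, not write-through).

def evict(tag):
    line = cache.pop(tag)
    if line.dirty:                       # only dirty lines go back to memory
        start = tag * BLOCK_SIZE
        memory[start:start + BLOCK_SIZE] = line.data

store(5, 42)      # repeated stores to the same address touch only the cache...
store(5, 43)
evict(2)          # ...memory sees just the final value, and only on eviction
print(memory[5])  # 43
```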
Let's say you have a 32-bit memory address, a 64KB cache (useful data), a 64 byte cache block size, write-allocate, and write-back.
This cache will need 512 kilobits for the data area ($$64 \times 1024 = 65536$$ bytes, which is $$524288$$ bits, or 512 kilobits). Note that 1 kilobyte = 1024 bytes. Besides the actual cached data, this cache will need other storage.
You will need $$\log_2(64) = 6$$ bits for the block offset, and $$32 - 6 = 26$$ bits for the tag.
Since the cache holds 64 KB of data and each block is 64 bytes, there are $$65536 / 64 = 1024$$ lines.
We will also need a valid bit, a dirty bit, 26 tag bits, and 10 bits for LRU ($$\log_2(1024) = 10$$). This gives a total of 38 bits of overhead per cache line.
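A quick back-of-the-envelope script (just the same arithmetic in Python, under the assumptions above) to check these numbers:

```python
import math

ADDRESS_BITS = 32
CACHE_BYTES = 64 * 1024                          # 64 KB of useful data
BLOCK_BYTES = 64

offset_bits = int(math.log2(BLOCK_BYTES))        # 6
tag_bits = ADDRESS_BITS - offset_bits            # 26
num_lines = CACHE_BYTES // BLOCK_BYTES           # 1024
lru_bits = int(math.log2(num_lines))             # 10
overhead_per_line = 1 + 1 + tag_bits + lru_bits  # valid + dirty + tag + LRU = 38

data_kilobits = CACHE_BYTES * 8 // 1024          # 512 kilobits of data
overhead_kilobits = num_lines * overhead_per_line / 1024

print(overhead_per_line, data_kilobits, overhead_kilobits)  # 38 512 38.0
```

So the bookkeeping adds about 38 kilobits on top of the 512 kilobits of data, a bit over 7% extra storage.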
Suppose that accessing the cache takes 10ns, that going to main memory on a cache miss costs an additional 100ns (the miss penalty), and that the hit rate is 97%.
You need the equation for average memory access time (AMAT):
$$\text{AMAT} = \text{cache hit latency} + \text{cache miss rate} \times \text{miss penalty}$$
In this case, we have:
$$10\text{ns} + (1 - 0.97)(100\text{ns}) = 13\text{ns}$$
Now suppose you use a larger cache whose hit latency is 2ns higher, but whose hit rate rises to 98%:

$$\text{AMAT} = (10\text{ns} + 2\text{ns}) + (1 - 0.98)(100\text{ns}) = 14\text{ns}$$
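The same calculation as a tiny Python sketch (the latencies and hit rates are the ones used above):

```python
def amat(hit_latency_ns, hit_rate, miss_penalty_ns=100):
    """Average memory access time = hit latency + miss rate * miss penalty."""
    return hit_latency_ns + (1 - hit_rate) * miss_penalty_ns

print(f"{amat(10, 0.97):.1f} ns")  # 13.0 ns: smaller, faster cache
print(f"{amat(12, 0.98):.1f} ns")  # 14.0 ns: larger cache, slower hit, higher hit rate
```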
Note that this brings the entire lecture back to the original tradeoff: a larger cache gives you a higher hit rate, but at the cost of a longer access time on every hit. As this example shows, that can make the average access time worse overall.