Introduction to Computer Organization

Cache Organization: Direct-Mapped Caches

From last lecture, cache access looked like this:

  • You had an address
  • You fed it into the tag store (is the address in the cache? plus bookkeeping)
  • On a hit, you got the data from the data store

We also introduced the block size last lecture.

Fully Associative Caches

In a fully associative cache, a block of memory can be in any location in the cache. This means that you have to look everywhere to find anything.

So what's the problem?

  • You have to check every cache tag to determine whether the data is in the cache
  • This approach can be too slow and too costly
    • Parallel tag searches are expensive in hardware and can be slow, since the result of every comparison has to feed a very wide gate at the end, all within one clock cycle (see the sketch after this list).
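
As a rough illustration of the cost, here is a minimal Python sketch of a fully associative lookup; the function name, block size, and tag-store layout are assumptions made for the example, and in hardware every one of these tag comparisons happens in parallel rather than in a loop.

```python
# Minimal sketch of a fully associative lookup (illustrative only).
# In hardware, every comparison below happens in parallel, and all of the
# per-line results must be combined within one clock cycle.

BLOCK_SIZE = 128  # bytes, chosen to match the example later in these notes

def fa_lookup(tag_store, address):
    tag = address // BLOCK_SIZE              # block address; any line may hold it
    for line, (valid, stored_tag) in enumerate(tag_store):
        if valid and stored_tag == tag:      # one comparator per line in hardware
            return line                      # hit: read this line from the data store
    return None                              # miss: every entry had to be checked

# A 4-line tag store with one valid entry holding block 3.
tag_store = [(False, 0), (True, 3), (False, 0), (False, 0)]
print(fa_lookup(tag_store, 3 * BLOCK_SIZE + 5))  # -> 1 (hit)
print(fa_lookup(tag_store, 7 * BLOCK_SIZE))      # -> None (miss)
```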

Solution:

Direct-Mapped Cache

Instead of looking everywhere, you look in at most a single cache line. A direct-mapped cache partitions memory into as many regions as there are cache lines. Each memory region maps to exactly one cache line in which its data can be placed, so you only need to check a single tag: the one associated with the region the reference falls in.
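
As a sketch of that mapping, the split of an address into tag, index, and block offset can be written as follows; the 128-byte blocks and 256 lines are simply the parameters of the example later in these notes, not a requirement.

```python
# Address split for a direct-mapped cache (parameters chosen to match the
# 32 KB / 128 B example below; purely illustrative).

BLOCK_SIZE = 128   # bytes -> 7 offset bits
NUM_LINES  = 256   #       -> 8 index bits

def split_address(address):
    offset = address % BLOCK_SIZE                  # byte within the block
    index  = (address // BLOCK_SIZE) % NUM_LINES   # the single line to check
    tag    = address // (BLOCK_SIZE * NUM_LINES)   # remaining high-order bits
    return tag, index, offset

# Only the line at `index` needs its tag compared against `tag`.
print(split_address(0x12345))   # -> (2, 70, 69)
```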

This also means that LRU goes away. If a block isn't in the cache, you have to kick out whatever is currently in its cache line. Two blocks in memory that map to the same index cannot be present in the cache at the same time.

This can lead to a 0% hit rate if two or more blocks that map to the same index are accessed in an interleaved manner. Let's say A and B map to the same line. If the access pattern is:

$$\text{ABABABABABAB} \dots$$

This would be called a conflict miss. Even though the cache may have plenty of other lines free, A and B keep evicting each other.
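
A tiny simulation makes this concrete; A and B below are made-up block addresses that happen to map to the same index, and every access in the ABAB... stream misses.

```python
# Tiny direct-mapped simulation of the ABAB... pattern (illustrative).

NUM_LINES = 256
tags = [None] * NUM_LINES                    # tag store: one tag per line

def access(block_address):
    index = block_address % NUM_LINES
    tag   = block_address // NUM_LINES
    hit   = (tags[index] == tag)
    tags[index] = tag                        # on a miss, evict whatever was there
    return hit

A, B = 0, 256                                # both map to index 0
hits = sum(access(x) for x in [A, B] * 6)    # the ABABAB... stream
print(hits)                                  # -> 0: a 0% hit rate (all conflict misses)
```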

Example

Let's say you have 32-bit memory addresses (byte addressed), a direct-mapped 32 KB cache, a 128-byte block size, and a write-back policy.

What are the overheads for this cache?

32 KB / 128 B = 256 lines

A 128-byte block needs 7 offset bits and 256 lines need 8 index bits, so the tag is 32 - 7 - 8 = 17 bits. Per line:

1 valid bit + 1 dirty bit + 17 tag bits = 19 bits

Since we have 256 lines, that is 19 * 256 = 4864 bits of overhead for the whole cache.

4864 bits / (32 KB * 8 bits per byte) = 4864 / 262144, or about 1.9%.
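
The same arithmetic, spelled out as a short script (the constant names are just for illustration):

```python
import math

ADDRESS_BITS = 32
CACHE_BYTES  = 32 * 1024          # 32 KB
BLOCK_BYTES  = 128

num_lines   = CACHE_BYTES // BLOCK_BYTES                # 256
offset_bits = int(math.log2(BLOCK_BYTES))               # 7
index_bits  = int(math.log2(num_lines))                 # 8
tag_bits    = ADDRESS_BITS - offset_bits - index_bits   # 17

bits_per_line = 1 + 1 + tag_bits            # valid + dirty + tag = 19
overhead_bits = bits_per_line * num_lines   # 4864
data_bits     = CACHE_BYTES * 8             # 262144

print(num_lines, tag_bits, overhead_bits)   # 256 17 4864
print(f"{overhead_bits / data_bits:.1%}")   # 1.9%
```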

Instruction Cache

If all loads and stores are really fast, but it takes a long time to fetch each instruction, then what's the point? There are two choices:

  • Treat instruction fetches as normal data and allocate cache lines when fetched
  • Create a second cache (called the instruction cache or ICache) which caches instructions only
    • Advantage: Instruction fetch can work at the same time as memory access (pipelining)

Integrating Caches into a Pipeline

How are caches integrated into a pipelined implementation? First, replace instruction memory with ICache (I$), and replace data memory with DCache (D$).

There are some issues, though. First, memory accesses now have variable latency: when you miss in the data cache, you need to tell the pipeline to stall. You also need to insert noops, both for the ICache and for the MEM/WB stage, while a miss is serviced. Finally, both caches may miss at the same time. How would you fix that?

Another problem: let's say you do branch prediction, and a wrong-path fetch misses in the cache. That miss starts a memory access and stalls the pipeline. The pipeline needs to be able to throw away that address (and the access it started) so it can correctly continue execution after the misprediction is resolved.
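
Very roughly, the control decisions described above might look like the following sketch; the signal names and the exact policy are assumptions for illustration, not the lecture's actual design.

```python
# Rough per-cycle control sketch for the issues above (names and policy are
# assumptions, not a definitive implementation).

def stall_signals(icache_miss, dcache_miss, mispredicted_branch):
    stall_pc     = icache_miss or dcache_miss   # variable latency: hold the PC
    noop_to_id   = icache_miss or dcache_miss   # feed noops while the miss is serviced
    noop_to_wb   = dcache_miss                  # bubble into MEM/WB on a data miss
    squash_fetch = mispredicted_branch          # throw away the wrong-path fetch,
                                                # including any access it started
    return dict(stall_pc=stall_pc, noop_to_id=noop_to_id,
                noop_to_wb=noop_to_wb, squash_fetch=squash_fetch)

print(stall_signals(icache_miss=False, dcache_miss=True, mispredicted_branch=False))
```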

Example

The grinder application runs on the LC2k with full data forwarding and all branches predicted not-taken. It has the following instruction frequencies:

  • 45% R-type
  • 20% branches
  • 15% loads
  • 20% stores

In grinder, 40% of the branches are taken and 50% of the LWs are followed by an immediate use.

The I-cache has a miss rate of 3%, and the D-cache has a miss rate of 6% (no overlaps). On a miss, the main memory is accessed and has a latency of 100ns. The clock frequency is 500MHz.

What is the CPI of the grinder on LC2k?

First, you know that:

$$\text{CPI} = \text{baseline} + \text{stalls}$$

Since the main memory latency is 100 ns and the clock runs at 500 MHz, a miss costs 100 ns * 500 MHz = 50 stall cycles. The I-cache misses 3% of the time, at 50 cycles each. The D-cache misses 6% of the time, but only loads and stores access it, which is 35% of instructions, so D-cache misses happen on 0.06 * 0.35 = 2.1% of instructions. In total, 5.1% of instructions incur a 50-cycle delay, so the memory stall contribution to CPI is 0.051 * 50 = 2.55.

Since 20% of instructions are branches and 40% of those branches are taken, 40% of branches are mispredicted under predict not-taken. Each mispredict squashes the wrong-path instructions, costing 3 extra cycles (noops); this happens on 0.20 * 0.40 = 8% of instructions, so the branch stall contribution to CPI is 0.08 * 3 = 0.24.

Since 50% of LWs are followed by an immediate use, you have to insert a noop after half of the loads. This happens on 0.15 * 0.50 = 7.5% of instructions, with a 1-cycle delay each, so the load-use stall contribution to CPI is 0.075.

Therefore, the total CPI is:

$$\text{CPI} = 1 + (2.55 + 0.24 + 0.075) = 3.865 \approx 3.87$$
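
The whole CPI calculation, as a small script (values taken straight from the problem statement):

```python
CLOCK_HZ     = 500e6                     # 500 MHz
MEM_LAT_S    = 100e-9                    # 100 ns
MISS_PENALTY = MEM_LAT_S * CLOCK_HZ      # 50 cycles

f_rtype, f_branch, f_load, f_store = 0.45, 0.20, 0.15, 0.20

# Memory stalls: every fetch can miss in the I-cache; only loads/stores use the D-cache.
mem_stalls = (0.03 + 0.06 * (f_load + f_store)) * MISS_PENALTY   # 0.051 * 50 = 2.55

# Control stalls: predict not-taken, so the 40% taken branches cost 3 cycles each.
branch_stalls = f_branch * 0.40 * 3                              # 0.08 * 3 = 0.24

# Load-use stalls: half the loads need one bubble.
load_use_stalls = f_load * 0.50 * 1                              # 0.075

cpi = 1 + mem_stalls + branch_stalls + load_use_stalls
print(round(cpi, 3))                     # 3.865, i.e. about 3.87
```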