In a fully associative cache, a memory block can be placed in any cache line. In a direct-mapped cache, each memory block maps to exactly one cache line; you can think of memory as partitioned into regions, where all blocks that map to the same line share a "color."
In a direct-mapped cache, the address is broken into three components: the tag, the line index, and the block offset.
The tag store has a decoder which takes the line index and selects a single stored tag. That stored tag is compared against the tag bits of the address to determine hit or miss. In parallel, the same index selects the cache line we're looking for in the data store, and a MUX driven by the block offset picks out which bytes of the data block you want. The lookup returns a hit or miss, along with the data you are looking for.
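To make the lookup concrete, here is a minimal sketch of a direct-mapped lookup in C. The parameters (64 lines of 64-byte blocks, 32-bit addresses) and all names are hypothetical, chosen only for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical direct-mapped cache: 64 lines x 64-byte blocks (4 KB total). */
#define NUM_LINES   64
#define BLOCK_SIZE  64
#define OFFSET_BITS 6   /* log2(BLOCK_SIZE) */
#define INDEX_BITS  6   /* log2(NUM_LINES)  */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Look up one byte. Returns true on a hit and writes the byte to *out. */
bool dm_lookup(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr & (BLOCK_SIZE - 1);                  /* block offset   */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1);  /* line index     */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);       /* remaining bits */

    cache_line_t *line = &cache[index];        /* decoder: index -> one line      */
    if (line->valid && line->tag == tag) {     /* tag comparison -> hit or miss   */
        *out = line->data[offset];             /* MUX: offset selects the byte    */
        return true;
    }
    return false;                              /* miss: go to the next level      */
}
```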
Set-associative caches have the advantages of both direct-mapped caches and fully associative caches. They partition memory into regions (fewer than in a direct-mapped cache) and associate each region with a set of cache lines. To determine a hit, the tags of all lines in the set are checked.
Another way to think about this: each set is treated like a small, fully associative cache. The tag search within a set can be done in parallel, and this still allows for an LRU replacement policy within the set.
Within each cache set there are multiple ways; each way holds one block, so an N-way set can hold N different blocks that map to the same index.
With each block in each way, you store the tag of the address it came from, so the cache can tell which of the candidate memory blocks is actually present.
First, you use the set index to select which set you want. Rather than looking at one tag, you look at two tags, one in each way, and compare each against the tag bits of the address to find which one (if either) matches. The same set index reads the data blocks out of both ways of the data store; a MUX driven by the tag comparison chooses the matching way, and a second MUX driven by the block offset selects the bytes you want.
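Here is a rough sketch of the same idea for a 2-way set-associative lookup, again with made-up parameters (32 sets, 2 ways, 64-byte blocks). Hardware compares the way tags in parallel; the sketch just loops over them:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical 2-way set-associative cache: 32 sets x 2 ways x 64-byte blocks. */
#define NUM_SETS    32
#define NUM_WAYS    2
#define BLOCK_SIZE  64
#define OFFSET_BITS 6   /* log2(BLOCK_SIZE) */
#define INDEX_BITS  5   /* log2(NUM_SETS)   */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
} way_t;

static way_t cache[NUM_SETS][NUM_WAYS];

/* Look up one byte: check the tags of both ways in the selected set. */
bool sa_lookup(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr & (BLOCK_SIZE - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    for (int w = 0; w < NUM_WAYS; w++) {         /* in hardware: parallel compares */
        way_t *line = &cache[set][w];
        if (line->valid && line->tag == tag) {   /* matching way found -> hit      */
            *out = line->data[offset];           /* offset MUX selects the byte    */
            return true;
        }
    }
    return false;                                /* no way matched -> miss         */
}
```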
So now that we have this wonderful sparkling unicorn of a cache system, how could it ever go wrong? There are three ways.
Capacity misses (the working set doesn't fit): reduce these by increasing the cache size.
Conflict misses (too many blocks compete for the same set): reduce these by increasing the associativity.
Compulsory (cold) misses (the first access to a block can never hit): reduce these by increasing the block size, so each miss brings in more nearby data.
In addition to the three methods above, you can also improve the replacement policy (which block to evict on a miss); there has been a lot of research on this.
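As one concrete example, here is a minimal sketch of per-set LRU bookkeeping, the kind of policy mentioned earlier. The age-counter scheme and the 4-way parameters are illustrative assumptions, not a specific design from these notes:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_WAYS 4

/* Per-set LRU state: one age counter per way (0 = most recently used). */
typedef struct {
    bool     valid[NUM_WAYS];
    uint32_t tag[NUM_WAYS];
    uint8_t  age[NUM_WAYS];
} lru_set_t;

/* On an access, make way `used` the most recently used. */
static void lru_touch(lru_set_t *set, int used)
{
    for (int w = 0; w < NUM_WAYS; w++)
        if (set->age[w] < set->age[used])
            set->age[w]++;           /* everything younger than `used` ages by one */
    set->age[used] = 0;
}

/* On a miss, pick a victim: an invalid way if one exists, else the oldest way. */
static int lru_victim(const lru_set_t *set)
{
    int oldest = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set->valid[w])
            return w;
        if (set->age[w] > set->age[oldest])
            oldest = w;
    }
    return oldest;
}
```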
Cache size is the total data capacity (not including tags). A bigger cache can exploit temporal locality better, but bigger is not always better: the larger the cache, the slower it is to access, so an oversized cache hurts both hit and miss latency. Smaller is faster; bigger is slower.
The working set is the set of data you're using "now-ish."
A good cache size is a happy medium that roughly matches the working set size: you capture most of the data you need in the cache, even if not all of it.
Block size is the amount of data associated with one tag. If blocks are too small, the cache doesn't exploit spatial locality well and the tag overhead is larger. If blocks are too big, there are fewer total blocks, and each miss likely transfers a lot of unused data into and out of the cache. There is a sweet spot, typically around 32B-128B, where the hit rate peaks.
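To see the tag-overhead side of this trade-off, here is a small back-of-the-envelope calculation, assuming a hypothetical 64 KB direct-mapped data store and 32-bit addresses (made-up numbers): halving the block size doubles the number of tags you have to store.

```c
#include <stdio.h>

/* Integer log2 for powers of two. */
static unsigned ilog2(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    const unsigned cache_bytes = 64 * 1024;   /* fixed data capacity        */
    const unsigned addr_bits   = 32;          /* assumed address width      */

    for (unsigned block = 8; block <= 256; block *= 2) {
        unsigned lines       = cache_bytes / block;      /* one tag per block */
        unsigned offset_bits = ilog2(block);
        unsigned index_bits  = ilog2(lines);
        unsigned tag_bits    = addr_bits - offset_bits - index_bits;
        unsigned tag_bytes   = lines * tag_bits / 8;     /* rough tag storage */
        printf("block %4uB: %5u lines, %2u tag bits, ~%5u B of tags\n",
               block, lines, tag_bits, tag_bytes);
    }
    return 0;
}
```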
Associativity answers the question: how many blocks can we map to the same index (or set)?
With larger associativity, you get a lower miss rate and less variation among programs, but with diminishing returns. With smaller associativity, you get lower cost and a faster hit time, which matters most for L1 caches. You always want the block size and the number of sets to be powers of 2 so that the offset, index, and tag fields are simple bit ranges of the address.
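For example, here is how the address fields fall out when everything is a power of 2, using hypothetical parameters (a 32 KB, 4-way set-associative cache with 64 B blocks and 32-bit addresses):

```c
#include <stdio.h>

/* Integer log2 for powers of two. */
static unsigned ilog2(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    const unsigned cache_bytes = 32 * 1024;   /* total data capacity     */
    const unsigned block_bytes = 64;          /* block size              */
    const unsigned ways        = 4;           /* associativity           */
    const unsigned addr_bits   = 32;          /* assumed address width   */

    unsigned sets        = cache_bytes / (block_bytes * ways);  /* 128 sets */
    unsigned offset_bits = ilog2(block_bytes);                  /* 6 bits   */
    unsigned index_bits  = ilog2(sets);                         /* 7 bits   */
    unsigned tag_bits    = addr_bits - index_bits - offset_bits;/* 19 bits  */

    printf("sets=%u  offset=%u bits  index=%u bits  tag=%u bits\n",
           sets, offset_bits, index_bits, tag_bits);
    return 0;
}
```

Because every size is a power of 2, no division or remainder is needed in hardware: the offset, index, and tag are just contiguous slices of the address bits.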