This lab will involve building a data cache simulator, and then determining the cache performance of three programs for a variety of cache designs.
The experiments will explore the effect of different choices for cache capacity, associativity, and block size (the length of a cacheline). Along the way, the lab may help explain the performance you observed in Lab 1, and to find out which implementation of sorting will interact better with the memory hierarchy.
You may do this lab in groups of two to four people. (The same work is required no matter what size the group.)
Before starting this lab, you should read Chapter 7 of the text.
The simulator should take parameters cs (the cache size, which is the amount of data that the cache can hold, not counting tags or dirty and valid bits), assoc (associativity), and bs (block size). cs and bs are measured in Bytes for this lab. (I apologize for not specifying units earlier.) You will probably find it most convenient to have the cache take these as runtime parameters. (You can look at ab.c from Lab1 to see how this is done.) Notice that if assoc=1, the cache is direct mapped, while if assoc = cs/bs, it is fully associative.
Your simulator should take as input a "trace" of memory references as produced by the three test programs. To get started, you might want to examine the tracing version of the AB benchmark to see the format of the trace produced by the test program.
Your cache is NOT required to hold the data that an actual cache would. It only needs to store the information necessary to determine the cache's performance would be. Thus, it needs to keep track of which addresses are stored in the cache at each point in time, and it will need to keep track of additional information about the lines in cache so that it can model the cache correctly.
Your simulator should implement the write back cache with write allocate. For set associative caches, it should use the least recently used replacement strategy. These terms are described at the end of the writeup.
You will turn in the following items:
Perform the following experiments and report the results. Important! For each result that your report, it must be clear in your write-up what the cache parameters you used were, that is, what values you used for cs, assoc, and bs.
For this first set of questions, for each "experiment" you should report the cache parameters you use, and the total number of reads in the trace, the total number of read that were cache misses, the total number of writes, the number of writes that were misses, and the number of times a dirty line was evicted from cache during the simulation (that is, the total number of lines that had to be written back to memory.)
For the next two of questions, use a 2-way set associative cache (assoc = 2) and total capacity 2KBytes. Suppose the miss_penalty is (12 + bs/8) cycles. Compute the "Memory-stall clock cycles" defined as miss_penalty x (number of read misses + number of write misses + number of dirty lines written back to memory). Pages 476 and 477 give some formulas that may be helpful.
Assume that 40% of the instructions for mergesort are either Reads or Writes. Also assume that if there are no memory stall cycles, then each instruction takes one cycle. We will explore the effect that associativity has on CPU time.
We now attempt to look at what was going on in Lab 1, but rather than using different sized problems with a single sized cache, you'll use a "size 10" problem on different sizes of cache.
Caches support two kinds of operations. Reads and writes. Either operation can result in a hit or a miss for a given address. So we can talk about write misses, write hits, read misses, and read hits. A good way to approach this lab is to think in terms of these four events. What steps does the cache need to do in each case to maintain itself? For example, data is ALWAYS brought into a cache from memory on a read miss. What does this mean as far as memory cycles are concerned, ...
Go through this process for each event, list the actions that have to happen, look for similarities across events. These will give you sections of code that be reused.
Try writing some high level pseudocode for the MemoryRead and MemoryWrite methods that takes advantage of the similarities you identified above.
One thing that can be confusing about this
lab is the chain of events that are being
simulated. This simulation involves three
entities:
CPU <----> CACHE <----> MEMORY
The cache is the middleman. All the traffic
between the CPU and memory goes through the cache.
The driver program acts as the CPU. Since our
simulation is not dealing with "real" data, we do
not have to simulate memory. The only thing we
need to know about it is the cost of a memory
access (read or write).
You need to keep clear in your mind the difference
between CPU-cache traffic and cache-memory traffic.
They are not necessarily the same. Consider the
following:
Initialize a counter to zero at the beginning of a program. Increment this counter each time you process another memory reference. For each cacheline that is stored in the (simulated) cache, record the current value of the counter each time that cacheline is referenced. Then, when you have to choose which of several lines to discard from the cache, you should choose the one corresponding to the samllest count.
Note: This strategy is simple, but it is limited by the type of variable used to hold the counter. Two-byte (short) integers are too small for some of our experiments, so you should use long or (better) unsigned long integers. It would be good programming style to check, each time the counter is about to be incremented, that the counter hasn't overflowed.