CSE 141L Lab # 2 - Cache simulations

Due April 29, hopefully in 141 class, but no later than 5:00 PM. Please see class homepage concerning late assignments and academic dishonesty policies.

This lab will involve building a data cache simulator, and then determining the cache performance of three programs for a variety of cache designs.

The experiments will explore the effect of different choices for cache capacity, associativity, and block size (the length of a cacheline). Along the way, the lab may help explain the performance you observed in Lab 1, and to find out which implementation of sorting will interact better with the memory hierarchy.

You may do this lab in groups of two to four people. (The same work is required no matter what size the group.)

Before starting this lab, you should read Chapter 7 of the text.

The simulator should take parameters cs (the cache size, which is the amount of data that the cache can hold, not counting tags or dirty and valid bits), assoc (associativity), and bs (block size). cs and bs are measured in Bytes for this lab. (I apologize for not specifying units earlier.) You will probably find it most convenient to have the cache take these as runtime parameters. (You can look at ab.c from Lab1 to see how this is done.) Notice that if assoc=1, the cache is direct mapped, while if assoc = cs/bs, it is fully associative.

Your simulator should take as input a "trace" of memory references as produced by the three test programs. To get started, you might want to examine the tracing version of the AB benchmark to see the format of the trace produced by the test program.

Your cache is NOT required to hold the data that an actual cache would. It only needs to store the information necessary to determine the cache's performance would be. Thus, it needs to keep track of which addresses are stored in the cache at each point in time, and it will need to keep track of additional information about the lines in cache so that it can model the cache correctly.

Your simulator should implement the write back cache with write allocate. For set associative caches, it should use the least recently used replacement strategy. These terms are described at the end of the writeup.

You will turn in the following items:

The three programs are
Please read and follow all instructions carefully! If anything isn't clear, please ask.

Perform the following experiments and report the results. Important! For each result that your report, it must be clear in your write-up what the cache parameters you used were, that is, what values you used for cs, assoc, and bs.


Testing the simulator's correctness.

For this first set of questions, for each "experiment" you should report the cache parameters you use, and the total number of reads in the trace, the total number of read that were cache misses, the total number of writes, the number of writes that were misses, and the number of times a dirty line was evicted from cache during the simulation (that is, the total number of lines that had to be written back to memory.)

  1. You won't do very well if your cache simulator is giving wrong answers. To test your simulator, try it for a cache that holds only one word (that is, cs=4 Bytes) on the heapsort benchmark. Report the results, and briefly explain why the answers are approximately correct.
  2. Now do the same thing with a cache that is large enough to hold all of the data referenced by the heapsort program in a single block.
  3. And again for a cache that has bs=4 Bytes, but cs is big enough to hold all of the data.
  4. Now (so we can see if you are getting the right answers for some other parameter choices) report your results for heapsort for these parameter choices:
    The effect of block size.

    For the next two of questions, use a 2-way set associative cache (assoc = 2) and total capacity 2KBytes. Suppose the miss_penalty is (12 + bs/8) cycles. Compute the "Memory-stall clock cycles" defined as miss_penalty x (number of read misses + number of write misses + number of dirty lines written back to memory). Pages 476 and 477 give some formulas that may be helpful.

  5. For merge sort, find the memory-stall clock cycles for all block sizes from 8 Bytes to the largest size that is possible. Report your results. Which is the best block size?
  6. Repeat the above for heap sort.
    The effect of associativity

    Assume that 40% of the instructions for mergesort are either Reads or Writes. Also assume that if there are no memory stall cycles, then each instruction takes one cycle. We will explore the effect that associativity has on CPU time.

  7. How many "CPU execution clock cycles" does mergesort have?
  8. Suppose that the Clock cycle time is (10 + log-base-2(assoc)) nanoseconds. For all possible power-of-2 choices of associativity for a cache with cs = 2048 Bytes and bs = 64 Bytes, determine the CPU time of mergesort. Assume the Miss penalty is 20 cycles.
    The effect of cache size

    We now attempt to look at what was going on in Lab 1, but rather than using different sized problems with a single sized cache, you'll use a "size 10" problem on different sizes of cache.

  9. For caches with bs = 64 Bytes and assoc = 1, try different values of cache size from 256 Bytes to 8 KBytes to see how many memory-stall clock cycles there are for abtrace. Assume, as before, that miss_penalty = 20.
    What have you learned?

  10. Based on the results of your experiments, answer these questions: End of questions.

Helpful (we hope) discussion

Caches support two kinds of operations. Reads and writes. Either operation can result in a hit or a miss for a given address. So we can talk about write misses, write hits, read misses, and read hits. A good way to approach this lab is to think in terms of these four events. What steps does the cache need to do in each case to maintain itself? For example, data is ALWAYS brought into a cache from memory on a read miss. What does this mean as far as memory cycles are concerned, ...

Go through this process for each event, list the actions that have to happen, look for similarities across events. These will give you sections of code that be reused.

Try writing some high level pseudocode for the MemoryRead and MemoryWrite methods that takes advantage of the similarities you identified above.

One thing that can be confusing about this lab is the chain of events that are being simulated. This simulation involves three entities:
CPU <----> CACHE <----> MEMORY
The cache is the middleman. All the traffic between the CPU and memory goes through the cache. The driver program acts as the CPU. Since our simulation is not dealing with "real" data, we do not have to simulate memory. The only thing we need to know about it is the cost of a memory access (read or write). You need to keep clear in your mind the difference between CPU-cache traffic and cache-memory traffic. They are not necessarily the same. Consider the following:


The textbook mentions the least recently used (LRU) replacement strategy, but does not give much detail about how to implement it. Here is one strategy you can use in your simulator.

Initialize a counter to zero at the beginning of a program. Increment this counter each time you process another memory reference. For each cacheline that is stored in the (simulated) cache, record the current value of the counter each time that cacheline is referenced. Then, when you have to choose which of several lines to discard from the cache, you should choose the one corresponding to the samllest count.

Note: This strategy is simple, but it is limited by the type of variable used to hold the counter. Two-byte (short) integers are too small for some of our experiments, so you should use long or (better) unsigned long integers. It would be good programming style to check, each time the counter is about to be incremented, that the counter hasn't overflowed.