CSE 141L Lab # 1 - Time the AB benchmark

Due April 18 during class. Please see class homepage concerning late assignments and academic dishonesty policies.

This lab will involve timing a program for various size problem instances, and determining various performance measures of the program. The ultimate goal is to determine MFLOPS as a function of problem sizes, and to determine the maximum MIPS (millions of instructions executed per second) rate. But there's a lot of questions that need to be answered to get there.

You may do this first lab individually, or in groups of up to four people. If you have a groups of 3 or 4, you should do a little extra beyond the basic experiments listed below. This "extra" could be running the experiments using two different compilers, or to do them on an "unofficial" machine, or to do them on two different models of SUN workstations, or to try modifying the code to get better performance, or something else of your choosing. (Incidentally, you'll probably want a group of 3 or 4 people for later labs, so here's a chance to start working together.)

The program to be timed is ab.c, the "AB benchmark". The program was written for this class; it doesn't do anything useful, but is actually similar to a Fast Fourier Transform (FFT) caculation. FFT's are used extensively for computing such things as oilfield exploration, radar imaging, and modeling airflow over a car or airplane.

To run the benchmark, first compile the C program, for instance by using
gcc -O2 ab.c -o ab
You may use any "C" compiler you want. The compiled program takes two parameters, "problem-size" and "repetitions". The size parameter, which can be any value between 1 and 21, determines how big a problem is run. The "repetitions" parameter simply repeats the computation multiple times, which will enable you to time the small problems. For instance, running "ab 11 1000" will run a problem of size 11 for 1000 iterations. If this takes, say, 8 seconds, then the time for one iteration is about .008 seconds. WARNING: Large values may take excessive running time depending on your computer. You are not required to make any individual runs that take more than 1 minute of CPU time.

The "official computer" for the lab is a Sun, such as one of the ones in the uAPE lab, but you may use any computer you want. In fact, it would be interesting to have class members use a variety of machines. BUT, if you use an "unofficial" machine, I can't promise that everything will work smoothly. You can also use any compiler and other tools you want, but be sure to say what you used. And if you want, you can write a shell script to automate the running of the experiments for the last part, though it's not necessary.

You should collect information, perform experiments, and do the calculations to answer the questions that following. The write-up should contain enough information so that someone else could duplicate you experiments.

  1. What brand and model computer are you using?
  2. What is the processor's clock rate (in MHz)?
    On Sun's, the unix command "fpversion" may give the answer. Better yet would be to find some documentation of the clock speed. If the processor has a separate floating-point accelerator, give its clock rate, too.
  3. What result does "ab 10 100" produce?
    I'm curious whether all computer give the same answer. It's quite possible they don't, since not all machines do floating point arithmetic the same.
  4. What timers are you using?
    You need to find a timer that can gives the wallclock time, and one that gives user CPU time. On the Sun's, both are given by the "time" command. When you issue "time ab 11 1000", your progam (ab) will be run, and then you'll get a line like "3.0u 0.0s 0:03 95% ...". This says the program took 3.0 user CPU seconds, 0.0 system CPU seconds, the wallclock time was 0 minutes and 3 seconds, and the CPU was working on your program 95% of the time. To find out the name of your timer, the unix "which" command might work. For instance, typing "which time" on my machine reports "/usr/local/GNU/bin/time". There may be better timers, for instance, on my maachine, /usr/bin/time has better granularity than "time".
  5. What is the granularity of your timers?
    The question here is to figure out what the smallest non-zero unit of time is that the CPU timer can measure, and the smallest non-zero time that the wallclock timer can measure. You can do this by experimenting with a small problem size with different numbers of repetitions. (If either timer has a granularity of less than 1 microsecond, just report "Less than 1 us".)
    From now on, use enough repetitions to get answers within plus or minus 10%. For instance, if the granularity of the user CPU timer is 1 second, you need to run enough repetitions so the reported time is at least 10 seconds. Also, you need not run any experiments that would take longer than 1 minute of CPU time (though you can if you want to.)
  6. How much variation is there in your timings? Run the same experiment 7 times, and report the 7 resulting CPU times and wallclock times. Also, describe the results (e.g. "The wallclock times were all within 5% of each other except for one that was much bigger.")
  7. Which type of time will you use for the rest of the lab, and why? Depending on the results of your experiments so far, you probably will find that one type of time is more convenient or more reliable than the other.
    From now on, you only need to use one type of time (CPU or wallclock).
  8. How much improvement do compiler optimizations give on a problem of size 10?
    First compile the program with no optimizations (e.g., use gcc -O0 ) and time it, then compile it with a high level of optimization (e.g. using the "-O2" or "-fast" option) and time the program. Determine how much faster the program runs when compiled with optimizations. Also tell what compiler and options you used in the experiment.
    From now on, only use the optimized compiled code.
  9. How many floating point operations are executed as a function of the "size" parameter?
    You should be able to derive a formula just from reading the code, but you might want to put a counter in the inner loop to check. The answer should be a formula giving the number of floating point operations as a function of size.
  10. Make a table (or plot a graph) of MFLOPS versus problem size.
    For problem sizes from 1 to however far you can go while keeping the running time reasonable (less than 1 minute CPU time per run), compute the MFLOPS rate achieved by the program.
  11. What is the best MFLOPS rate you observed, and on what size problem? What percent is this of the machine's "peak MFLOPS" rate?
    If you don't know what the manufacturer claims as "peak MFLOPS", then compare your result to the performance one would get if the machine executed one floating point operation per cycle.
  12. How many instructions are executed per iteration of the inner loop?
    To find out, you need to produce an assembly listing (perhaps by compiling with the -S option ... and don't forget to use optimizations), and then locate the inner loop in the assembly code and count how many instructions there are between the label at the beginning of the loop and the last instruction of the loop. This can be tricky for a variety of reasons I'll go over in class. You should hand in the relevant part of the assembly listing (NOT the entire listing), with the inner loop start and end indicated.
  13. For the problem sizes where you observed the best and worst speeds, approximately how many MIPS are you getting?
    To make life simple, don't bother counting the instructions that aren't in the inner loop (Their contribution to the number of instructions executed is probably small, but it depends on the problem size and your compiler.)
  14. What is the best and worst CPI (cycles per instruction) you observed?
  15. If you did extra work (see note at the beginning about groups of three or four), what conclusions can you make from your extra experiments?
Remember -- this is supposed to be fun!