CSE 141L Lab # 1 - Time the AB benchmark
Due April 18 during class. Please see
class homepage concerning
late assignments and academic dishonesty policies.
This lab will involve timing a program for various size problem
instances, and determining various performance measures of the
program. The ultimate goal is to determine MFLOPS as a function
of problem sizes, and to determine the maximum MIPS (millions of
instructions executed per second) rate. But there's a lot of questions
that need to be answered to get there.
You may do this first lab individually, or in groups of up to four
people. If you have a groups of 3 or 4, you should do a little extra
beyond the basic experiments listed below. This "extra" could be
running the experiments using two different compilers, or to do them
on an "unofficial" machine, or to do them on two different models of
SUN workstations, or to try modifying the code to get better
performance, or something else of your choosing. (Incidentally,
you'll probably want a group of 3 or 4 people for later labs, so
here's a chance to start working together.)
The program to be timed is ab.c, the "AB
benchmark". The program was written for this class; it doesn't do
anything useful, but is actually similar to a Fast Fourier Transform
(FFT) caculation. FFT's are used extensively for computing such things
as oilfield exploration, radar imaging, and modeling airflow over a
car or airplane.
To run the benchmark, first compile the C program, for instance by
using
gcc -O2 ab.c -o ab
You may use any "C"
compiler you want. The compiled program takes two parameters,
"problem-size" and "repetitions". The size parameter, which can be any
value between 1 and 21, determines how big a problem is run. The
"repetitions" parameter simply repeats the computation multiple times,
which will enable you to time the small problems. For instance,
running "ab 11 1000" will run a problem of size 11 for 1000
iterations. If this takes, say, 8 seconds, then the time for one
iteration is about .008 seconds. WARNING: Large values may
take excessive running time depending on your computer. You
are not required to make any individual runs that take more than 1
minute of CPU time.
The "official computer" for the lab is a Sun, such as one of the
ones in the uAPE lab, but you may use any computer you want. In fact,
it would be interesting to have class members use a variety of
machines. BUT, if you use an "unofficial" machine, I can't promise
that everything will work smoothly. You can also use any compiler and
other tools you want, but be sure to say what you used. And if you
want, you can write a shell script to automate the running of the
experiments for the last part, though it's not necessary.
You should collect information, perform experiments, and
do the calculations to answer the questions that following.
The write-up should contain enough information so that someone
else could duplicate you experiments.
- What brand and model computer are you using?
- What is the processor's clock rate (in MHz)?
On Sun's, the unix command "fpversion" may give the answer.
Better yet would be to find some documentation of the clock speed.
If the processor has a separate floating-point accelerator, give
its clock rate, too.
- What result does "ab 10 100" produce?
I'm curious whether all computer give the same answer.
It's quite possible they don't, since not all machines do floating point
arithmetic the same.
- What timers are you using?
You need to find
a timer that can gives the wallclock time, and one that gives user CPU
time. On the Sun's, both are given by the "time" command. When you
issue "time ab 11 1000", your progam (ab) will be run, and then you'll
get a line like "3.0u 0.0s 0:03 95% ...". This says the program took
3.0 user CPU seconds, 0.0 system CPU seconds, the wallclock time was 0
minutes and 3 seconds, and the CPU was working on your program 95% of
the time. To find out the name of your timer, the unix "which" command
might work. For instance, typing "which time" on my machine reports
"/usr/local/GNU/bin/time". There may be better timers, for instance,
on my maachine, /usr/bin/time has better granularity than "time".
- What is the granularity of your timers?
The
question here is to figure out what the smallest non-zero unit of time
is that the CPU timer can measure, and the smallest non-zero time that
the wallclock timer can measure. You can do this by experimenting with
a small problem size with different numbers of repetitions. (If either
timer has a granularity of less than 1 microsecond, just report "Less than 1
us".)
From now on, use enough repetitions to get answers within
plus or minus 10%. For instance, if the granularity of the
user CPU timer is 1 second, you need to run enough repetitions so the
reported time is at least 10 seconds. Also, you need not run any
experiments that would take longer than 1 minute of CPU time (though
you can if you want to.)
- How much variation is there in your timings? Run
the same experiment 7 times, and report the 7 resulting CPU
times and wallclock times. Also, describe the results
(e.g. "The wallclock times were all within 5% of each other except for
one that was much bigger.")
- Which type of time will you use for the rest of the lab, and
why?
Depending on the results of your experiments so far, you probably
will find that one type of time is more convenient or more reliable
than the other.
From now on, you only need to use one type of time (CPU or
wallclock).
- How much improvement do compiler optimizations give
on a problem of size 10?
First compile the program with no optimizations (e.g., use gcc -O0 )
and time it, then compile it with a high level of optimization (e.g.
using the "-O2" or "-fast" option) and time the program. Determine how much
faster the program runs when compiled with optimizations.
Also tell what compiler and options you used in the experiment.
From now on, only use the optimized compiled code.
- How many floating point operations are executed as a
function of the "size" parameter?
You should be able to
derive a formula just from reading the code, but you might want to put
a counter in the inner loop to check. The answer should be a formula
giving the number of floating point operations as a function of size.
- Make a table (or plot a graph) of MFLOPS versus
problem size.
For problem sizes from 1 to however far you can go while keeping
the running time reasonable (less than 1 minute CPU time per run),
compute the MFLOPS rate achieved by the program.
- What is the best MFLOPS rate you observed, and on what
size problem? What percent is this of the machine's "peak MFLOPS" rate?
If you don't know what the manufacturer claims as "peak MFLOPS", then
compare your result to the performance one would get if the machine
executed one floating point operation per cycle.
- How many instructions are executed per iteration of the
inner loop?
To find out, you need to produce an assembly
listing (perhaps by compiling with the -S option ... and don't forget
to use optimizations), and then locate the inner loop in the assembly
code and count how many instructions there are between the label at
the beginning of the loop and the last instruction of the loop. This
can be tricky for a variety of reasons I'll go over in
class. You should hand in the relevant part of the assembly
listing (NOT the entire listing), with the inner loop start and end
indicated.
- For the problem sizes where you observed the best and
worst speeds, approximately how many MIPS are you getting?
To make life simple, don't bother counting the instructions that
aren't in the inner loop (Their contribution to the number of
instructions executed is probably small, but it depends on the problem
size and your compiler.)
- What is the best and worst CPI (cycles per instruction)
you observed?
- If you did extra work (see note at the beginning about
groups of three or four), what conclusions can you make from your
extra experiments?
Remember -- this is supposed to be fun!