26.1-10 For what number of processors do the two versions of the chess program run equally fast, assuming that TP = T1/P + T∞? A chess lesson To illustrate the power of work/span analysis, this section closes with a true story that occurred during the development of one of the first world-class parallel chess-playing programs [106] many years ago. The timings below have been simplified for exposition. The chess program was developed and tested on a 32-processor computer, but it was designed to run on a supercomputer with 512 processors. Since the supercomputer availability was limited and expensive, the developers ran benchmarks on the small computer and extrapolated performance to the large computer. At one point, the developers incorporated an optimization into the program that reduced its running time on an important benchmark on the small machine from T32 = 65 seconds to seconds. Yet, the developers used the work and span performance measures to conclude that the optimized version, which was faster on 32 processors, would actually be slower than the original version on the 512 processors of the large machine. As a result, they abandoned the “optimization.” Here is their work/span analysis. The original version of the program had work T1 = 2048 seconds and span T∞= 1 second. Let’s treat inequality (26.4) on page 760 as the equation TP = T1/P + T∞, which we can use as an approximation to the running time on P processors. Then indeed we have T32 = 2048/32 + 1 = 65. With the optimization, the work becomes T′1 = 1024 seconds, and the span becomes T′∞ = 8 seconds. Our approximation gives T′32 = 1024/32 + 8 = 40. The relative speeds of the two versions switch when we estimate their running times on 512 processors, however. The first version has a running time of T512 = 2048/512+1 = 5 seconds, and the second version runs in seconds. The optimization that speeds up the program on 32 processors makes the program run for twice as long on 512 processors! The optimized version’s span of 8, which is not the dominant term in the running time on 32 processors, becomes the dominant term on 512 processors, nullifying the advantage from using more processors. The optimization does not scale up. The moral of the story is that work/span analysis, and measurements of work and span, can be superior to measured running times alone in extrapolating an algorithm’s scalability.
26.1-10
For what number of processors do the two versions of the chess
A chess lesson
To illustrate the power of work/span analysis, this section closes with a
true story that occurred during the development of one of the first
world-class parallel chess-playing programs [106] many years ago. The
timings below have been simplified for exposition.
The chess program was developed and tested on a 32-processor
computer, but it was designed to run on a supercomputer with 512
processors. Since the supercomputer availability was limited and
expensive, the developers ran benchmarks on the small computer and
extrapolated performance to the large computer.
At one point, the developers incorporated an optimization into the
program that reduced its running time on an important benchmark on
the small machine from T32 = 65 seconds to seconds. Yet, the
developers used the work and span performance measures to conclude
that the optimized version, which was faster on 32 processors, would
actually be slower than the original version on the 512 processors of the
large machine. As a result, they abandoned the “optimization.”
Here is their work/span analysis. The original version of the program
had work T1 = 2048 seconds and span T∞= 1 second. Let’s treat
inequality (26.4) on page 760 as the equation TP = T1/P + T∞, which
we can use as an approximation to the running time on P processors.
Then indeed we have T32 = 2048/32 + 1 = 65. With the optimization,
the work becomes T′1 = 1024 seconds, and the span becomes T′∞ = 8
seconds. Our approximation gives T′32 = 1024/32 + 8 = 40.
The relative speeds of the two versions switch when we estimate their
running times on 512 processors, however. The first version has a
running time of T512 = 2048/512+1 = 5 seconds, and the second version
runs in seconds. The optimization that speeds up
the program on 32 processors makes the program run for twice as long
on 512 processors! The optimized version’s span of 8, which is not the
dominant term in the running time on 32 processors, becomes the
dominant term on 512 processors, nullifying the advantage from using
more processors. The optimization does not scale up.
The moral of the story is that work/span analysis, and measurements
of work and span, can be superior to measured running times alone in
extrapolating an
Step by step
Solved in 3 steps with 1 images