FDTD CUBE GRID BENCHMARK

A computation speed analysis of an FDTD (finite-difference time-domain) algorithm benchmark was run on several computer systems over a series of problem sizes. The computational grid in each case is a cube, equally sized in all three dimensions. The cube size ranged from 20x20x20 (0.5 megabytes, single precision) up to 150x150x150 (231 megabytes, single precision). The benchmark was run in single precision (4-byte) on the R12000, the only tested system that supports it, and in double precision (8-byte) on all other systems.

The FDTD Grid

An FDTD simulation grid is a rectangular volume of arbitrary dimensions. The grid is structured and regular (sometimes called an ijk grid), with six field components stored for each grid point. Each field component update requires eight floating point operations and reads seven values during the calculation. Thus, a 20x20x20 grid (the smallest benchmark grid size) contains 8000 cells, or 48,000 field components. Using 4-byte values (single precision), this grid requires 576,000 bytes of memory; using 8-byte values (double precision), it requires 1,152,000 bytes. One complete update of this grid requires 329,232 floating point operations and 288,078 values to be fetched from memory, equivalent to 1.1 megabytes of memory traffic per time step in single precision, or 2.2 megabytes in double precision.
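
As a concrete illustration, a minimal C sketch of one field-component update is shown below. This is not the benchmark source; the array names, the coefficient arrays ca/cb, and the loop bounds are assumptions chosen so that each assignment performs the eight floating point operations and reads the seven array values described above.

  #include <stddef.h>

  /* Hypothetical sketch of one FDTD field-component update (illustrative,
   * not the benchmark source).  With the constant inverse grid spacings
   * dyi/dzi held in registers, each assignment performs 8 floating point
   * operations and reads 7 array values. */
  void update_ex(int nx, int ny, int nz,
                 float *ex, const float *hy, const float *hz,
                 const float *ca, const float *cb,
                 float dyi, float dzi)
  {
  #define IDX(i, j, k) ((size_t)(i) * ny * nz + (size_t)(j) * nz + (k))
      for (int i = 1; i < nx; i++)
          for (int j = 1; j < ny; j++)
              for (int k = 1; k < nz; k++) {
                  size_t c = IDX(i, j, k);
                  ex[c] = ca[c] * ex[c]
                        + cb[c] * ((hz[c] - hz[IDX(i, j - 1, k)]) * dyi
                                 - (hy[c] - hy[IDX(i, j, k - 1)]) * dzi);
              }
  #undef IDX
  }

The remaining five field components are updated by analogous loops, giving the 48 operations per computed cell per time step implied by the totals above.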

Benchmark Results Graph
[Graph: sustained MFLOPS versus cube size for each tested system. Summary values (MFLOPS) below.]

                  SV1/MSP   T90   C90   SV1/SSP   R12000   J90se
Maximum (MFLOPS)      928   867   494       249      227     135
Average (MFLOPS)      727   588   355       185      145     111
Minimum (MFLOPS)      332   188   113        82       2*      58

* If the five worst-case points are removed from the graph, the minimum performance of the R12000 improves to 101 MFLOPS. The average then improves to 150 MFLOPS.
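
The values in the table are sustained MFLOPS. As a hedged sketch of how such a figure relates to the operation counts given earlier (the time-step count and run time below are assumed example values, not benchmark measurements):

  #include <stdio.h>

  /* Hedged sketch: derive a sustained MFLOPS figure from the per-step
   * operation count quoted above (8 flops x 6 components x (n-1)^3
   * computed points) and a measured run time.  Steps and seconds are
   * illustrative assumptions. */
  int main(void)
  {
      int    n       = 20;     /* cube edge size                   */
      int    steps   = 100;    /* assumed number of time steps     */
      double seconds = 0.25;   /* assumed measured wall-clock time */

      double interior = (double)(n - 1) * (n - 1) * (n - 1);
      double flops_per_step = 8.0 * 6.0 * interior;  /* 329,232 for n = 20 */
      double mflops = flops_per_step * steps / (seconds * 1.0e6);

      printf("%.0f flops per step, %.1f MFLOPS\n", flops_per_step, mflops);
      return 0;
  }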

Analysis

T90 & C90

These computers show similar behavior, with the T90 exhibiting its clock-speed advantage over the C90. The graphs show a roughly linear, stride-independent performance improvement as the grid size increases, with a single large discontinuity. Both systems have a maximum hardware vector length of 128 elements; the discontinuity appears when the cube size exceeds 129, where the vectorized inner loop must be split across more than one vector operation. (The grid face field components, the first or last of each grid dimension, are not computed, so the effective grid size is always one smaller than the actual size.)
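
A small sketch of the strip-mining arithmetic (an illustration, not taken from the benchmark code) shows the efficiency drop just past the vector length:

  #include <stdio.h>

  /* Strip-mining cost sketch: a loop of `len` elements on hardware with
   * maximum vector length `vl` needs ceil(len/vl) vector operations, so
   * the average elements per vector operation drops sharply just past
   * each multiple of vl. */
  int main(void)
  {
      const int vl = 128;                  /* T90/C90 maximum vector length */
      for (int cube = 128; cube <= 132; cube++) {
          int len  = cube - 1;             /* effective (computed) loop length */
          int vops = (len + vl - 1) / vl;  /* strip-mined vector operations    */
          printf("cube %3d: %3d elements -> %d vector op(s), %.1f elems/op\n",
                 cube, len, vops, (double)len / vops);
      }
      return 0;
  }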

J90

The J90 curve is very similar to the C90 and T90 curves, but at a much lower performance level. One difference is that the J90 has a maximum vector length of 64 elements, so its curve shows two discontinuities, at cube sizes 67 and 131.

SV1

Like the other vector computers, the SV1 shows increasing performance for larger grid sizes. Since the grid sizes start at about 1 megabyte (in double precision), the 256 kbyte SV1 SSP cache is not effective. The 1 megabyte cache of the SV1 MSP does help with the smallest benchmark runs.

For both SV1 CPU types, a performance degradation is seen when the grid size is a multiple of 16, more so for the MSP than the SSP. This drop-off is due to memory bank busy conflicts, which can be eliminated through careful programming.
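
One common form such careful programming takes is padding an array dimension so that the stride between successive rows or planes no longer repeatedly hits the same memory banks. The sketch below is a hypothetical helper, with the padding amount chosen purely for illustration:

  #include <stdlib.h>

  /* Illustrative dimension padding to reduce memory bank (or cache set)
   * conflicts: when the innermost dimension is a multiple of 16, pad it
   * slightly so consecutive rows are spread across different banks.
   * The pad of 1 is an assumption for illustration, not the benchmark's
   * actual fix. */
  float *alloc_field(int nx, int ny, int nz, int *nz_padded)
  {
      int pad = (nz % 16 == 0) ? 1 : 0;   /* break the conflicting stride */
      *nz_padded = nz + pad;
      return malloc((size_t)nx * ny * (*nz_padded) * sizeof(float));
  }

In a Fortran implementation, the same effect is usually achieved by declaring the array's leading dimension slightly larger than the computational extent.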

R12000

The R12000 curve has two distinct regions. For smaller grid sizes, the simulation fits within the 8 megabyte secondary cache, resulting in execution speeds over 200 MFLOPS. Around grid size 50x50x50, the simulation spills out of the cache, and the average execution speed drops to 120 MFLOPS.

Otherwise, performance is generally not very sensitive to the grid size, except for a few worst-case points. The R12000 secondary cache is 4-way set associative, while each FDTD field update fetches seven values from main memory. When the grid size produces strides that map several of these values to the same cache location, the cache thrashes and performance becomes very poor. A more sophisticated implementation could avoid these worst-case grid sizes by padding the grid to a slightly larger size.
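
A hedged sketch of this thrashing mechanism is below. The cache line size, set count, and back-to-back array layout are illustrative assumptions (only the 4-way associativity and the seven streamed values come from the text above), and 64x64x64 is chosen as an example of a pathological stride, not identified as one of the observed worst-case points.

  #include <stdio.h>

  /* Conflict-miss sketch, not a model of the real machine.  If the seven
   * arrays read per update are allocated back to back, the byte distance
   * between corresponding elements is the grid volume times the element
   * size; when that distance is a multiple of sets*line_size, every
   * stream maps to the same cache set, and with only 4 ways the set
   * thrashes.  Growing the grid by one point breaks the alignment. */
  int main(void)
  {
      const long line = 64, sets = 1024;   /* assumed cache geometry        */

      for (long n = 64; n <= 65; n++) {    /* 64: pathological, 65: padded  */
          long stride = n * n * n * 4;     /* bytes between adjacent arrays */
          int colliding = 0;
          for (int s = 0; s < 7; s++)      /* the seven streamed values     */
              if ((s * stride / line) % sets == 0)
                  colliding++;
          printf("grid %ldx%ldx%ld: %d of 7 streams map to the same cache set\n",
                 n, n, n, colliding);
      }
      return 0;
  }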

Hardware Configuration

SV1

The SV1 system was configured with 28 SSPs (standard CPUs) and 1 MSP (a high-performance CPU). Both CPU types have a 300 MHz clock speed. The peak computation speed of a single SV1 SSP is 1.2 gigaflops; the peak speed of a single MSP is 4.8 gigaflops. All calculations are performed in double precision (8 byte word).

T90

The T90 was a 4 processor system with a 440 MHz clock speed. This is slightly less than the standard T90 clock speed of 450 MHz. The peak single CPU computation speed of a standard T90 is 1.8 gigaflops. All calculations are performed in double precision (8 byte word).

C90

The C90 was an 8 processor system with a 240 MHz clock speed. The peak single CPU computation speed of a C90 is 960 megaflops. All calculations are performed in double precision (8 byte word).

J90

The J90se was a 32 processor system with a 100 MHz clock speed. The scalar execution unit, which is not exercised by the FDTD benchmark, runs at 200 MHz. The peak single CPU computation speed of a J90 is 200 megaflops. All calculations are performed in double precision (8 byte word).

R12000

The R12000 was a 300 MHz CPU with an 8 megabyte secondary cache, running in a 64 processor Origin 2000 system. The node board (local) memory size was 512 megabytes. The peak computation speed of an R12000 is 600 megaflops. Calculations can be performed in single precision (4 byte word) or in double precision (8 byte word).


Copyright © Cray Inc.
Maintained by Kevin Thomas (kjt@cray.com).
Last modified Mon Sep 20 15:59:04 CDT 1999