Abstract: dvi (4K), pdf (82K), ps (32K).
Paper: dvi (36K), pdf (223K), ps (100K).
We are concerned with implementing BLAS level 3 (BLAS-3) for single precision matrices on the AP1000, with emphasis on obtaining the highest possible performance without significantly sacrificing numerical stability. This paper discusses the techniques used to achieve this goal, together with the underlying issues.
The most important techniques are the use of software pipelining and loop unrolling for writing optimized assembler inner loops for matrix inner and outer products, which are able to operate at more than 90 percent and 70 percent, respectively, of the AP1000's theoretical peak performance.
The efficiency of cell communication using wormhole routing on the AP1000, especially the row/column broadcast, enables a sustained performance of 80 percent to 90 percent of the theoretical peak for all the BLAS-3 routines. It also means that many variations (using different communication schemes) for matrix multiplication have more or less equivalent performance. However, for future versions of the AP1000, optimizing communication must still be considered.
Techniques for improving the performance for large matrices (partitioning, to improve cache utilization) and for small matrices (minimizing communication) are employed. The latter have been developed for general rectangular AP1000 configurations.
Go to next publication
Return to Richard Brent's index page