rpb131

The implementation of BLAS level 3 on the AP 1000

131. P. E. Strazdins and R. P. Brent, The implementation of BLAS level 3 on the AP 1000: Preliminary report, in [129], H1-H17.

Abstract: dvi (4K), pdf (82K), ps (32K).

Paper: dvi (36K), pdf (223K), ps (100K).

Abstract

The Basic Linear Algebra Subprogram (BLAS) library is widely used in many supercomputing applications, and is used to implement more extensive linear algebra subroutine libraries, such as LINPACK and LAPACK. The use of BLAS aids in the clarity, portability and maintenance of mathematical software. To take advantage of the high degree of parallelism of architectures such as the Fujitsu AP1000, BLAS level 3 routines (matrix-matrix operations) are proposed. These routines are not I/O bound; for n × n matrices, the number of arithmetic operations is O(n³), whereas the number of I/O operations is only O(n²).

We are concerned with implementing BLAS level 3 (BLAS-3) for single precision matrices on the AP1000, with emphasis on obtaining the highest possible performance without significantly sacrificing numerical stability. This paper discusses the techniques used to achieve this goal, together with the underlying issues.

The most important techniques are the use of software pipelining and loop unrolling for writing optimized assembler inner loops for matrix inner and outer products, which are able to operate at more than 90 percent and 70 percent, respectively, of the AP1000's theoretical peak performance.

The efficiency of cell communication using wormhole routing on the AP1000, especially the row/column broadcast, enables a sustained performance of 80 percent to 90 percent of the theoretical peak for all the BLAS-3 routines. It also means that many variations (using different communication schemes) for matrix multiplication have more or less equivalent performance. However, for future versions of the AP1000, optimizing communication must still be considered.

Techniques for improving the performance for large matrices (partitioning, to improve cache utilization) and for small matrices (minimizing communication) are employed. The latter have been developed for general rectangular AP1000 configurations.

Comments

An even more "preliminary" report appeared as [121]. A "final" (slightly shorter) version appeared as [136]. For related work see [128, 130].

Go to next publication

Return to Richard Brent's index page