rpb121

Implementing BLAS level 3 on the CAP-II

121. P. E. Strazdins and R. P. Brent, Implementing BLAS level 3 on the CAP-II, in [123], 121-129.

Abstract: dvi (3K), pdf (79K), ps (30K).

Paper: dvi (18K), pdf (151K), ps (69K).

Abstract

The Basic Linear Algebra Subprogram (BLAS) library is widely used in many supercomputing applications, and is used to implement more extensive linear algebra subroutine libraries, such as LINPACK and LAPACK. The use of BLAS aids in the clarity, portability and maintenance of mathematical software. BLAS level 1 routines involve vector-vector operations, level 2 routines involve matrix-vector operations, and level 3 routines involve matrix-matrix operations. To take advantage of the high degree of parallelism of architectures such as the Fujitsu CAP-II, BLAS level 3 routines are desirable. These routines are not I/O bound; for n × n matrices, the order of arithmetic operations is O(n³) whereas the order of I/O operations is only O(n²).

We are concerned with implementing BLAS level 3 for real matrices on the CAP-II, with emphasis on obtaining the highest possible performance, without sacrificing numerical stability. While the CAP-II has many features that make it very well-suited for this purpose, there are also many new challenges in implementing BLAS level-3 on a distributed memory parallel computer (these are currently being considered also by the authors of BLAS-3, who designed it primarily for cache and shared memory architectures).

One such challenge is its external interface: BLAS-3 subroutines can be called by the host program, with the CAP array used like a (very powerful) floating point unit. Alternatively, CAP cell equivalents of BLAS-3 subroutines may be called by the cell programs; this approach is more efficient but deviates from the BLAS standards. These issues are discussed.

We also discuss the high-level design of basic parallel matrix multiplication, transposition and triangular matrix inversion algorithms to be used by the BLAS-3 subroutines. With the efficient row/column broadcast available on the CAP-II using wormhole routing, "semi-systolic" algorithms, with a low startup time, appear to be superior to other algorithms. It is hoped that input from the workshop can help to finalize details of the high-level design.

On the lower levels of design, optimization of cell program codes (e.g. optimizing inner product or gaxpy loops, use of the SPARC cache) must be considered. Also relevant is the degree of optimization possible from the available CAP-II cell program compilers.

Comments

For related work see [130, 131, 136].

Go to next publication

Return to Richard Brent's index page