Implementation of the BLAS level 3
and Linpack benchmark on the AP1000
136. R. P. Brent and
P. E. Strazdins,
Implementation of the BLAS level 3 and Linpack benchmark on the AP1000,
Fujitsu Scientific and Technical Journal 29, 1 (March 1993),
61-70.
Abstract: dvi (3K), pdf (68K), ps (29K).
Paper: dvi (27K), pdf (179K), ps (86K).
Abstract
This paper describes an implementation of Level 3 of the Basic Linear Algebra
Subprograms (BLAS3) library and the Linpack benchmark on the Fujitsu
AP1000. The performance of these applications is regarded as
important for distributed memory architectures such as the AP1000. We
discuss the techniques involved in optimizing these applications without
significantly sacrificing numerical stability.
Many of these techniques
may also be applied to other numerical applications. They include the use
of software pipelining and loop unrolling to optimize scalar processor
computation, the utilization of fast communication primitives on the AP1000
(particularly row and column broadcasting using wormhole routing),
blocking and partitioning methods, and "fast" algorithms (using fewer
floating-point operations).
These techniques allow us to obtain a performance of
85-90 percent of the AP1000's
theoretical peak speed for the BLAS Level 3
procedures, and up to 80 percent for the Linpack benchmark.
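As a rough illustration of the blocking idea mentioned above, the following sketch multiplies two matrices tile by tile, so that each small tile of the operands is reused several times before moving on. It is a minimal example under stated assumptions, not the paper's AP1000 implementation: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical, the code runs on a single processor, and it shows none of the communication, software pipelining, or loop unrolling techniques.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed one (block x block) tile at a time.

    `block` is a hypothetical tuning parameter; in practice it would be
    chosen so that the tiles in use fit in fast memory.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # tile rows of C
        for j0 in range(0, m, block):      # tile columns of C
            for p0 in range(0, k, block):  # tiles along the inner dimension
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for j in range(j0, min(j0 + block, m)):
                        s = row_c[j]       # accumulate into the partial sum
                        for p in range(p0, min(p0 + block, k)):
                            s += row_a[p] * b[p][j]
                        row_c[j] = s
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(matmul_blocked(a, b, block=2))
```

Both functions accumulate each entry's terms in the same order here, so the results agree exactly; a blocked scheme that reorders the summation could differ from the naive result in the last few bits for general floating-point data.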
Comments
For related work see [128, 130, 131].