Abstract: dvi (3K), pdf (68K), ps (29K).
Paper: dvi (27K), pdf (179K), ps (86K).
Many of these techniques may also be applied to other numerical applications. They include software pipelining and loop unrolling to optimize scalar processor computation, fast communication primitives on the AP1000 (particularly row and column broadcasting using wormhole routing), blocking and partitioning methods, and "fast" algorithms (which use fewer floating-point operations). With these techniques we obtain 85-90 percent of the AP1000's theoretical peak speed for the Level 3 BLAS procedures, and up to 80 percent for the Linpack benchmark.
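As a rough illustration of two of the techniques named above, blocking and loop unrolling, the following sketch multiplies two small matrices block by block with a two-way unrolled inner loop. It is not the AP1000 implementation described in the paper: the function names `matmul_blocked` and `matmul_naive`, the `block` parameter, and the unroll factor are assumptions made for this example, and in Python the code shows only the loop structure, not any speed-up.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(a, b, block=4):
    """Blocked product of list-of-lists matrices (hypothetical helper).

    The three outer loops walk over block-sized tiles, so each tile of
    a, b and c is reused while it would still be in fast memory on a
    real machine. The innermost loop over j is unrolled by two.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        i_end = min(i0 + block, n)
        for p0 in range(0, k, block):
            p_end = min(p0 + block, k)
            for j0 in range(0, m, block):
                j_end = min(j0 + block, m)
                for i in range(i0, i_end):
                    row_c = c[i]
                    for p in range(p0, p_end):
                        aip = a[i][p]
                        row_b = b[p]
                        j = j0
                        # Unrolled by two: two updates per iteration.
                        while j + 1 < j_end:
                            row_c[j] += aip * row_b[j]
                            row_c[j + 1] += aip * row_b[j + 1]
                            j += 2
                        # Clean-up step when the tile width is odd.
                        if j < j_end:
                            row_c[j] += aip * row_b[j]
    return c
```

The block size would normally be chosen so that the working tiles fit in cache or registers; the value 4 here is arbitrary.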
Return to Richard Brent's index page