### Solution of Toeplitz Systems of Equations A Systolic Array for the Linear-Time RICHARD P. BRENT and FRANKLIN T. LUK # JOURNAL OF VLSI AND COMPUTER SYSTEMS, VOLUME I, NUMBER 1 ### Solution of Toeplitz Systems of Equations\* A Systolic Array for the Linear-Time ## RICHARD P. BRENT and FRANKLIN T. LUK shown to require only $\theta(n)$ time and $\theta(n)$ storage, i.e. constant storage per systolic processor. Key words and phrases: Systolic arrays, Toeplitz matrices, linear equations, Bareiss dimensional systolic architecture is studied. Our implementation of an algorithm of Bareiss is Abstract—The solution of an $(n+1) \times (n+1)$ Toeplitz system of linear equations on a one- #### INTRODUCTION cated and may be slower than the $0(n^2)$ -time methods if n is not sufficiently of Levinson [19], Durbin [10], Trench [28], Zohar [29], and Barciss [2], all large (cf. Sexton [24]). recursions due to Schur [23] and Szego [27]. See also [6, 8, 9, 11, 12, 13, require time $0(n^2)$ and space 0(n) or $0(n^2)$ . These algorithms are based on tems, require only $0(n \log^2 n)$ time and 0(n) space. The well known algorithms son, and Yun [4] proposed procedures which, when applied to order-(n+1) sysessential (cf. [25]). Recently, Bitmead and Anderson [3], and Brent, Gustav-14, 18, 20]. Unfortunately, the $0(n \log^2 n)$ -time algorithms are quite compliapplications, for some of which (e.g., signal processing) real-time solution is Toeplitz systems of linear equations arise in many scientific and engineering array of O(n) processors. We present an implementation of the Barciss algodently by S.Y. Kung and Hu [16] (see also [17, 22]) and Ahmed, Delosme, and rithm that requires 0(n) time. Related work has recently been done indepen-In this paper we study the efficient solution of Toeplitz systems on a systolic <sup>\*</sup>Supported in part by U.S. Army Research Office grant DAAG 29-79-0124 and in part by the Centre for Mathematical Analysis and the Mathematical Sciences Research Centre at the Australi- Department of Computer Science, Australian National University, Camberra, ACT 2600, Aus- <sup>\*</sup>Department of Computer Science, Cornell University, Ithaca, New York Morf [1]. However, our results are more desirable in two respects: we do not assume that the Toeplitz matrix is symmetric (although we can take advantage of this), and our storage requirements are 0(n) instead of $0(n^2)$ . This saving in storage is significant because large Toeplitz systems often arise in filtering and signal processing applications [10, 19, 25, 26]: values of n greater than 1000 are common. Kung and Hu use a slightly different procedure, the so-called symmetric Bareiss algorithm (see [2]). In Section 7 we show how it is also possible to implement the symmetric Bareiss method using only 0(n) storage, except in a certain degenerate case, but the unsymmetric Bareiss method may still be preferred for reasons of numerical stability. Ahmed, Delosme, and Morf [1] use the HC algorithm [21] which requires the matrix to be positive definite. ### 2. THE BAREISS ALGORITHM Let us describe the Bareiss algorithm [2] for the solution of $$\mathbf{T}\mathbf{x} = \mathbf{b}\,,\tag{2.1}$$ where T is a Toeplitz matrix of order (n + 1) and **b** a column vector: $$T = \begin{pmatrix} t_0 & t_1 & t_2 & \cdots & t_n \\ t_{-1} & t_0 & t_1 & \cdots & t_{n-1} \\ t_{-2} & t_{-1} & t_0 & \cdots & t_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{-n} & t_{-n+1} & t_{-n+2} & \cdots & t_0 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$$ (2.2) For convenience we let the row and column indices run from 0 to n in this paper. The idea of Barciss is to transform (2.1) successively into $$T^{(-1)}\mathbf{x} = \mathbf{b}^{(-1)}, T^{(1)}\mathbf{x} = \mathbf{b}^{(1)}, T^{(-2)}\mathbf{x} = \mathbf{b}^{(-2)}, T^{(2)}\mathbf{x} = \mathbf{b}^{(2)},$$ (2.3) $\cdots, T^{(-n)}\mathbf{x} = \mathbf{b}^{(-n)}, T^{(n)}\mathbf{x} = \mathbf{b}^{(n)},$ so that the final matrices $T^{(-n)}$ and $T^{(n)}$ are upper and lower triangular, respectively. We introduce the shift matrices $$Z_{-i} = (z_{jk}^{(-i)}) = (\delta_{j-k-i}) \text{ and } Z_i = (z_{jk}^{(i)}) = (\delta_{j-k+i}),$$ (2.4) where $\delta$ is the Kronecker delta, and let $T^{(0)} = T$ , $b^{(0)} = b$ . The following recurrences then define the transformations on T and b: $$\mathbf{T}^{(-i)} = \mathbf{T}^{(-i+1)} - m_{-i} \mathbf{Z}_{-i} \mathbf{T}^{(i-1)}, \quad \text{with } m_{-i} = \frac{t_{i,0}^{(1-i)}}{t_0},$$ $$\mathbf{b}^{(-i)} = \mathbf{b}^{(-i+1)} - m_{-i} \mathbf{Z}_{-i} \mathbf{b}^{(i-1)},$$ $$\mathbf{T}^{(i)} = \mathbf{T}^{(i-1)} - m_i \mathbf{Z}_i \mathbf{T}^{(-i)}, \quad \text{with } m_i = \frac{t_{0,i}^{(i-1)}}{t_{n,n}},$$ (2.5) $$\mathbf{b}^{(i)} = \mathbf{b}^{(i-1)} - m_i Z_i \mathbf{b}^{(-i)}$$ In (2.5) the effect of premultiplication by $Z_{-i}$ is to downshift the matrix $T^{(i-1)}$ by i rows and to replace its first i rows by zeros. Similarly, $Z_i$ upshifts $T^{(-i)}$ by i rows and replaces its last i rows by zeros. Suppose that the matrix $T^{(-i+1)}$ (resp. $T^{(i-1)}$ ) has (i-1) null subdiagonals (resp. superdiagonals). The operations described by (2.5) will annihilate the i-th subdiagonal (resp. superdiagonal) without disturbing those already null elements. It follows that the matrix $T^{(-n)}$ (resp. $T^{(n)}$ ) will be upper (resp. lower) triangular. **Figure 1.** Structure of $T^{(-1)}$ and $T^{(1)}$ in the Bareiss algorithm. lower triangular, of T is then given by (c.f. Sweet [26]) trices of the given matrix T are nonsingular. An LU-factorization, with L unit The Bareiss method will not fail as long as all the leading principal subma- where $$L = \frac{1}{t_0} (T^{(n)})^{T_2},$$ $$U = T^{(-n)},$$ (2.6) positive definite, so the lack of pivoting may not be too severe a handicap. nation without pivoting. In many applications T is either diagonally dominant or The Bareiss method, thus, has the same numerical properties as Gaussian elimiand the superscript "T2" denotes matrix transpose about the main antidiagonal cuss a symmetric variant of the Bareiss alogrithm that can be more efficient when T is symmetric. algorithms, see Sweet [26] and Cybenko [7]. In Section 7 we shall briefly disclosely related, but the Bareiss algorithm appears to be more amenable to parallel computation. For further comments on the numerical properties of these Sweet [26] shows that the Bareiss and the Trench-Zohar algorithms are ## A SYSTOLIC ARRAY FOR FACTORIZING T n-1. The output line outd<sub>-k</sub> is connected to the input line ind<sub>-k+1</sub>, for The output line out $u_k$ is connected to the input line $inu_{k-1}$ , for $k=1,2,\ldots$ , sor has two output lines outu<sub>k</sub> and outd<sub>k</sub>, and two input lines inu<sub>k</sub> and ind<sub>k</sub>. registers $U_k$ and $D_k$ . We denote the content of register R by [R]. Each processors are identical except for the middle one P<sub>0</sub>. For simplicity we first assume angular matrices $T^{(-n)}$ and $T^{(n)}$ . The array consists of 2n-1 processors $P_{-n+1}$ , $U_{0-}$ , $U_{0+}$ , $D_{0-}$ , and $D_{0+}$ , and each remaining processor $P_k(k \neq 0)$ has two that P<sub>0</sub> can broadcast a scalar quantity to all other processors in constant time. This assumption will soon be dropped. The processor P<sub>0</sub> has four registers $P_{-n+2}, \ldots, P_{-1}, P_0, P_1, \ldots, P_{n-1}$ , arranged from left to right. All process We present here a one-dimensional systolic array for computing the two tri- The processor array for n = 4. BRENT AND LUK: A SYSTOLIC ARRAY Before iterating, we feed data into the array so that $$[U_{0-}] = t_0, \quad [U_{0+}] = t_1, [D_{0-}] = t_{-1}, \quad [D_{0+}] = t_0,$$ (3.1) and $$[U_{-k}] = t_{-k}, [U_k] = t_{k+1}, [D_{-k}] = t_{-k-1}, [D_k] = t_k, (3.2)$$ for k=1, 2, ..., n-1. We now consider one iteration, say the *i*-th one. Each iteration consists of three steps. At step one the processor $P_0$ computes the multiplier $$m_{-i} = \frac{[D_{0-}]}{[U_{0-}]}.$$ (3.3) This value is broadcast to all processors. Processor $P_k$ , all k, now computes (i) $$\equiv [D_k] - m_{-i}[U_k]$$ . (3.4) The quantity $[D_k]$ , for $k \ge 0$ , is sent out on the output line outd<sub>k</sub>, and the result (i) is stored in register $D_k$ . At step two, the processor $P_0$ computes the multiplier $$m_i = \frac{[U_{0+}]}{[D_{0+}]} \tag{3.5}$$ and broadcasts this number. The processor $P_k$ , for all k, then computes (ii) $$\equiv \{U_k\} - m_i\{D_k\}.$$ (3.6) The quantity $|\mathbb{U}_k|$ , for $k \le 0$ , is output on line $\mathrm{out}_k$ , and the register $\mathbb{U}_k$ receives the value of (ii). At the third step, we do data transfer. The content of register $\mathbb{D}_k$ , for $k \le -1$ , is sent to processor $\mathbb{P}_{k+1}$ on line $\mathrm{outd}_k$ . The incoming data for $\mathbb{P}_{k+1}$ is stored in register $\mathbb{D}_{k+1}$ . The content of register $\mathbb{D}_{0-}$ is lost. At the same time, the content of register $\mathbb{U}_k$ , for $k \ge 1$ , is sent to processor $\mathbb{P}_{k-1}$ , on line $\mathrm{outu}_k$ . The incoming data for $\mathbb{P}_{k-1}$ is stored in register $\mathbb{U}_{k-1}$ , and the content of register $\mathbb{U}_{0+}$ is lost. We now disable processors $\mathbb{P}_{-n+i}$ and $\mathbb{P}_{n-i}$ . This completes one iteration of the Bareiss algorithm. Since we disable the two end processors after each iteration, our method must terminate after n iterations. Let us denote the output on line outd<sub>k</sub> at the i-th iteration by $d_k^{(i)}$ and the corresponding output on outu<sub>k</sub> by $u_k^{(i)}$ . The two desired matrices $T^{(-n)}$ and $T^{(n)}$ are given by By transforming the right-hand vector **b** simultaneously (see Section 4), we obtain an upper triangular system $\mathbf{T}^{(-n)}\mathbf{x} = \mathbf{b}^{(-n)}$ to solve. So $\mathbf{T}^{(n)}$ can be discarded. Since $\mathbf{T}^{(-n)}$ is not Toeplitz, the Bareiss algorithm appears to require $O(n^2)$ storage. However, at the expense of some extra computation, we can avoid using more than O(n) storage (the details are presented in Sections 5 and 6). This important point is not observed in [1, 2, 16]. We show now that broadcasting is not needed if each processor $P_k$ (resp. $P_{-k}$ ), k>0, can pass a scalar quantity to its outer neighbor $P_{k+1}$ (resp. $P_{-k-1}$ ) and if $P_0$ can pass a number to both $P_1$ and $P_{-1}$ . Let us describe how the *i*-th iteration (say) proceeds. The middle processor $P_0$ reads the inputs on lines inu<sub>0</sub> and ind<sub>0</sub>, and stores the numbers in registers $U_{0+1}$ and $U_{0+1}$ respectively. The computing starts with $P_0$ calculating the multiplier $m_{-i}$ and passes the value to its two neighbors $P_{-1}$ and $P_1$ . The processor $P_0$ now performs the second step of the iteration by computing the multiplier $m_i$ and again passes its value to processors $P_{-1}$ and $P_1$ . The processor $P_1$ (resp. $P_{-1}$ ), on receiving $m_{-i}$ , will do the computation (3.4) and pass the multiplier to the neighbor $P_2$ (resp. $P_{-1}$ ). Then $P_1$ (resp. $P_{-1}$ ) will receive the multiplier $m_i$ . The operations described in (3.6) are now performed and the value of $m_i$ will be passed to processor $P_2$ (resp. $P_{-1}$ ) completes its share of the iteration by sending the content of register $U_1$ (resp. $D_{-1}$ ) leftward (resp. rightward) to processor $P_0$ . The processors $P_2$ and $P_2$ , on receiving the multipliers $m_{-i}$ and then $m_i$ , will perform the necessary computations, the passing on of the multipliers and finally the shifting of the necessary information toward the middle processor. This process expands outward until the two end processors $P_{n-i}$ and $P_{-n+i}$ have done their tasks, ending the iteration. An important observation here is that the (i+1)-st iteration can start as soon as processor $P_0$ receives the data from its two neighbors. Our technique for avoiding broadcast is quite common [5, 15]: processors are active only on alternate time steps $(P_0, P_{+2}, \dots)$ at time $\tau = 1, 3, \dots$ and $P_{-1}, P_{+3}, \dots$ at time $\tau = 2, 4, \dots$ ), and the operations of processors $P_{-1}$ are delayed by k time steps relative to the operation of processor $P_0$ . Suppose that the given matrix T is banded with half-bandwidth w. We can, of course, disregard this special structure and still use 2n-1 processors. But as processors $P_{\pm (w-1)}$ , $P_{\pm w}$ , ..., $P_{\pm (n-1)}$ work only with null data, we may perform the elimination using only 2w-3 processors if the input lines inu $_{w-2}$ and ind $_{-w-1,2}$ always carry the number zero. We will disable the processors $P_{n-i}$ and $P_{-n+i}$ at the end of the i-th iteration, for $i=n-w+2, n-w+3, \ldots, n-1$ . Example 1. (Bareiss [2, p. 415]). $$T = 120 \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 2 & 1 & 2 & 3 & 4 \\ 3 & 2 & 1 & 2 & 3 \\ 4 & 3 & 2 & 1 & 2 \\ 5 & 4 & 3 & 2 & 1 \end{pmatrix}$$ | 2.2 1112 - | 2.3 | 3.2 m 1 = | 3, | 4.2 m <sub>4</sub> = | - | |------------|---------|----------------------------|---------|----------------------|---------| | 40<br>320 | shift : | $3.2 m_1 = \frac{30}{300}$ | shift • | 24 | shift * | | <b>x</b> - | | 5 - | | 2 - | | | 180 | 180 | | | | | | | | | | | | | 150 | 150 | 4 | 4 | | | | 120 | 120 | 120 | 120 | 120 | 120 | | c | ő | 0 | 24 | c | | | Ē | 3 | 24 | | | | | 3 | | | | | | | | 4.3 s | $4.1 \ m_{-4} = \frac{-60}{120}$ | 3.3 s | $3.1 \ m_{-3} = \frac{-80}{120}$ | 2.3 s | 2.1 m. <sub>2</sub> = | 1.3 s | 1.1 m -1. | | | 1.2 m <sub>1</sub> = - | 1.3 s | |----------------|---------|----------------------------------|----------|----------------------------------|----------|-------------------------|-----------|--------------------------------|-----|-----|-----------------------------------|---------| | | shift → | $\frac{-60}{120} = -\frac{1}{2}$ | shift→ | $\frac{-80}{120} = -\frac{2}{3}$ | shift → | $\frac{-120}{120} = -1$ | shift → | $m_{-1} = \frac{240}{120} = 2$ | | | $\frac{240}{-360} = -\frac{2}{3}$ | shift ← | | P | | | | | | | | - 360 | 600 | 480 | 240 | 240 | | P2 | | | | | | 160 | - 360 | -240 | 480 | 360 | 200 | 200 | | P | | | | 60 | -160 | -80 | -240 | -120 | 360 | 240 | 160 | 160 | | | | 0 | -60 -300 | 0 - | -80 -320 | 0 | -120 -360 | 0 - | 240 | 120 | 120 | 120 | | Po | -288 | -288 | 300 | 0 300 | -320 | 0 -320 | 360 | 0 - 360 | 120 | 240 | 0 | 8 | | P | | | - 360 | -360 | -400 | 400 | -480 | 480 | 240 | 360 | <del>\$</del> | 8 | | Р2 | | | | | - 480 | 480 | -600 | -600 | 360 | 480 | 80 | 120 | | P <sub>3</sub> | | | | | | | 720 | -720 | 480 | 600 | 120 | | Hence | | | T(-4 | | | |------|------|---------|---------|-------| | | | T(-4) = | | [ 120 | | 0 | | | - 360 - | 240 | | | | -320 | -480 | 360 | | | -300 | -400 | -600 | 480 | | -288 | -360 | -480 | -720 | 600 | and | | | $T^{(4)} =$ | | | |------|-----------|-------------|-----|-----| | 600 | 240 | 180 | 44 | 120 | | 480 | 200 | 150 | 120 | | | .360 | <u>(8</u> | 120 | | | | 240 | 120 | | 0 | | | 120 | | | | | | | | | | | Note that T = LU, where $U = T^{(-4)}$ and $$L = \frac{1}{120} (T^{(4)})^{T2} = \begin{pmatrix} 1 & 0 \\ 2 & 1 & 0 \\ 3 & \frac{4}{3} & 1 \\ 4 & \frac{5}{3} & \frac{5}{4} & 1 \\ 5 & 2 & \frac{3}{2} & \frac{6}{5} & 1 \end{pmatrix}$$ ### MODIFYING THE RIGHT-HAND VECTOR We want to find the vector $\mathbf{b}^{(-n)}$ that satisfies $$\mathbf{T}^{(-n)}\mathbf{x} = \mathbf{b}^{(-n)}$$ We determine this vector from **b** using the recurrences (2.5). We, thus, need only the multipliers generated in the factorization phase and not the lower triangular matrix $\mathbf{T}^{(n)}$ . The structure and the operations of the systolic array are very similar to the right-half of the systolic architecture of the previous section. The array here consists of the n processors $Q_0, Q_1, \ldots, Q_{n-1}$ . All the processors are identical. Figure 3. Systolic Array for Finding $b^{(-n)}$ (n = 4). Each $Q_k$ has two registers $U_k$ and $D_k$ , one output line outd<sub>k</sub> and one input line ind<sub>k</sub>. The lines ind<sub>k</sub> and outd<sub>k+1</sub> are connected (for $k=0,1,\ldots,n-2$ ). We also assume temporarily that all processors can receive the broadcast of a scalar quantity. Before iteration starts, data are fed into the processors so that, for $(=0,1,\ldots,n-1)$ , $$\begin{aligned} |\mathsf{U}_k| &= b_k \,. \\ |\mathsf{D}_k| &= b_{k+1} \,. \end{aligned} \tag{4.1}$$ where $\mathbf{b} = (b_0, b_1, \dots, b_n)^T$ . Let us consider the *i*-th iteration, where $i \ge 1$ . An iteration consists of three steps, same as in the previous section. At the first step, the multiplier $m_{-i}$ arrives at processor $Q_k$ , which then computes $$(iii) = [D_k] - m_{-i} [U_k],$$ (4.2) and stores (iii) in register $D_k$ . The second step begins when $Q_k$ receives the multiplier $m_i$ . It computes $$(iv) = [U_k] - m_i [D_k], \qquad (4)$$ and stores the result in $U_k$ . The third step is now initiated. It involves a transfer of the content of register $D_k$ from processor $Q_k$ to processor $Q_{k-1}$ , for $k \ge 1$ . The content of $D_1$ is sent out on outd<sub>0</sub> and register $D_k$ , for $k \ge 0$ , receives the content of register $D_{k+1}$ . The completion of the data transfer ends the *i*-th iteration and processor $Q_{n-i}$ is disabled. If we denote the output on line outd<sub>0</sub> after the *i*-th iteration by $d_0^{(k)}$ , then the vector $\mathbf{b}^{(-n)}$ is given by $$\mathbf{b}^{(-n)} = \begin{pmatrix} b_0 \\ d_0^{(1)} \\ d_0^{(2)} \\ \vdots \\ \vdots \\ d_0^{(n)} \end{pmatrix}$$ $$(4.4)$$ Since the algorithms in this and the previous sections are very similar, we can argue using the same reasonings as before that broadcasting is unnecessary if each processor $Q_k$ can pass the multiplier to its right neighbor $Q_{k+1}$ and if the operation of processor $Q_k$ is delayed by k time steps relative to the operation of processor $Q_0$ . ### **Example 2.** (Bareiss [2, p. 415)] | , | | b = 120 | | | |----|----|---------|----|----| | 30 | 20 | 8 | 22 | 30 | | | | | | | $$m_4 = -\frac{1}{12}$$ $$m_3 = -\frac{1}{10}$$ $$120$$ $$384$$ $$m_2 = -\frac{1}{8}$$ $$240$$ $$390$$ $$840$$ $$m_1 = -\frac{2}{3}$$ $$3600$$ $$2640$$ $$2640$$ $$2160$$ $$2400$$ $$2640$$ $$2160$$ $$2400$$ $$3600$$ $$m_{-1} = 2$$ $$-4560$$ $$-3120$$ $$-1920$$ $$-1200$$ $$\sinhift \leftarrow$$ $$-2560$$ $$-1360$$ $$-320$$ $$m_{-3} = -\frac{2}{3}$$ $$-1200$$ $$-60$$ 2.2 1.2 $$\mathbf{b}^{(-4)} = \begin{pmatrix} 3600 \\ -4560 \\ -2560 \\ -1200 \\ 0 \end{pmatrix}$$ ## 5. REGENERATING T<sup>(-n)</sup> USING 0(n) STORAGE We consider the regeneration of the upper triangular matrix $T^{(-n)}$ using only its last column and the 2n multipliers $m_{\pm i}$ . Our key idea is to run the elimination algorithm in Section 3 backwards: $$\mathbf{T}^{(i-1)} = \mathbf{T}^{(i)} + m_i Z_i \mathbf{T}^{(-i)}.$$ $$\mathbf{T}^{(-i+1)} = \mathbf{T}^{(-i)} + m_{-i} Z_{-i} \mathbf{T}^{(i-1)}.$$ (5.1) for $i = n, n-1, \ldots, 1$ . (Observe that rows 0 to i for $T^{(-n)}$ are equal to rows 0 to i of $T^{(-i)}$ .) So our systolic array consists of n identical processors $B_0, B_1, \ldots, B_{n-1}$ . Each processor $B_k$ has two registers $U_k$ and $D_k$ . Initially, $$\{U_k\} = 0$$ , and $\{D_k\} = t_{n-k,n}^{(-n)}$ . (5.2) for $k=0, 1, \ldots, n-1$ . We again assume for a moment that there is a broadcasting mechanism. Each processor $B_k$ has two output lines out $u_k$ and out $d_k$ and one input line in $u_k$ . The lines out $u_k$ and in $u_{k+1}$ are connected, for $k \ge 0$ . **Figure 4.** A systolic array for generating $T^{(-n)}$ (n = 4). 4,3 shift ← 0 O 0 2 Ç 4.1 $m_{-4} =$ 3.3 shift ← -1200 -60 3.1 1.1 1.3 2.1 2.3 following computation is done: Each iteration consists of three steps. Let us describe the i-th iteration. At the first step, the multiplier $m_{n+1-i}$ is broadcast to all the processors and the the (i+1) processors $B_0, B_1, \ldots, B_i$ are active during the (i+1)-st iteration. iteration, $i \ge 1$ , processor B, is activated for subsequent computations so that Only one processor, $B_0$ , is active for the initial iteration. At the end of the *i*-th $$(v) = \{ U_k \} + m_{n+1-i} \{ D_k \}, \tag{5.3}$$ $B_k$ $(0 \le k \le i - 1)$ then computes starts when the multiplier $m_{-n-1+i}$ is broadcast to all processors. Processor for k = 0, 1, ..., i-1. The result (v) is stored in register $U_k$ . The second step $$(vi) \equiv [D_k] + m_{-n-1+i} [U_k]. \tag{5.4}$$ k=0, 1, ..., i-1. Register $U_0$ will contain the number zero. The complete procedure stops after n iterations. outputs the result (vi) on outd<sub>k</sub> and also stores the number in register $D_k$ . The third step is but a shifting of the content of register $U_k$ to register $U_{k+1}$ , for matrix $T^{(-n)}$ is given by If we denote the output on line outd<sub>k</sub> at the *i*-th iteration by $d_k^{(i)}$ , the desired | | T(-") = | | |----------------------------------------------------------------------------------------|---------|------------------------------------| | 0 | | $d_0^{(n)}$ | | | | $d_{1}^{(m)}$ | | $d_0^{(2)}$ | | : : | | d₁ <sup>(2)</sup> | | $d_{n-1}^{(n)} \\ d_{n-2}^{(n-1)}$ | | $\begin{pmatrix} (-n) \\ t_{n-2,n} \\ \binom{(-n)}{n-1,n} \\ t_{n,n} \\ \end{pmatrix}$ | | $\mathbf{t}_{1,n}^{(-n)}$ | | | (5.5) | | cessor can pass a scalar quantity to its right neighbor. As before, we can argue that broadcasting is unnecessary as long as each pro- tolic array, it is interesting to note that we have regenerated the elements of substitution [15]. The details are given in the next section. $\mathbf{T}^{(-n)}$ in the exact order as required by the Kung-Leiserson algorithm for back Since our primary concern is the solution of $T^{(-n)}x = b^{(-n)}$ on a linear sys- ### Example 3. (See Example 1). | $1.2 m_{-4} = -\frac{1}{2}$ $2.2 m_{-3} = -\frac{2}{3}$ $3.2 m_{-2} = -1$ $4.2 m_{-1} = 2$ | | 1.1 $m_4 = -\frac{1}{12}$ | 1.3 shift $\rightarrow$ | $2.1 \ m_3 = -\frac{1}{10}$ | 2.3 shift $\rightarrow$ | $3.1 \ m_2 = -\frac{1}{8}$ | 3.3 shift $\rightarrow$ | $4.1 \ m_1 = -\frac{2}{3}$ | 4.3 shift→ | • | |--------------------------------------------------------------------------------------------|-----------|---------------------------|-------------------------|-----------------------------|-------------------------|----------------------------|-------------------------|----------------------------|------------|---| | -300<br>-320<br>-360<br>-360<br>120<br>B <sub>0</sub> | -288 | 24 | 0 | 30 | 0 | 40 | 0 | 240 | 0 | | | -400<br>-480<br>240<br>B <sub>1</sub> | -360 | | 24 | 60 | 30 | 80 | 40 | 360 | 240. | | | 600<br>360<br>В <sub>2</sub> | 0<br>-480 | | | | 8 | 120 | 80 | 480 | 360 | | | 480<br>B <sub>3</sub> | 0<br>-720 | | | | | | 120 | 600 | 480 | | | | | | | | | | | | | | #### We get | | | T <sup>(-4)</sup> | | | |------|------|-------------------|------|-----| | | | 11 | | | | | | | | 120 | | | 0 | | -360 | 240 | | | | -320 | -480 | 360 | | | -300 | -400 | -600 | 480 | | -288 | -360 | -480 | -720 | 600 | | - | | | | | ### BRENT AND LUK: A SYSTOLIC ARRAY ### **6.** A COMPLETE ARCHITECTURE We can construct one systolic array that solves the given equations $T\mathbf{x} = \mathbf{b}$ . Because of the similarities in their operations, processors $P_{\pm k}$ and $Q_k$ $(k \ge 0)$ are combined into one super-processor $S_k$ $(k \ge 0)$ . We then program $S_k$ $(k \ge 0)$ to do the regeneration of $\mathbf{T}^{(-n)}$ and the solution of $\mathbf{T}^{(-n)}\mathbf{x} = \mathbf{b}^{(-n)}$ . Let us describe our linear array of n+1 super-processors $S_0$ , $S_1$ , ..., $S_n$ . (The last processor $S_n$ is needed for the back substitution.) In the Bareiss algorithm four triangular Toeplitz matrices $$a = \begin{bmatrix} a_0 & 0 \\ a_1 & a_1 \\ a_0 \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 & \beta_1 \\ \beta_1 \\ 0 & \beta_0 \end{bmatrix}, \quad \gamma = \begin{bmatrix} \gamma_0 & 0 \\ \gamma_1 & \gamma_0 \\ \gamma_1 & \gamma_0 \end{bmatrix}, \text{ and } \delta = \begin{bmatrix} \delta_0 & \delta_1 \\ \delta_1 & \delta_1 \\ 0 & \delta_0 \end{bmatrix}$$ are updated (see Figure 1). Now each processor $S_k$ has registers to store $\alpha_k$ , $\beta_k$ , $\gamma_k$ , and $\delta_k$ . (When describing processor $S_k$ we shall omit the subscripts and simply refer to registers $\alpha$ , $\beta$ , $\gamma$ , and $\delta$ .) Processor $S_k$ requires four additional registers: $\lambda_k$ for a multiplier $m_{-j}$ , $\mu_k$ for a multiplier $m_{+j}$ , and $\xi_k$ and $\eta_k$ which are associated with the right-hand side vector **b** and the solution **x**. Data flows in both directions between adjacent processors, as shown in Figure 5. Hence, each processor needs five input and five output data paths denoted by <u>inL1</u>, <u>inL2</u>, <u>inR1</u>, <u>inR2</u>, <u>inR3</u>, <u>outL1</u>, <u>outL2</u>, <u>outL3</u>, <u>outR1</u>, and <u>outR2</u> (see Figure 6). Phase 1 (LU decomposition by the Bareiss algorithm) Phase 2 (Back substitution to solve triangular system) Figure 5. Data flow for systolic Toeplitz solver. Figure 6. Systolic processor for Toeplitz systems. Initialization is as follows: $\alpha_k := t_{-(k+1)}$ ; $\beta_k := t_k$ ; $\gamma_k := t_{-k}$ ; $\delta_k := t_{k+1}$ ; $\lambda_k := 0$ ; $\mu_k := 0$ ; $\xi_k := b_{n-k-1}$ ; $\eta_k := b_{n-k}$ ; all for $0 \le k \le n$ (we assume that $t_{-(n+1)} = t_{n+1} = b_{-1} = 0$ to cover end-conditions). Clearly this can be done in time 0(n) if T and b are available at either end of the systolic array. We present the program executed by processor $S_k$ $(0 \le k \le n)$ at time step $\tau$ $1 \le \tau \le 4n$ ) in Figure 7. The final solution $\mathbf{x}$ is given by $x_k = \xi_k$ , where $\xi_k$ is stored in register $\xi$ of processor $S_k$ after step 4n. What follows are some observations concerning the program: - 1. Processor $S_k$ is active only if $k < \tau < 2n k$ (Phase 1) or $2n + k \le \tau \le 4n k$ (Phase 2). It is assumed that $S_k$ knows its index k and the current value of $\tau$ (though this could be avoided by the use of 1-bit systolic control paths). - 2. Pairs of adjacent processors could be combined, since only one processor of each pair is active at each time step. This would increase the mean processor utilization from 25% to 50% (see observation I above). - 3. Processor S<sub>0</sub> performs floating-point divisions, other processors perform only additions and multiplications. A time step has to be long enough for six floating-point additions and multiplications, plus data transfers, during Phase 1 (less during Phase 2). - 4. The Bareiss algorithm requires $4 \cdot 5n^2$ multiplications as given, but a simple modification (transmitting $1+\lambda \mu$ ) will give time 4n if we have $\lceil \frac{n}{2} \rceil$ processors (see observation 2), each with six multiply-add units. The corresponding figures for the symmetric Bareiss algorithm for symmetric matrices (see Section 7) are $4n^2$ , 4n, and 5. - 5. An alternative for Phase 2 is the use of the Gohberg-Semencul formula [11]. But the formula is more expensive in terms of both operations and time, and it also fails to take advantage of the possible band structure of the matrix. BRENT AND LUK: A SYSTOLIC ARRAY floating-point computations, and writes to its output lines outL1, ..., outR2. 6. Processor $S_k$ typically reads its input lines inL1,..., inR3, does some tional lines (e.g. inL1 and outL1 could be combined). Hence, pairs of input and output lines could be combined into single bidirec- ``` else if even (\tau + k) and (\tau \ge 2n + k) and (\tau \le 4n - k) then {Phase 2-back substitution} if odd (\tau + k) and (\tau > k) and (\tau < 2n - k) then {Phase 1-LU factorization} {program for processor k at time step \tau, 0 \le k \le n, 1 \le \tau \le 4n} outR1 :=\xi; outR2 := \delta outL1 := \lambda; outL2 := \mu; outL3 := \eta; {ignore if k = 0} \beta := \beta + \lambda * \delta; outL1 := \alpha; outL2 := \delta; outL3 := \xi; {ignore outL1-3 if k = 0} outR1 := \lambda; outR2 := \mu {ignore outR1-2 if k = n} if k = 0 then begin \xi := \eta/\beta; \delta := \mu * \beta end else if \tau > 2n + k then begin \lambda := \text{inR1}; \mu := \text{inR2}; \eta := \text{inR3} end; if k = 0 then {compute multiplier} \mu := \delta/\beta else \beta := \beta - \lambda * \delta; \eta := \eta - \lambda * \xi if k = 0 then {compute multiplier} \lambda := \alpha / \gamma else if \tau > k + 1 then {accept inputs from processor k + 1} \eta := \eta - \beta * \xi; \delta := \delta + \mu * \beta \xi := inL1; \delta := inL2; \mu_*\pi \delta := \delta - \mu * \beta; \alpha := \alpha - \lambda * \gamma \gamma := \gamma - \mu * \alpha; \lambda := \text{inL1}; \mu := \text{inL2}; begin begin {accept multipliers from processor k - 1} begin \alpha := \text{inR1}; \delta := \text{inR2}; \xi := \text{inR3} end; \{ignore\ if\ k=n\} ``` Figure 7. Systolic processor for Toeplitz equation solver. ### 7. SYMMETRIC TOEPLITZ MATRICES algorithm [2]. This symmetric algorithm is the same in its LU-factorization metric Bareiss algorithm is same as (2.5) except that phase as the procedure proposed by S. Y. Kung and Hu ([16], [17]). The sym-For a symmetric matrix T we may use a symmetric version of the Bareiss $$m_{-i} = \frac{t_{i,0}^{(i-1)}}{t_{0,0}^{(i-1)}},$$ $$m_{i} = \frac{t_{0,i}^{(i-1)}}{t_{n,n}^{(i-1)}},$$ $$T^{(i)} = T^{(i-1)} - m_{i} Z_{i} T^{(1-i)}.$$ (7.1) rithm becomes trickier and involves division by $(1-\lambda^2)$ , which may be undesirsome storage and communication cost. The penalty is that Phase 2 of the algoable from a numerical point of view. By symmetry, we get $\alpha = \delta$ , $\beta = \gamma$ and $\lambda = \mu$ (cf. [2] and [26]), and so we save With the initialization: $$\alpha_k = \begin{cases} t_{k+1} & \text{if } 0 \leq k < n, \\ 0 & \text{if } k = n \end{cases}$$ $$\beta_k = t_k & 0 \leq k \leq n,$$ $$\lambda_k = 0 & \cdots$$ $$\mu_k = 0 & \cdots$$ $$\xi_k = \begin{cases} b_{n-k-1} & \text{if } 0 \leq k \leq n, \\ 0 & \text{if } k = n \end{cases}$$ $$\eta_k = b_{n-k} & 0 \leq k < n,$$ We present a program for processor $S_k$ in Figure 8 ``` if odd (\tau + k) and (\tau > k) and (\tau < 2n - k) then {Phase 1 - LU factorization} else if even (\tau + k) and (\tau \ge 2n + k) and (\tau \le 4n - k) then {Phase 2 - back substitution} {program for processor k at time step \tau, 0 \le k \le n, 1 \le \tau \le 4n} if k = 0 then {compute multiplier} if \tau > k + 1 then {accept inputs from processor k + 1} outR1 := \lambda outL1 := \alpha; outL2 := \xi; {ignore if k=0} outR1 := \xi; outR2 := \alpha outL1 := \lambda; outL2 := \eta; \alpha := (\alpha + \lambda * \beta) / ((1 - \lambda) * (1 + \lambda)); if \tau > 2n + k then \beta := \beta + \lambda * \alpha; if k = 0 then begin \alpha := inR1, \xi := inR2 begin {accept multiplier from processor k-1} \lambda := \alpha / \beta; \beta := \beta - \lambda * \alpha; \eta := \eta - \lambda * \beta \xi := \text{inL1: } \alpha := \text{inL2: } \eta := \eta - \beta * \xi \xi := \eta/\beta; \alpha := 0 \lambda := inR1; \eta := inR2 I * \chi - \dot{\xi} =: \dot{\xi} II := \eta; \eta := \eta - \lambda * \xi; \beta := \beta - \lambda * \Pi; {i.e. \beta := \beta(1 - \lambda^2) - \lambda \alpha II := \alpha; \alpha := \alpha - \lambda * \beta; {II is temporarily local to processor k} \{ignore if k = n\} ``` Figure 8. Systolic processor for symmetric Toeplitz equation solver. #### ACKNOWLEDGEMENT The authors would like to thank C. Pottle for pointing out reference [16]. #### REFERENCES - [1] H. A. Ahmed, J-M. Delosme, and M. Morf, Highly concurrent computing structures for matrix arithmetic and signal processing, *IEEE Computer 15* (1982), 65–82. - [2] E. H. Barciss, Numerical solution of linear equations with Toeplitz and vector Toeplitz matrices, Numer. Math. 13 (1969), 404-424. - [3] R. R. Bitmead and B. D. O. Anderson, Asymptotically fast solution of Toeplitz and related - systems of linear equations, Lin. Alg. Applics. 34 (1980), 103-116. [4] R. P. Brent, F. G. Gustavson, and D. Y. Y. Yun, Fast solution of Toeplitz systems of equations and computation of Pade approximants, J. Algorithms 1 (1980), 259-295. - [5] R. P. Brent and F. T. Luk, A systolic architecture for almost linear-time solution of the symmetric eigenvalue problem, Technical Report TR-CS-82-10, Department of Computer Science, Australian National University (1982). - [6] A. Buckley, On the solution of certain skew symmetric linear systems, SIAM J. Numer. Anal. 14 (1977), 566-570. - [7] G. Cybenko, The numerical stability of the Levinson-Durbin algorithm for Toeplitz systems of equations, SIAM J. Sci. Statist. Comput. 1 (1980), 303-320. - [8] P. Delsarte, Y. Genin, and Y. Kamp, Schur parameterization of positive definite block- - Toeplitz systems, SIAM J. Appl. Math 36 (1979), 34-36. [9] P. Dewilde, A. Vicira, and T. Kailath, On a generalized Szegu-Levinson realization algorithm for optimal linear predictors based on a network synthesis approach, *IEEE Trans. Circuits and Systems CAS-25* (1978), 663–675. - [10] J. Durbin, The fitting of time-series models, Rev. Inst. Internat. Statist. 28 (1960), 233-244. - [12] J. Grear and A. Sanreh, On certain parallel linear system solvers, SIAM J. Sci. Statist. Com-[11] I. C. Gohberg and A. A. Semencul, On the inversion of finite Teoplitz matrices and their continuous analogs, *Mat. Listed 2* (1972), 201 –233. (In Russian) - put. 2 (1981), 238-256. [13] J. H. Justice, The Szego recursion relation and inverses of positive definite Toeplitz matrices, SIAM J. Math. Anal. 5 (1974), 503-508. - [14] T. Kailath, A. Vieira, and M. Morf, Inverses of Toeplitz operators, innovations, and orthog- - onal polynomials, SIAM Review 20 (1978), 106-119. - [15] H. T. Kung and C. E. Leiserson, Algorithms for VLSI processor arrays, in *Introduction to VLSI Systems* (by C. Mead and L. Conway), Addison-Wesley, Reading, Massachusetts - [16] S. Y. Kung and Y. H. Hu, Fast and parallel algorithms for solving Toeplitz systems, Proc. Internat. Symp, on Mini- and Micro-computers in Control and Measurement, San Francisco, California (May 1981). - [17] S. Y. Kung, Impact of VLSI on modern signal processing, Proc. USC Workshop on VLSI and Modern Signal Processing, Los Angeles, California (November 1982), 123–132. - [18] H. Lev-Ari and T. Kailath, Schur and Levinson algorithms for nonstationary processes, Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing, New Jersey (1981), 860-864. - [19] N. Levinson, The Wiener RMS (root mean square) error criterion in filter design and predic tion, J. Math. Phys. 25 (1947), 261-278. - [20] M. Morf, Doubling algorithms for Toeplitz and related equations, Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing, Denver, Colorado (April 1980) - [21] M. Morf and J.M. Delosme, Matrix decompositions and inversions via elementary signature-orthogonal transformations, Proc. Internat. Symp. on Mini- and Micro-computers in Control and Measurement, San Francisco, California (May 1981). [22] J. G. Nash and G. R. Nudd, Concurrent VLSI architectures for two-dimensional signal pro- - cessing systems, Proc. USC Workshop on VLSI and Modern Signal Processing, Los - Angeles, California (November 1982), 166–173. [23] J. Schur, Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind, J. für die Reine und Angewandte Mathematik 147 (1917), 205–232. [24] H. Sexton, An analysis of an algorithm of Bitmead and Anderson for the inversion of Toe- - plitz systems, manuscript. [25] J. M. Speiser and H. J. Whitehouse, Architectures for real-time matrix operations, *Proc.* - Government Microcircuit Applications Conf., Houston, Texas (1980). [26] D. R. Sweet, Numerical methods for Toeplitz matrices, Ph.D. Thesis, Department of Computer Science, University of Adelaide, May 1982. [27] G. Szego, Orthogonal Polynomials, third edition, AMS Colloquium Pub. vol. 23, AMS, - Providence, Rhode Island (1967). - [28] W. F. Trench, An algorithm for the inversion of finite Toeplitz matrices, J. Soc. Indust. Appl. Math 12 (1964), 515-522. [29] S. Zohar, Toeplitz matrix inversion: the algorithm of W. F. Trench, J. Assoc. Comput. Mach. 16 (1969), 592-601.