A linear algorithm for data compression
98. R. P. Brent, A linear algorithm for data compression,
Australian Computer Journal 19, 2 (May 1987), 64-68.
Presented at the tenth Australian Computer Science Conference,
Deakin University, Geelong, February 1987. Preliminary version
appeared as Technical Report TR-CS-86-10, Computer Sciences Laboratory,
Australian National University, November 1986, 9 pp.
Paper: pdf (796K).
Abstract
We describe an efficient algorithm for data compression.
The algorithm finds maximal common substrings in the input data using
a simple hashing scheme, and repeated substrings are encoded using
Huffman coding. Time and space requirements are proportional to the size
of the input data. A modification which uses a bounded buffer is also
described. The algorithm is simpler than comparable linear-time algorithms
and gives excellent compression ratios on both text and binary files.
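To make the abstract's description concrete, here is a minimal sketch (in Python rather than the Pascal of the SLH implementation) of the hashing-and-extension idea: a short prefix of the current position is looked up in a table to find an earlier occurrence, the candidate match is extended greedily, and the resulting literal/copy tokens would then be passed to the entropy-coding phase (Huffman coding in the paper), which is omitted here. The two-character keys, the Python dict standing in for the paper's hashing scheme, and the token format are assumptions for illustration only, not the algorithm as published.

    # Illustrative sketch only -- not the SLH algorithm from the paper.
    # A dict stands in for the paper's hashing scheme; the prefix length (2)
    # and the ('lit', ...)/('copy', ...) token format are arbitrary choices.
    def toy_match(data: bytes, prefix_len: int = 2):
        table = {}      # short prefix -> most recent position where it occurred
        tokens = []     # ('lit', byte) or ('copy', distance, length)
        i = 0
        while i < len(data):
            key = data[i:i + prefix_len]
            j = table.get(key) if len(key) == prefix_len else None
            table[key] = i                  # remember the newest occurrence
            if j is None:
                tokens.append(('lit', data[i]))
                i += 1
            else:
                # extend the candidate match as far as it goes
                length = 0
                while i + length < len(data) and data[j + length] == data[i + length]:
                    length += 1
                tokens.append(('copy', i - j, length))
                i += length
        return tokens

    if __name__ == "__main__":
        # In the real algorithm these tokens would then be Huffman-coded.
        print(toy_match(b"abcabcabcabcd"))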
Comments
1. The paper presents some comparisons of an implementation (SLH) of
the algorithm with straightforward Huffman coding (HUF),
the "move to front" (MTF) algorithm, and one of the
Ziv-Lempel algorithms (LZ78).
For example, on a typical text file (the Pascal source of the SLH program),
compression ratios were 1.75, 2.29, 2.48 and 3.18 for
HUF, MTF, LZ78 and SLH respectively.
2. The last phase of SLH uses Huffman coding, but this could
(and probably should) be replaced by arithmetic coding
(Witten, Neal and Cleary, Communications of the ACM 30 (1987),
520-541). At the time of writing the paper I was not familiar with
arithmetic coding. (A textbook sketch of Huffman-code construction is
given after these comments for comparison.)
3. Rod Worley (personal communication, 5 July 1988) pointed out that the
algorithm in the paper does not always find a maximal matching substring.
For example, with
stringmatchesXmatchestrinGstring
entering the right substrings of the portion "matchest" (the second
"matches" together with the "t" that follows it) after the match of
"matches" overwrites the initial hash-table entry for "st". The final
"string" is then compared with "strinG", so we obtain only a "strin" match
and not a "string" match. Similarly, with
matchesmatchesatch
the final "atch" is matched with the first "atch" and not the second.
In practice such examples seldom arise and have only a small effect
on the compression ratio.
However, it would be nice to overcome such problems
and obtain an algorithm that runs in linear time and is guaranteed to find
the longest matching substrings, without abandoning hashing and
reverting to algorithms in the style of
Weiner (1973), McCreight (1976), or Rodeh, Pratt and Even (1981)
[references are given in the paper].
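To see concretely how such an overwrite can happen, here is a toy variant of the matcher sketched after the abstract, modified so that the table is also updated at the positions inside an already-matched region (a crude stand-in for entering the right substrings of a match). The two-character keys, the dict in place of the paper's hashing scheme, and the exact update rule are assumptions for illustration, not the SLH code; with them, the entry for "st" at position 0 is overwritten by the "st" inside "matchest", and the final "string" is reported as only a five-character "strin" match.

    # Toy illustration only -- not the SLH algorithm. Two-character keys and a
    # Python dict stand in for the paper's hashing scheme; after a match, the
    # positions inside the matched region are also entered, overwriting any
    # older entries for the same key.
    def greedy_match(data: str, prefix_len: int = 2):
        table, tokens, i = {}, [], 0
        while i < len(data):
            key = data[i:i + prefix_len]
            j = table.get(key) if len(key) == prefix_len else None
            if j is None:
                table[key] = i
                tokens.append((data[i], "literal", None))
                i += 1
            else:
                length = 0
                while i + length < len(data) and data[j + length] == data[i + length]:
                    length += 1
                tokens.append((data[i:i + length], "matched at", j))
                # this loop is what loses the entry "st" -> 0 in Worley's example
                for k in range(i, i + length):
                    table[data[k:k + prefix_len]] = k
                i += length
        return tokens

    for token in greedy_match("stringmatchesXmatchestrinGstring"):
        print(token)
    # The output ends with ('strin', 'matched at', 20) followed by a literal 'g':
    # only "strin" is matched, because the key "st" now points at the "st" inside
    # "matchest" rather than at the "st" of the leading "string".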
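Regarding comment 2, the following is a minimal, textbook construction of a Huffman code, included only as a reminder of what the final phase does; it is not the coder inside SLH, and the alphabet of literal and match symbols actually coded there is not shown. An arithmetic coder in the style of Witten, Neal and Cleary would replace this step; because it is not restricted to a whole number of bits per symbol, it can usually get closer to the source entropy.

    # Textbook Huffman-code construction; illustrative only, not the SLH coder.
    import heapq
    from collections import Counter

    def huffman_codes(symbols) -> dict:
        """Return a prefix-free binary code (symbol -> bit string)."""
        freq = Counter(symbols)
        if not freq:
            return {}
        if len(freq) == 1:                  # degenerate case: a single symbol
            return {next(iter(freq)): "0"}
        # heap entries: (total frequency, tie-breaker, {symbol: code bits so far})
        heap = [(f, n, {sym: ""}) for n, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + bits for s, bits in c1.items()}
            merged.update({s: "1" + bits for s, bits in c2.items()})
            heapq.heappush(heap, (f1 + f2, n, merged))
            n += 1
        return heap[0][2]

    if __name__ == "__main__":
        for sym, bits in sorted(huffman_codes("abracadabra").items()):
            print(sym, bits)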