Analysis of Cascading Compression Algorithms

Gilbert H. Young1, Kwok-Shing Cheng1
1Department of Computing The Hong Kong Polytechnic University Hung Hom, Kowloon Hong Kong

Abstract

The Huffman coding scheme is a character-based algorithm in which every leaf node represents a character only. In this paper, we study three variations of the Huffman coding scheme for compressing 16-bit Chinese language. Although it is observed that IDC can generate the shortest code length among the three variations, but its empirical compression ratio is below 1.8, which is unsatisfactory. In order to achieve higher compression performance, i.e., compression ratio over 2, word-based compression algorithms should be employed. A possible way to develop word-based algorithms is to use the technique of cascading. Two kinds of algorithms are chosen for cascading. They are LZ algorithms and the Huffman coding scheme. LZ algorithms are used for finding repeating phrases while the Huffman coding scheme is used for encoding the phrases instead of characters. The experimental results show that the cascading algorithm of LZSSPDC outperforms a famous UNIX cascading compressor GZIP by 5\% on average.