File Compression

I. Introduction
The main objective of this page is to introduce two important lossless compression algorithms: Huffman Coding and Lempel-Ziv Coding. A Huffman encoder takes a block of input characters with fixed length and produces a block of output bits of variable length; it is a fixed-to-variable length code. Lempel-Ziv, on the other hand, is a variable-to-fixed length code. The design of the Huffman code is optimal (for a fixed block length) assuming that the source statistics are known a priori. The Lempel-Ziv code is not designed for any particular source but for a large class of sources. Surprisingly, for any fixed stationary and ergodic source, the Lempel-Ziv algorithm performs just as well as if it had been designed for that source. Mainly for this reason, the Lempel-Ziv code is the most widely used technique for lossless file compression.
II. Huffman Coding
The basic idea in Huffman coding is to assign short codewords to those input blocks with high probabilities and long codewords to those with low probabilities. This concept is similar to that of the Morse code.

A Huffman code is designed by merging together the two least probable characters, and repeating this process until there is only one character remaining. A code tree is thus generated and the Huffman code is obtained from the labeling of the code tree. An example of how this is done is shown below.

Sources: data-compression.com, howstuffworks.com

[Huffman design animation]

The final static code tree is given below:

[Final Huffman code tree]

  • It does not matter how the characters are arranged. I have arranged it above so that the final code tree looks nice and neat.
  • It does not matter how the final code tree is labeled (with 0s and 1s). I chose to label the upper branches with 0s and the lower branches with 1s.
  • There may be cases where there is a tie for the two least probable characters. In such cases, any tie-breaking procedure is acceptable.
  • Huffman codes are not unique.
  • Huffman codes are optimal in the sense that no other lossless fixed-to-variable length code has a lower average rate.
  • The rate of the above code is 2.94 bits/character.
  • The entropy lower bound is 2.88 bits/character.
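To make the merging procedure concrete, here is a minimal Python sketch of the same construction. The character probabilities below are made-up placeholders, not the ones used in the animation above, and the function name is my own; it simply merges the two least probable nodes until one node remains and then reads the codewords off the tree.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for a dict mapping character -> probability."""
    # Each heap entry is (probability, tie-breaker, node); a node is either
    # a single character or an (upper, lower) pair of sub-trees.
    heap = [(p, i, ch) for i, (ch, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable nodes (ties broken arbitrarily).
        p1, _, n1 = heapq.heappop(heap)
        p2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, counter, (n1, n2)))
        counter += 1
    # Label upper branches with 0 and lower branches with 1.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

# Placeholder source statistics, for illustration only.
probs = {"a": 0.35, "b": 0.25, "c": 0.20, "d": 0.15, "e": 0.05}
codes = huffman_code(probs)
rate = sum(probs[ch] * len(codes[ch]) for ch in probs)
print(codes)                                     # one codeword per character
print(f"average rate = {rate:.2f} bits/character")
```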

III. Lempel-Ziv Coding
The Lempel-Ziv algorithm is a variable-to-fixed length code. Basically, there are two versions of the algorithm presented in the literature: the theoretical version and the practical version. Theoretically, both versions perform essentially the same. However, the proof of the asymptotic optimality of the theoretical version is easier, while the practical version is easier to implement and slightly more efficient. We describe the practical version of the algorithm, as presented in the book by Gersho and Gray and in the paper by Welch.

The basic idea is to parse the input sequence into non-overlapping blocks of different lengths while constructing a dictionary of blocks seen thus far.

Encoding Algorithm

  1. Initialize the dictionary to contain all blocks of length one (D={a,b}).
  2. Search for the longest block W which has appeared in the dictionary.
  3. Encode W by its index in the dictionary.
  4. Add W followed by the first symbol of the next block to the dictionary.
  5. Go to Step 2.
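A minimal Python sketch of these five steps is shown below. The function name and the test string are my own, the dictionary is initialized with the two-letter alphabet {a, b} from Step 1, and the dictionary-size limit discussed in the notes below is ignored for simplicity.

```python
def lzw_encode(data, alphabet=("a", "b")):
    """Encode `data` as a list of dictionary indices (Steps 1-5 above)."""
    # Step 1: initialize the dictionary with all blocks of length one.
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    indices = []
    w = ""
    for ch in data:
        if w + ch in dictionary:
            # Step 2: keep extending W while it is still in the dictionary.
            w += ch
        else:
            # Step 3: encode W by its index in the dictionary.
            indices.append(dictionary[w])
            # Step 4: add W followed by the next symbol to the dictionary.
            dictionary[w + ch] = len(dictionary)
            # Step 5: go back to Step 2, starting from the current symbol.
            w = ch
    if w:
        indices.append(dictionary[w])  # flush the final block
    return indices, dictionary

indices, final_dict = lzw_encode("abbaabbaababbaaaabaabba")
print(indices)      # the sequence of indices sent to the decoder
print(final_dict)   # the dictionary built up during encoding
```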

The following example illustrates how the encoding is performed.

[Lempel-Ziv encoding animation]

  • Theoretically, the size of the dictionary can grow infinitely large.
  • In practice, the dictionary size is limited. Once the limit is reached, no more entries are added. Welch had recommended a dictionary of size 4096. This corresponds to 12 bits per index.
  • The length of the index may vary. For example, in the n-th block, the current dictionary size is n+1 (assuming that the limit has not been reached). Therefore, we can encode the index of block n using [log2(n+1)] bits (rounded up to the nearest integer). For example, the first index can be represented by 1 bit, the 2nd and 3rd by 2 bits each, the 4th, 5th, 6th, and 7th by 3 bits each, and so on. This is the variable-to-variable length version of the Lempel-Ziv algorithm.
  • For a maximum dictionary size of 2^m, variable-length encoding of the indices saves exactly 2^m - m - 1 bits compared to fixed-length encoding (the first index saves m-1 bits, the next two save m-2 bits each, and so on).
  • The above example, as most other illustrative examples in the literature, does not result in real compression. Actually, more bits are used to represent the indices than the original data. This is because the length of the input data in the example is too short.
  • In practice, the Lempel-Ziv algorithm works well (i.e., leads to actual compression) only when the input data is sufficiently large and there is sufficient redundancy in the data.
  • Decompression works in the reverse fashion. The decoder knows that the last symbol of the most recent dictionary entry is the first symbol of the next parsed block. This knowledge is used to resolve the ambiguity that can arise when the encoder uses a dictionary entry on the very step after creating it (see the decoder sketch after this list).
  • Many popular programs (e.g. Unix compress and uncompress, gzip and gunzip, and Windows WinZip) are based on the Lempel-Ziv algorithm.
  • The name “Ziv-Lempel” is more appropriate than “Lempel-Ziv” (see Gersho and Gray and the name ordering in References 3 and 4 below).
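Here is the decoder sketch promised above, again in Python with the same two-letter alphabet. It mirrors the encoder sketch given after the algorithm: it rebuilds the dictionary as it reads the indices, and the special branch handles the ambiguity mentioned above, where an index refers to the entry the encoder created on the step just before.

```python
def lzw_decode(indices, alphabet=("a", "b")):
    """Rebuild the original string from a list of dictionary indices."""
    # The decoder starts from the same initial dictionary as the encoder.
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    prev = dictionary[indices[0]]
    output = [prev]
    for idx in indices[1:]:
        if idx in dictionary:
            entry = dictionary[idx]
        else:
            # The index is not in the dictionary yet, so it must refer to the
            # entry the encoder just created: the previous block followed by
            # its own first symbol.
            entry = prev + prev[0]
        output.append(entry)
        # Mirror the encoder: add "previous block + first symbol of the
        # current block" to the dictionary.
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(output)

# [0, 1, 2, 4, 3] is what the encoder sketch above produces for "ababababa";
# index 4 refers to an entry created on the previous step, so the special
# branch is exercised.
print(lzw_decode([0, 1, 2, 4, 3]))   # prints "ababababa"
```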

IV. Notes

  • The following table compares an adaptive version of the Huffman code (implemented in the Unix “compact” program) with an implementation of the Lempel-Ziv algorithm (the Unix “compress” program). Each entry gives the size of the compressed file as a percentage of the original file:

                   Adaptive Huffman   Lempel-Ziv
    LaTeX file           66%              44%
    Speech file          65%              64%
    Image file           94%              88%
  • The large text file described in the Statistical Distributions of English Text (containing the seven classic books with a 27-letter English alphabet) has a compression ratio of 36.3% (original size=5086936 bytes, compressed size=1846919 bytes, using the Linux “gzip” program). This corresponds to a rate of 2.9 bits/character — compared with the entropy rate of 2.3 bits/character predicted by Shannon. This loss of optimality is most likely due to the finite dictionary size. A quick check of these two numbers is given below.
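As a sanity check, both figures follow directly from the quoted file sizes, assuming one character per byte:

```python
original_bytes = 5086936     # size of the original text file
compressed_bytes = 1846919   # size of the gzip-compressed file

ratio = compressed_bytes / original_bytes
bits_per_char = 8 * compressed_bytes / original_bytes   # one character per byte

print(f"compression ratio = {ratio:.1%}")           # about 36.3%
print(f"rate = {bits_per_char:.1f} bits/character") # about 2.9
```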

File Compression: How It Works

If you download many programs and files off the Internet, you’ve probably encountered ZIP files before. This compression system is a very handy invention, especially for Web users, because it lets you reduce the overall number of bits and bytes in a file so it can be transmitted faster over slower Internet connections, or take up less space on a disk. Once you download the file, your computer uses a program such as WinZip or Stuffit to expand the file back to its original size. If everything works correctly, the expanded file is identical to the original file before it was compressed.

Most types of computer files are fairly redundant — they have the same information listed over and over again. File-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.

As an example, let’s look at a type of information we’re all familiar with: words.

In John F. Kennedy’s 1961 inaugural address, he delivered this famous line:

“Ask not what your country can do for you — ask what you can do for your country.”

The quote has 17 words, made up of 61 letters, 16 spaces, one dash and one period. If each letter, space or punctuation mark takes up one unit of memory, we get a total file size of 79 units. To get the file size down, we need to look for redundancies.

Immediately, we notice that:

  • “ask” appears two times
  • “what” appears two times
  • “your” appears two times
  • “country” appears two times
  • “can” appears two times
  • “do” appears two times
  • “for” appears two times
  • “you” appears two times

Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words — ask, not, what, your, country, can, do, for, you — give us almost everything we need for the entire quote. To construct the second half of the phrase, we just point to the words in the first half and fill in the spaces and punctuation.

Redundancy and Algorithms

Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. “LZ” refers to Lempel and Ziv, the algorithm’s creators, and “dictionary” refers to the method of cataloging pieces of data.

The system for arranging dictionaries varies, but it could be as simple as a numbered list. When we go through Kennedy’s famous words, we pick out the words that are repeated and put them into the numbered index. Then, we simply write the number instead of writing out the whole word.

So, if this is our dictionary:

  1. ask
  2. what
  3. your
  4. country
  5. can
  6. do
  7. for
  8. you

Our sentence now reads:  “1 not 2 3 4 5 6 7 8 — 1 2 8 5 6 7 3 4”

If you knew the system, you could easily reconstruct the original phrase using only this dictionary and number pattern. This is what the expansion program on your computer does when it expands a downloaded file. You might also have encountered compressed files that open themselves up. To create this sort of file, the programmer includes a simple expansion program with the compressed file. It automatically reconstructs the original file once it’s downloaded.

But how much space have we actually saved with this system? “1 not 2 3 4 5 6 7 8 — 1 2 8 5 6 7 3 4” is certainly shorter than “Ask not what your country can do for you — ask what you can do for your country,” but keep in mind that we need to save the dictionary itself along with the file.

In an actual compression scheme, figuring out the various file requirements would be fairly complicated; but for our purposes, let’s go back to the idea that every character and every space takes up one unit of memory. We already saw that the full phrase takes up 79 units. Our compressed sentence (including spaces) takes up 37 units, and the dictionary (words and numbers) also takes up 37 units. This gives us a file size of 74, so we haven’t reduced the file size by very much.
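If you want to reproduce these unit counts, the short Python sketch below rebuilds the numbered dictionary and recomputes the sizes exactly as the text counts them: one unit per character, with the dash and the period counted as one unit each (a plain "-" stands in for the dash).

```python
# The 17 words of the quote; the dash and the period are counted separately.
phrase_words = ["Ask", "not", "what", "your", "country", "can", "do", "for", "you",
                "ask", "what", "you", "can", "do", "for", "your", "country"]
original_units = sum(len(w) for w in phrase_words) + 16 + 1 + 1   # letters + spaces + dash + period

# The numbered dictionary from the text (ignoring capitalization).
dictionary = ["ask", "what", "your", "country", "can", "do", "for", "you"]

def encode(word):
    # Replace a dictionary word with its 1-based index; leave other words alone.
    return str(dictionary.index(word.lower()) + 1) if word.lower() in dictionary else word

tokens = [encode(w) for w in phrase_words[:9]] + ["-"] + [encode(w) for w in phrase_words[9:]]
compressed = " ".join(tokens)
dictionary_units = sum(len(w) + 1 for w in dictionary)   # each word plus its one-digit index

print(compressed)          # 1 not 2 3 4 5 6 7 8 - 1 2 8 5 6 7 3 4
print(original_units)      # 79 units in the original phrase
print(len(compressed))     # 37 units in the compressed sentence
print(dictionary_units)    # 37 units in the dictionary, 74 in total
```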

But this is only one sentence! You can imagine that if the compression program worked through the rest of Kennedy’s speech, it would find these words and others repeated many more times. And, as we’ll see in the next section, it would also be rewriting the dictionary to get the most efficient organization possible.

In our previous example, we picked out all the repeated words and put those in a dictionary. To us, this is the most obvious way to write a dictionary. But a compression program sees it quite differently: It doesn’t have any concept of separate words — it only looks for patterns. And in order to reduce the file size as much as possible, it carefully selects which patterns to include in the dictionary.

If we approach the phrase from this perspective, we end up with a completely different dictionary.

If the compression program scanned Kennedy’s phrase, the first redundancy it would come across would be only a couple of letters long. In “ask not what your,” there is a repeated pattern of the letter “t” followed by a space — in “not” and “what.” If the compression program wrote this to the dictionary, it could write a “1” every time a “t” was followed by a space. But in this short phrase, this pattern doesn’t occur enough to make it a worthwhile entry, so the program would eventually overwrite it.

The next thing the program might notice is “ou,” which appears in both “your” and “country.” If this were a longer document, writing this pattern to the dictionary could save a lot of space — “ou” is a fairly common combination in the English language. But as the compression program worked through this sentence, it would quickly discover a better choice for a dictionary entry: Not only is “ou” repeated, but the entire words “your” and “country” are both repeated, and they are actually repeated together, as the phrase “your country.” In this case, the program would overwrite the dictionary entry for “ou” with the entry for “your country.”

The phrase “can do for” is also repeated, one time followed by “your” and one time followed by “you,” giving us a repeated pattern of “can do for you.” This lets us write 15 characters (including spaces) with one number value, while “your country” only lets us write 13 characters (with spaces) with one number value, so the program would overwrite the “your country” entry as just “r country,” and then write a separate entry for “can do for you.” The program proceeds in this way, picking up all repeated bits of information and then calculating which patterns it should write to the dictionary. This ability to rewrite the dictionary is the “adaptive” part of LZ adaptive dictionary-based algorithm. The way a program actually does this is fairly complicated, as you can see by the discussions on Data-Compression.com.

No matter what specific method you use, this in-depth searching system lets you compress the file much more efficiently than you could by just picking out words. Using the patterns we picked out above, and adding “__” for spaces, we come up with this larger dictionary:

  1. ask__
  2. what__
  3. you
  4. r__country
  5. __can__do__for__you 

And this smaller sentence: “1not__2345__–__12354”

The sentence now takes up 18 units of memory, and our dictionary takes up 41 units. So we’ve compressed the total file size from 79 units to 59 units! This is just one way of compressing the phrase, and not necessarily the most efficient one. (See if you can find a better way!)
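To check that this pattern-based dictionary is still lossless, the small Python sketch below expands the compressed sentence back into the original phrase. The "__" markers from the text are written as ordinary spaces here, and a plain "-" again stands in for the dash.

```python
# The pattern dictionary from above, with real spaces instead of "__".
dictionary = {
    "1": "ask ",
    "2": "what ",
    "3": "you",
    "4": "r country",
    "5": " can do for you",
}

compressed = "1not 2345 - 12354"

# Expand: replace each digit with its dictionary entry, copy everything else.
expanded = "".join(dictionary.get(ch, ch) for ch in compressed)
print(expanded)
# ask not what your country can do for you - ask what you can do for your country
```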

So how good is this system? The file-reduction ratio depends on a number of factors, including file type, file size and compression scheme.

In most languages of the world, certain letters and words often appear together in the same pattern. Because of this high rate of redundancy, text files compress very well. A reduction of 50 percent or more is typical for a good-sized text file. Most programming languages are also very redundant because they use a relatively small collection of commands, which frequently go together in a set pattern. Files that include a lot of unique information, such as graphics or MP3 files, cannot be compressed much with this system because they don’t repeat many patterns (more on this in the next section).

If a file has a lot of repeated patterns, the rate of reduction typically increases with file size. You can see this just by looking at our example — if we had more of Kennedy’s speech, we would be able to refer to the patterns in our dictionary more often, and so get more out of each entry’s file space. Also, more pervasive patterns might emerge in the longer work, allowing us to create a more efficient dictionary.

This efficiency also depends on the specific algorithm used by the compression program. Some programs are particularly suited to picking up patterns in certain types of files, and so may compress them more succinctly. Others have dictionaries within dictionaries, which might compress efficiently for larger files but not for smaller ones. While all compression programs of this sort work with the same basic idea, there is actually a good deal of variation in the manner of execution. Programmers are always trying to build a better system.

Algorithm Used: Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is simple to implement and has the potential for very high throughput in hardware implementations. It was the algorithm of the widely used Unix file compression utility compress, and is used in the GIF image format.

Lots More Information

http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

For more about file compression, see https://codeprogs.wordpress.com/file-compression/

Sources: howstuffworks.com

