What to submit and when:
Work in pairs
The goal of the lab is to explore Huffman encoding as a greedy algorithm.
Huffman encoding is an approach to data encoding that minimizes storage by assigning shorter codes to more frequent letters and longer codes to more frequent ones. The encoding is known as prefix-free which means that no letter code is a prefix of any other one. For instance, if the most frequent letter is encoded as 0 then no other letter code starts with 0. Thus decoding is unambiguous at every point. Huffman codes are described in section 16.3 (p. 428) in CLRS and on the wikipedia page https://en.wikipedia.org/wiki/Huffman_coding Huffman encoding is an optimal encoding (i.e. it produces the shortest encoding of 0s and 1s) for a text that follows the specified letter frequencies.
As an example, consider an alphabet of 5 letters with frequencies given below:
Letter: | a | b | c | d | e |
Frequency, %: | 51 | 22 | 15 | 7 | 5 |
The algorithms for finding Huffman code forms a binary tree in which nodes represent either a single letter or a combination of letters. Initially all letters are placed in a min-priority queue, where the key is the frequency. The algorithms proceeds by combining two nodes with the least frequencies into one that has the combined frequency, and adding it back into the queue. The process continues until there is only one combined node which is the root of the trree. Then each left branch in the tree is denoted by 0 and the right branch by 1. The codes for each letter are the paths to them in the tree.
The textbook or the compression section of the wikipedia page provide more details.
The encoding for the example above is:
Letter: | a | b | c | d | e |
Encoding: | 1 | 00 | 011 | 0101 | 0100 |
As an exercise, encode the strings ace
and add
according to the above encoding. Then decode the strings 101011
and 0010101
. Note how prefix-free encoding makes decoding easy.
Your goal is to write a program that reads in an alphabet with letter frequencies and produces Huffman codes for this letters. It then allows you to encode and decode texts in this alpahbet using the Huffman encoding you have constructed. More specifically:
a 51 b 22 c 15 d 7 e 5The frequences are assumed to be in decreasing order and add up to 100%. New line signals end of input.
e addWhere the first letter is "e" for "encode" and "d" for "decode", and the following word is what you would like to encode or decode. Spaces and other symbols outside the alphabet should be ignored.
When submitting your program, please explain what its eficiency is for finding Huffman codes (where n is the number of letters in the alphabet). Don't forget to include the efficiency of data structures that you are using, if any.
The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.