CSci 3501 Algorithms and Computability - Lab 13.

Due Wednesday, December 7 at 11:59pm

What to submit and when:

Lab assignment

Work in pairs

Lab overview and goals

The goal of the lab is to explore Huffman encoding as a greedy algorithm.

Implementing Huffman encoding (30 points)

Huffman encoding is an approach to data encoding that minimizes storage by assigning shorter codes to more frequent letters and longer codes to more rare ones. The encoding is known as prefix-free which means that no letter code is a prefix of any other one. For instance, if the most frequent letter is encoded as 0 then no other letter code starts with 0. Thus decoding is unambiguous at every point. Huffman codes are described in section 16.3 (p. 428) in CLRS and on the wikipedia page Huffman encoding is an optimal encoding (i.e. it produces the shortest encoding of 0s and 1s) for a text that follows the specified letter frequencies.

As an example, consider an alphabet of 5 letters with frequencies given below:

Letter: a b c d e
Frequency, %: 51 22 15 7 5

The algorithms for finding Huffman code forms a binary tree in which nodes represent either a single letter or a combination of letters. Initially all letters are placed in a min-priority queue, where the key is the frequency. The algorithms proceeds by combining two nodes with the least frequencies into one that has the combined frequency, and adding it back into the queue. The process continues until there is only one combined node which is the root of the trree. Then each left branch in the tree is denoted by 0 and the right branch by 1. The codes for each letter are the paths to them in the tree.
The textbook or the compression section of the wikipedia page provide more details.

The encoding for the example above is:

Letter: a b c d e
Encoding: 1 00 011 0101 0100

As an exercise, encode the strings ace and add according to the above encoding. Then decode the strings 101011 and 0010101. Note how prefix-free encoding makes decoding easy.

Tasks for the lab

Your goal is to write a program that reads in an alphabet with letter frequencies and produces Huffman codes for this letters. It then allows you to encode and decode texts in this alpahbet using the Huffman encoding you have constructed. More specifically:

  1. Your program must read data in the format (on one line):
          a 51 b 22 c 15 d 7 e 5
    The frequences are assumed to be in decreasing order and add up to 100%. New line signals end of input.
  2. It should output encodings for each letter.
  3. It should then take inputs in the format:
          e add
    Where the first letter is "e" for "encode" and "d" for "decode", and the following word is what you would like to encode or decode. Spaces and other symbols outside the alphabet should be ignored.
  4. Entering letter x on the line by itself terminates the program.

When submitting your program, please explain what its efficiency is for finding Huffman codes (where n is the number of letters in the alphabet). Don't forget to include the efficiency of data structures that you are using, if any.

What to submit

CSci 3501 course web site.

The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.