7 min read

Huffman Coding — Saying More with the Fewest Bits

This article was auto-translated from Chinese. Some nuances may be lost in translation.

We send images, audio, and text over the internet every day, and these files are almost always compressed before transmission. How does compression actually work? Why can a compressed file be restored to its original form?

In the previous article about JPEG compression, we mentioned that Huffman coding is the final step in the JPEG compression pipeline. In this article, we want to take a deeper look at Huffman coding itself.

Frequency and Encoding

Suppose you want to send a piece of text that contains only four letters: A, B, C, and D. The most intuitive approach is to use fixed-length binary codes:

Character   Encoding
A           00
B           01
C           10
D           11

Each character takes 2 bits, simple and clear. But what if A appears 1000 times, while D appears only once? Using codes of the same length to represent characters with such different frequencies feels a bit wasteful.

This is the problem Huffman coding aims to solve:

Use shorter codes for characters that appear more frequently, and longer codes for those that appear less frequently.

This idea is very similar to Morse code: in English, the most common letter, E, is represented by a single dot (.), while the less common Q takes four symbols (--.-).

Information Content and Entropy

Before discussing compression, let’s first ask: how should we measure the “information content” of an event?

Suppose someone tells you “the sun will rise in the east tomorrow.” You wouldn’t feel like you learned anything new — this is almost certain to happen. But if someone says “it will hail tomorrow,” you would be very surprised, because the probability of that happening is extremely low.

In 1948, Claude Shannon proposed a mathematical definition of information content. For an event with probability of occurrence p, its information content is:

I = -\log_2(p)

The lower the probability, the greater the information content. That’s intuitive — the more unexpected something is, the more information it carries. The entropy of an information source is the average information content of all symbols:

H = -\sum_{i} p_i \log_2(p_i)

Entropy tells us: for this information source, theoretically, how many bits are needed on average to encode each symbol at minimum. The power of Huffman coding is that it can get very close to this theoretical lower bound.
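Both formulas are easy to compute directly. Here is a small sketch (the `entropy` function is an illustrative helper of ours, not a library function):

```javascript
// Shannon entropy of a frequency map, in bits per symbol.
function entropy(freqMap) {
  const total = [...freqMap.values()].reduce((a, b) => a + b, 0)
  let h = 0
  for (const count of freqMap.values()) {
    const p = count / total
    h -= p * Math.log2(p)
  }
  return h
}

// A uniform four-symbol source needs exactly 2 bits per symbol:
entropy(new Map([['A', 1], ['B', 1], ['C', 1], ['D', 1]])) // → 2
```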

The Huffman Coding Algorithm

David A. Huffman proposed this algorithm in 1952. At the time, he was a graduate student at MIT, and this was a class assignment — Professor Robert Fano posed the question: find the most efficient binary coding method. Instead of following Fano and Shannon’s top-down approach, Huffman built the solution from the bottom up.

The algorithm is quite easy to understand, simple to implement, yet astonishingly effective.

Steps

  1. Count frequencies: Calculate how many times each character appears
  2. Build a Priority Queue: Sort all characters from lowest frequency to highest
  3. Merge repeatedly: Take out the two nodes with the smallest frequencies, merge them into a new node (with frequency equal to the sum of the two), and put it back into the queue
  4. Repeat until only one remains: This node becomes the root of the Huffman tree
  5. Generate the code table: Traverse from the root; going left means 0, going right means 1

A Worked Example

Let’s use the string "aabbbcccc" as an example:

Step 1 — Count frequencies:

a: 2, b: 3, c: 4

Step 2 — Take the two smallest and merge them:

a(2) + b(3) → ab(5)

Step 3 — Take the two smallest again:

c(4) + ab(5) → root(9)

Final Huffman tree:

        (9)
      0/    \1
    c(4)     (5)
           0/    \1
         a(2)    b(3)

Code table:

Character   Encoding   Length
c           0          1
a           10         2
b           11         2

Original with ASCII encoding: 9 × 8 = 72 bits. After Huffman coding: 4 × 1 + 2 × 2 + 3 × 2 = 14 bits.
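As a sanity check, we can compare this result against Shannon's lower bound for the same string (a small illustrative calculation; numbers are rounded):

```javascript
// Frequencies from "aabbbcccc": a appears 2 times, b 3, c 4.
const freqs = [2, 3, 4]
const total = 9

// Huffman: c gets a 1-bit code, a and b get 2-bit codes.
const huffmanBits = 4 * 1 + 2 * 2 + 3 * 2 // = 14

// Entropy lower bound: H = -Σ p·log2(p), times 9 symbols.
const entropyBits =
  freqs
    .map((f) => f / total)
    .reduce((sum, p) => sum - p * Math.log2(p), 0) * total
// ≈ 13.77 bits; the 14-bit Huffman result sits almost exactly
// on the theoretical minimum.
```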

Why Huffman Coding Works So Well

Huffman coding is a prefix code (also called a prefix-free code), meaning that no codeword is a prefix of any other codeword.

For example, if A = 0 and B = 01, then when the decoder reads 01, it cannot tell whether this is “A followed by a 1” or “B”.

Huffman coding guarantees that this never happens, because all characters are at leaf nodes — the path from the root to any leaf cannot be a prefix of another path. This property also makes decoding very simple. Among all prefix codes, Huffman coding produces the shortest average code length.
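The prefix-free property is easy to check mechanically. A tiny sketch, with `isPrefixFree` as a hypothetical helper of ours:

```javascript
// Returns true when no codeword is a prefix of another.
function isPrefixFree(codes) {
  for (const a of codes) {
    for (const b of codes) {
      if (a !== b && b.startsWith(a)) return false
    }
  }
  return true
}

isPrefixFree(['0', '10', '11']) // the worked example's table → true
isPrefixFree(['0', '01'])       // A = 0, B = 01 → false
```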

Code Implementation

Here is the JavaScript implementation:

function buildTree(freqMap) {
  // Turn each (character, frequency) entry into a leaf node
  const nodes = [...freqMap.entries()]
    .map(([char, freq]) => ({ char, freq, left: null, right: null }))

  // Repeatedly merge the two lowest-frequency nodes
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq)
    const left = nodes.shift()
    const right = nodes.shift()
    nodes.push({
      char: null, // internal nodes carry no character
      freq: left.freq + right.freq,
      left,
      right,
    })
  }

  return nodes[0]
}

function buildCodeTable(node, prefix = '', table = {}) {
  if (node.char !== null) {
    // Leaf node: record the accumulated path. The '0' fallback
    // handles input with only one distinct character, where the
    // root itself is a leaf and the path is empty.
    table[node.char] = prefix || '0'
    return table
  }
  buildCodeTable(node.left, prefix + '0', table)
  buildCodeTable(node.right, prefix + '1', table)
  return table
}

buildTree repeatedly takes out the two smallest nodes and merges them until only one root node remains. buildCodeTable then recursively traverses from the root, adding 0 for each left move and 1 for each right move.
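The encoding side is then just a table lookup per character. A minimal sketch (`countFrequencies` and `encode` are our own helper names, not from the article):

```javascript
// Count character frequencies into a Map, the shape buildTree expects.
function countFrequencies(text) {
  const freqMap = new Map()
  for (const char of text) {
    freqMap.set(char, (freqMap.get(char) || 0) + 1)
  }
  return freqMap
}

// Encoding: concatenate each character's codeword.
function encode(text, table) {
  let bits = ''
  for (const char of text) bits += table[char]
  return bits
}

// With the code table from the worked example:
encode('aabbbcccc', { a: '10', b: '11', c: '0' })
// → '10101111110000' (14 bits)
```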

The decoding process is equally intuitive — start from the root node, move left when you read 0, move right when you read 1, output the corresponding character when you reach a leaf node, then return to the root and continue:

function decode(encoded, root) {
  let result = ''
  let current = root

  for (const bit of encoded) {
    current = bit === '0' ? current.left : current.right
    if (current.char !== null) {
      result += current.char
      current = root
    }
  }

  return result
}
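A quick round trip on the worked example, with the tree written out by hand and decode repeated so the snippet runs on its own:

```javascript
// The Huffman tree from the worked example, spelled out literally.
const root = {
  char: null,
  left: { char: 'c', left: null, right: null },
  right: {
    char: null,
    left: { char: 'a', left: null, right: null },
    right: { char: 'b', left: null, right: null },
  },
}

// Same decode as above, included here for a self-contained run.
function decode(encoded, root) {
  let result = ''
  let current = root

  for (const bit of encoded) {
    current = bit === '0' ? current.left : current.right
    if (current.char !== null) {
      result += current.char
      current = root
    }
  }

  return result
}

decode('10101111110000', root) // → 'aabbbcccc'
```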

Interactive: Huffman Encoder

Try entering any text below and watch the Huffman encoding result in real time. You can hover your mouse over a character in the code table, and the corresponding bits will be highlighted in the encoded output.


Try entering different text and observe a few interesting behaviors:

  • If you enter text with only one type of character (such as aaaa), each character needs only 1 bit
  • If you enter text where all characters occur with the same frequency, the code length approaches a fixed length
  • The greater the frequency differences in the text, the higher the compression ratio

Limitations of Huffman Coding

Huffman coding is not a silver bullet; it has some practical limitations:

At least 1 bit per symbol. Even if the theoretically optimal code length for a character is 0.3 bits, Huffman coding can still only assign at least 1 bit. This can cause efficiency loss under certain extreme distributions.
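A quick calculation makes the loss concrete (an illustrative two-symbol source, not from the article):

```javascript
// A source emitting 'x' with probability 0.99 and 'y' with 0.01.
// Its entropy is far below 1 bit per symbol:
const H = -(0.99 * Math.log2(0.99) + 0.01 * Math.log2(0.01))
// ≈ 0.081 bits per symbol

// Huffman must still assign whole codewords, so it spends a full
// 1 bit per symbol here, more than 12 times the entropy bound.
```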

The frequencies must be known in advance. Before encoding, the data must be scanned once to count frequencies, or a predefined frequency table must be used. In JPEG, the standard defines default Huffman tables and also allows custom ones.

The code table must be transmitted along with the data. The decoder needs to know the code table in order to restore the data, so the table itself takes up space. For small data, the overhead of the table may offset the benefits of compression.

Where Is Huffman Coding Used?

  • JPEG — the final step in image compression is Huffman coding
  • DEFLATE (ZIP, gzip) — combines LZ77 and Huffman coding
  • MP3 — Huffman coding is also used in audio compression
  • HTTP/2 — HPACK header compression uses a static Huffman table

Epilogue

Huffman coding is one of those algorithms that feels obvious once you understand it. Its core idea is extremely simple — represent common things with short codes, and rare things with long ones. But turning that idea into a general-purpose algorithm required quite a bit of clever design.

More than seventy years later, this algorithm is still alive and well in the formats we use every day.

I wonder what Huffman, Fano, and Shannon would have built with AI assistance.