· 7 min read

The Secret Behind JPEG Compression — Discrete Cosine Transform

# CS
This article was auto-translated from Chinese. Some nuances may be lost in translation.

(This article originated from the 2023 iThome Ironman Contest)

Everyone is familiar with JPG, but there are many compression techniques behind it worth learning from. Seeing how people before us used all kinds of clever engineering tricks to compress images makes me feel that living in this era is truly wonderful.

JPEG

Strictly speaking, JPEG is not actually a file format, but an algorithm. The file format is defined by JFIF (JPEG File Interchange Format).

YCbCr

Let’s start with how JPEG stores color. Generally, we use the three primary colors of light—R, G, and B—to represent various colors, but scientists have found that humans are much more sensitive to changes in brightness than to changes in color.

To make use of this innate human characteristic, we want to reduce color information as much as possible and rely more on brightness variations to represent the image. In image processing, YCbCr is often used to represent color.

Where:

  • Y is luminance
  • Cr is the red chrominance channel
  • Cb is the blue chrominance channel

So where did G go? The answer is: it was discarded. However, even if part of the information is lost, because the human eye is less sensitive to color changes, losing some of that information does not produce a very noticeable difference.

The formulas for converting from RGB to YCbCr are:

Y=0.299R+0.587G+0.114BY = 0.299R + 0.587G + 0.114B Cb=0.169R0.331G+0.500BCb = -0.169R - 0.331G + 0.500B Cr=0.500R0.419G0.081BCr = 0.500R - 0.419G - 0.081B

Discrete Cosine Transform (DCT)

If there is one algorithm behind JPEG that makes images magically compressible, the most important contributor is the Discrete Cosine Transform (DCT).

High and low frequencies in images

To simplify things and make DCT easier to understand, let’s first use a 1D array as an example. If we treat an image as a signal, and use pixel grayscale values as the height on the y-axis, then each row of pixels can be viewed as a segment of a signal.

What’s the benefit of treating an image as a signal? Just like other signals, we can perform frequency-domain analysis on it. Through a linear transform, we can identify low-frequency or high-frequency components.

For the transformed signal, we can interpret it as:

How much each cosine frequency contributes to the image

Or we can think of it this way: “Any image can be considered a combination of cosine waves at different frequencies.” For humans, low frequencies are the important signals that make up an image. We can simply reduce the high-frequency coefficients of the original signal, or even set them to 0, without affecting the original image too much.

If they can be set to 0, then no extra space is needed! That is the core idea of JPEG compression.

2D Discrete Cosine Transform

The 2D DCT formula is more complex, but the underlying principle is still the same. The image is first split into 8×8 blocks and transformed into DCT coefficients.

The formula for 2D DCT is:

F(u,v)=14C(u)C(v)x=07y=07f(x,y)cos(2x+1)uπ16cos(2y+1)vπ16F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}

where C(k)=12C(k) = \frac{1}{\sqrt{2}} when k=0k = 0, otherwise C(k)=1C(k) = 1.

The formula looks pretty intimidating, but the idea is basically asking: “Which basis patterns does this 8×8 block resemble, and by how much?”

The DCT coefficient F(u,v) is the dot product of the original image and that basis pattern — “How similar is this image to this pattern?” The more similar they are, the larger the coefficient, and the higher the weight.

Interactive: DCT visualization

The interactive tool below contains visualizations of 64 DCT basis patterns—each (u, v) corresponds to a frequency combination. The top-left corner (0,0) is the DC component (average brightness), and frequency increases toward the bottom right.

You can see which basis patterns are preserved during compression (bright) or discarded (dark); the numbers represent the order of the zig-zag scan.

After uploading an image, you can actually see the DCT transformation result, as well as reconstructed images at different compression levels.

Loading DCT visualization tool…

Quantization Table

If you simply save the DCT coefficient matrix directly, then the original image is not compressed at all.

In practice, the DCT coefficients are divided by the values in a quantization table, then rounded to the nearest integer. This process produces a new DCT coefficient matrix, and most of the high-frequency coefficients become zero due to quantization. The JPG algorithm splits the image into several 8×8 blocks and computes DCT over these blocks.

When choosing the compression ratio, the coefficients in the quantization table will also differ. If there is no compression, all elements in the quantization table will be 1 (every pixel in the image is preserved).

Zig-zag Traversal

After the DCT coefficient matrix is divided by the quantization table, most of the high-frequency elements become 0. The upper-left part of the matrix contains low frequencies, and the farther toward the bottom right, the higher the frequencies. To place low-frequency values first, the matrix is traversed in a zig-zag pattern when compressing the image file.

For example, if the quantized matrix is:

4 5 1 0
1 3 0 0
6 0 0 0
0 0 0 0

then during encoding it is actually converted into:

[4, 5, 1, 1, 3, 6, 9, 5, 0, 0, 0, 0, 0, 0, 0, 0]

The trailing zeros can then be compressed very efficiently.

Huffman Coding

Huffman coding is a lossless data compression algorithm that can encode data very efficiently. When JPG is encoded, a large portion of the elements are 0. Huffman coding assigns shorter codes to symbols that appear more frequently and longer codes to symbols that appear less frequently. In JPG, after the DCT matrix undergoes quantization and zig-zag traversal, it is stored using Huffman coding.

Decoding JPEG in practice

The process of decoding JPEG is basically the reverse of the steps above:

  1. Extract the Huffman coding table from the file and decode the data
  2. Restore the zig-zag order into an 8×8 matrix
  3. Multiply the DCT coefficients by the quantization table to obtain approximate DCT coefficients
  4. Compute the inverse DCT (IDCT) to restore pixel values
  5. Convert YCbCr back to RGB

JPG files contain many segments, which are identified by 0xff markers:

  • 0xffd8 — JPG file start marker (SOI)
  • 0xffc0 — Start of Frame, containing image width, height, and other information
  • 0xffc4 — Huffman coding table (DHT)
  • 0xffdb — Quantization table (DQT)
  • 0xffda — Start of Scan, the actual image data
  • 0xffd9 — JPG file end marker (EOI)

Below is a simple JavaScript implementation for parsing JPG markers:

function parseJPG(data) {
  const hfTables = {}
  const quantizationTables = {}
  let quantMapping
  let d = new Uint8Array(data)
  let w, h

  while (d.length > 0) {
    const marker = d.slice(0, 2)

    if (marker[0] === 0xff && marker[1] === 0xd8) {
      d = d.slice(2) // SOI
      continue
    }

    if (marker[0] === 0xff && marker[1] === 0xd9) {
      break // EOI
    }

    const len = (d[2] << 8) + d[3] + 2
    const chunk = d.slice(4, len)

    switch (marker[1]) {
      case 0xc4: { // DHT
        const { table, header } = decodeHuffman(chunk)
        hfTables[header] = table
        break
      }
      case 0xdb: { // DQT
        const { header, table } = buildQuantizationTables(chunk)
        quantizationTables[header] = table
        break
      }
      case 0xc0: { // SOF
        const { result, width, height } = baseDCT(chunk)
        quantMapping = result
        w = width
        h = height
        break
      }
      case 0xda: { // SOS
        startOfScan({ data: d, hdrlen: len, width: w, height: h,
          quantizationTables, quantMapping, hfTables })
        break
      }
    }

    d = d.slice(len)
  }
}

Afterword

In practical applications, GPUs and CPUs already have circuits for image encoding and decoding, so there is usually no need to write your own code to handle it. However, the algorithm behind JPG combines discrete cosine transform, Huffman coding, and zig-zag traversal, successfully reducing image storage space.

I really like how things we usually take for granted reveal so much worth learning once we dig into them. JPEG compression allows us to transmit hundreds of millions of images on the internet every day, and behind it all is the perfect combination of mathematics, information theory, and the characteristics of human vision.