JPG 與離散餘弦轉換

Written byKalanKalan
💡

If you have any questions or feedback, pleasefill out this form

This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.

The JPG format is familiar to everyone, yet it hides numerous compression techniques worth learning about. Witnessing how predecessors have employed various clever engineering strategies to compress images makes it truly exciting to live in this era.

JPEG

Strictly speaking, JPEG is not a file format but an algorithm. The file format is defined by JFIF.

Let's start by examining how JPEG stores color.

Scientists have found that humans are much more sensitive to brightness than to color. Therefore, when selecting a color space, YCrCb is used to store brightness.

Discrete Cosine Transform

If JPEG has any algorithm that magically compresses images, the most important contributor behind it is the Discrete Cosine Transform (DCT). How does image compression relate to the Discrete Cosine Transform?

High and Low Frequencies in Images

To simplify and make the Discrete Cosine Transform easier to understand, let's use a 1D array as an example. If we view an image as a signal, using pixel grayscale values as the height on the y-axis, it can be represented like this:

What are the advantages of viewing an image as a signal? Just like other signals, we can process it mathematically, where signals can be represented by polynomials.

The formula for the Discrete Cosine Transform is as follows:

Xk=k=0n1xkcos[πnm(k+12)]X_{k} = \sum_{k=0}^{n-1} x_k \cos \left[\frac{\pi}{n} m \left(k+\frac{1}{2}\right) \right]

Let’s take a look at what happens after processing:

For this formula, we can interpret the transformed signal as

the proportion of various cosine frequencies present in the image

Alternatively, we can think of it as "any image can be synthesized from cosine waves of different frequencies." For humans, low frequencies represent the signals we want; we can comfortably reduce the high-frequency coefficients of the original signal, even setting them to zero without significantly affecting the original image.

2-D Discrete Cosine Transform

The formula for the 2D Discrete Cosine Transform is more complex, but the underlying principle remains the same. The image is first divided into 8x8 blocks, which are then transformed into DCT coefficients.

img

Quantization Table

If we directly save the DCT coefficient matrix, the original image remains uncompressed. In practice, we multiply the DCT by a quantization table to obtain a new DCT coefficient matrix, where most of the high-frequency coefficients are zero. The JPG algorithm splits the image into several 8x8 blocks and calculates the DCT for these blocks.

When choosing the compression ratio, the coefficients in the quantization table will vary. If there's no compression, the elements in the quantization table will all be 1 (retaining every pixel in the image).

Zig-zag

After multiplying the DCT coefficient matrix by the quantization table, most of the high-frequency elements are 0. The upper left of the matrix represents low frequencies, while moving towards the lower right indicates higher frequencies. To prioritize low-frequency numbers during image compression, we traverse the matrix as shown below.

img

For example, if the matrix after multiplication with the quantization table looks like this:

[4569135010000000]\begin{bmatrix} 4 & 5 & 6 & 9 \\ 1 & 3 & 5 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}

Then, during encoding, it would be transformed into:

[4, 5, 1, 1, 3, 6, 9, 5, 0, 0, 0, 0, 0, 0, 0, 0]

This is done for easier compression later on.

Huffman Coding

Huffman coding is a lossless data compression algorithm that effectively encodes strings. In JPG encoding, a significant portion of the elements are 0. Huffman coding assigns shorter codes to frequently occurring characters and longer codes to less common characters. In JPG, the DCT matrix is stored after being encoded with Huffman coding.

Actual JPEG Decoding

The JPEG decoding process unfolds as follows:

  • Extract the Huffman coding table from the file
  • Extract the DCT coefficient matrix from the file
  • Multiply the inverse DCT matrix by the quantization table coefficients to obtain the pixel value for the corresponding image location
  • Convert YCbCr to RGB

Here’s a simple JavaScript implementation that references this article (Understanding and Decoding a JPEG Image using Python). You can view the complete code on Codepen, which renders the decoded results onto a canvas.

Note that this implementation does not use putImageData but reads the JPG file's content and renders it pixel by pixel onto the canvas. (Of course, in actual development, we wouldn’t do it this way.)

function parseJPG(data) {
  const hfTables = {};
  const quantizationTables = {};
  let quantMapping;
  let d = new Uint8Array(data);
  let w;
  let h;
  while (true) {
    const marker = new Uint8Array(d.slice(0, 2));
    // console.log(marker[0].toString(16), marker[1].toString(16));
    if (marker[0] === 0xff && marker[1] === 0xd8) {
      console.log("start of image");
      d = d.slice(2);
    } else if (marker[0] === 0xff && marker[1] === 0xd9) {
      console.log("end of image");
      return;
    } else {
      const lenchunk = d.slice(2, 4);
      let len = (lenchunk[0] << 8) + lenchunk[1] + 2;

      const chunk = d.slice(4, len);

      if (marker[0] === 0xff && marker[1] === 0xc4) {
        // console.log(d, chunk);
        const { table, header } = decodeHuffman(chunk);
        hfTables[header] = table;
      } else if (marker[0] === 0xff && marker[1] === 0xdb) {
        const { header, table } = buildQuantizationTables(chunk);
        quantizationTables[header] = table;
      } else if (marker[0] === 0xff && marker[1] === 0xc0) {
        // start of frame
        const { result, width, height } = baseDCT(chunk);
        quantMapping = result;
        w = width;
        h = height;
      } else if (marker[0] === 0xff && marker[1] === 0xda) {
        // console.log(quantMapping, quantizationTables);
        len = startOfScan({
          data: d,
          hdrlen: len,
          width: w,
          height: h,
          quantizationTables,
          quantMapping,
          hfTables,
        });
      }

      d = d.slice(len);
    }
    if (d.length === 0) {
      break;
    }
  }
}

Within a JPG file, there are many blocks, which are identified by a 0xff marker:

  • 0xffc4 for the Huffman coding table
  • 0xffda for the start of scan
  • 0xffdb for the quantization table
  • 0xffd8 for the JPG file start marker
  • 0xffd9 for the JPG file end marker

After converting the binary to the corresponding data (Huffman coding table, quantization table), the appropriate calculations can yield the matrix!

Upon testing, it was found that the decoding wasn't successful, likely due to issues in the code. If you look at the implementation, you’ll see that the calculations can indeed be quite tedious. Due to time constraints, I’m sharing this for reference!

failure

(The above seems to render the correct colors in some areas, but the rest looks quite green.)

Conclusion

In practical applications, GPUs or CPUs typically have dedicated circuits for image encoding and decoding, so additional programming is usually unnecessary. However, the algorithms behind JPG successfully combine the Discrete Cosine Transform, Huffman coding, and zig-zag traversal to effectively reduce image storage space. I really appreciate how what we often take for granted contains so much valuable knowledge worth exploring.

If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨

Buy me a coffee