Write a JSON parser from scratch (2)

In part1, we mentioned how to write a JSON parser and implemented the string parsing functionality. Now let's add the remaining functions. (In fact, once you understand the basic principles, implementing the remaining functions is just following the same pattern.)

Number

json-grammer

Implementing the number function is not difficult. The tricky parts are handling decimal points, negative numbers, floating-point numbers, and exponential notation (e.g., 1e6). (By the way, I just realized that E can also be used.)

function number(parser) {
  let str = "";
  if (parser.current() === "-") {
    str += "-";
    parser.index += 1;
  }

  let curr = "";
  while (((curr = parser.current()), curr >= "0" && curr <= "9")) {
    str += curr;
    parser.index += 1;
  }

  let isFloat = false;
  // float number
  if (parser.next(".")) {
    str += ".";
    isFloat = true;
    while (((curr = parser.current()), curr >= "0" && curr <= "9")) {
      str += curr;
      parser.index += 1;
    }
  }

  // exponential expression
  let expo = "";
  if (parser.next("e")) {
    curr = "";
    if (parser.next("-")) {
      expo += "-";
    }

    while (((curr = parser.current()), curr >= "0" && curr <= "9")) {
      expo += curr;
      parser.index += 1;
    }
  }

  if (expo) {
    return isFloat
      ? parseFloat(str, 10) * Math.pow(10, +expo)
      : parseInt(str, 10) * Math.pow(10, +expo);
  }

  return isFloat ? parseFloat(str, 10) : parseInt(str, 10);
}

In the first part, we check if it is a negative number.
In the second part, we run a while loop to extract the number part.
In the third part, we check if there is a decimal point.
- If there is, we extract the decimal part.
In the fourth part, we check if there is an exponential expression (using uppercase or lowercase e).
- If there is, we extract the exponential part.
In the fifth part, we convert the string to a number (using parseInt or parseFloat).

Keywords (true, false, null)

function keyword(parser) {
  if (parser.next("true")) {
    return true;
  } else if (parser.next("false")) {
    return false;
  } else if (parser.next("null")) {
    return null;
  }
}

This part is simple. We check if the value matches any of the keywords.

Array

json-grammer

function array(parser) {
  const arr = [];

  if (parser.current() === "[") {
    parser.next("[");
    parser.skip();

    if (parser.next("]")) {
      return arr;
    }
    let i = 0;
    while (parser.current()) {
      const val = value(parser);
      arr.push(val);

      parser.skip();

      if (parser.current() === "]") {
        parser.next("]");
        return arr;
      }
      parser.next(",");
      parser.skip();
    }
  }

  return arr;
}

In the first part, we check if it starts with [.
- If we encounter ], it means it is an empty array.
In the second part, we run a while loop and call the value function to parse and add values to the array.
If we encounter ], it means the array ends, so we return the array.
If we encounter a comma, it means there is another element, so we continue parsing.

This is almost it. You can take a look at the implementation in the Repository to see the code in action. You will notice that one of the tests, specical-character, fails because there might be escape characters in the string. Let's try to implement it.

const escape = {
  '"': '"',
  t: "\t",
  r: "\r",
  "\\": "\\",
};

while (((curr = parser.current()), curr)) {
    if (parser.next('"')) {
      return str;
    } else if (curr === "\\") {
      parser.index += 1;
      const escapeCh = parser.current();
      if (escape[escapeCh]) {
        str += escape[escapeCh];
      }
    } else {
      str += curr;
    }
    parser.index += 1;
  }

We create an escape character table and replace the corresponding characters with their actual values. Here, we only implemented \t and \r. This way, we pass the basic JSON tests 🍻. However, besides the mentioned escape characters, we also need to implement \u, which represents Unicode. This functionality is quite important.

Custom Feature: $template$

Since we are writing our own parser, we can add new syntax! Let's say we want to implement a template feature, where any variable enclosed in $$ will be replaced with the corresponding value from the passed object. For example:

{
  "name": $name$
}

will become:

new Parser(string, { name: 'kalan' }).parse();
// { name: "kalan" }

Implementation

function template(parser) {
  parser.skip();
  if (parser.next("$")) {
    parser.skip();

    if (parser.next("$")) {
      throw new Error("template can not be empty");
    }
    let curr = "";
    let key = "";
    while (((curr = parser.current()), curr)) {
      if (parser.next("$")) {
        return parser.variables[key];
      }
      key += curr;
      parser.index += 1;
    }
  }
}

First, we match $.
Then, we start reading the content until we see $.
When we encounter $, we stop the while loop and replace the template variable with the corresponding variable value, and return the result.

You can check the detailed implementation in the template branch. Take a look at the test results in the test/template folder, too.

Conclusion

By writing our own parser, we can express complex implementations more easily in terms of syntax. We can even extend existing grammars (like JSON in this case) with our desired functionality. Although parsing itself is important and interesting, parsing a language is just the first step. It's like converting JSX into JavaScript code without the help of React, which wouldn't be useful; or transforming SQL into an abstract syntax tree without implementing a database, which would be a bit pointless. The purpose of parsing a language is to facilitate further processing (executing queries, rendering to the DOM).

In fact, there are now many libraries that help you skip the parsing part, such as the famous Bison or PEG.js, which automatically generate a stable parser for you based on a BNF-like syntax, saving you time in parsing and allowing you to focus directly on language implementation.

Our JSON parser in this case doesn't convert to an abstract syntax tree and then generate the final result. So, in the next stage, we will try to parse simple HTML and convert it into a syntax tree, then render it using JavaScript's DOM API.