Write a JSON parser from scratch (1)

Written byKalanKalan
💡

If you have any questions or feedback, pleasefill out this form

This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.

Recursive descent can be described as an intuitive and powerful parsing method.

Today, we'll start with parsing JSON and explain how to build a JSON parser from scratch. Due to the simplicity of JSON's structure, it serves as a great practice exercise. Although parsing syntax may not have much relevance to front-end or regular development, these skills can come in handy if you ever need to design a DSL or create a custom language for specific requirements.

Why Learn Recursive Descent?

With JSON.parse readily available, why bother learning recursive descent? By analyzing syntax simply, you can not only parse JSON data but also implement your own syntax. This technique can be applied in various other contexts.

For example, a typical JSON looks like this:

{
  "name": "kalan",
  "age": 20
}

Suppose you want to introduce a new syntax called "template," where any string wrapped in {} will automatically be replaced with a variable. For instance:

{
  "name": "kalan",
  "age": 20,
	"job": $job$
}

You could parse it using a custom parse function:

parse(json, { job: 'engineer' });

// {
//   "name": "kalan",
//   "age": 20,
//   "job": "engineer"
// }

Or if you want to change the syntax to use a new symbol:

{
  "name" @ "kalan"
  "age" @ 20
}

Today, we’ll write a recursive descent parser that can parse JSON, along with modifications to support the two custom features mentioned above.

What is Recursive Descent?

When discussing recursive descent, it's easy to get lost in a maze of intimidating terms and symbols, such as LL, LR, top-down, non-terminal, etc. Let's try to discuss it in the most straightforward way. First, take a look at the diagram:

json-tree-2

This diagram represents the structure of JSON, which can primarily be composed of key + : + value. The value can further be broken down into strings, booleans, numbers, null, objects, etc.

Notice that if the value is an object, the same rules can be applied recursively, meaning we can go back to the top of the diagram and continue parsing with the same rules.

This continuous cycle is known as recursion, and the method of querying from top to bottom to match rules is called top-down parsing; together, they form recursive descent.

For detailed definitions of each type and syntax, we can refer to the JSON Specification:

Screenshot_2020-05-10 JSON

Let’s start with the simplest example: a single key and value, where the value is of string type.

{
  "name": "kalan"
}

json-simple

Here's the process (refer to the diagram above):

  • Upon encountering {, start object parsing.
  • Enter key parsing, expecting a string.
  • Upon encountering ", start string parsing.
  • When encountering n, a, m, e: treat name as the key.
  • Upon encountering ", finish string matching, expecting : to follow the rules.
  • Upon encountering :, enter value parsing.
  • Upon encountering ", enter string parsing.
  • When encountering k, a, l, a, n: set the value to kalan.
  • Upon encountering ", end string matching.
  • Upon encountering }, finish object matching.
  • Upon encountering an empty string (EOF), end parsing.

From this process, we can see that the parsing process resembles a series of state changes. Every time we encounter a matching value, we follow the rules in the diagram to determine whether to move to the next box. Each box can accept different values. Successfully moving from left to right indicates a successful match.

Now, let’s try implementing this.

A Simple Parser Implementation

A simple parser can be achieved using a pointer, position information, and the number of characters currently read.

We will use the parse function as the entry point, executing the corresponding functions based on the current state.

First, let’s define the functions we will use:

class Parser {
  constructor(raw) {
    this.raw = raw;
    this.index = 0;
  }

  // Current character
  current() {
    return this.raw[this.index];
  }

  // Check if the next string matches str
  // e.g: next('{')
  // Check if the next character is '{'
  // If so, move the index forward
  // and return true
  // This is useful for confirming the state
  next(str) {
    if (this.raw.slice(this.index, this.index + str.length) === str) {
      this.index += str.length;
      return true;
    }

    return false;
  }

  // Skip whitespace, currently only considering " "
  // Not implementing tab or other functionalities
  skip() {
    while (this.raw[this.index] <= " ") {
      this.index += 1;
    }
  }

  parse() {
    const value = object(this); // entry point
    return value;
  }
}

The main entry point can be found in parse, while we define some helper functions, such as skip and next, to help us confirm the current state. Now, let’s look at the implementation of the object function:

function object(parser) {
  const obj = Object.create(null);
  let key = "";
	// If it starts with {, we use the object function to parse
  if (parser.current() === "{") {
    parser.next("{");
    parser.skip();
    
    // If we immediately encounter }
    // it indicates an empty object {}
    if (parser.current() === "}") {
      parser.next("}");
      return obj;
    }
    
    // Read the current character to determine which function to use for parsing
    while (parser.current()) {
      key = string(parser); // key must be a string, so we use string parsing

      parser.skip(); // Skip whitespace
      parser.next(":"); // Expecting :
      obj[key] = string(parser); // Assuming the value is only a string for now
      parser.skip();
      if (parser.current() === "}") { // If encountering }, the object parsing ends, return obj
        parser.next("}");
        return obj;
      }
      parser.next(","); // If there's a comma, it indicates the next key-value pair, continue parsing
      parser.skip();
    }
  }
}
  • If the current character is {

    • If the next character is }, it indicates an empty object, so we return and exit.
  • Enter the while loop, using string to parse the key (since JSON keys must be strings). We will explain the implementation of the string function below.

    • Skip whitespace (as in the following case):
    { "name"         : "value" }
    • Expect : to appear.
    • Parse the value using string (for now, we will temporarily assume it’s only a string and will later add support for other types).
    • Skip whitespace.
    • If encountering }, it indicates the end of the object, so we return obj.
    • If encountering ,, it means we need to continue parsing the next key-value pair.
    • Skip whitespace.

Implementation of string parsing:

function string(parser) {
  parser.next('"'); // If the next character is ", we will parse as a string
  if (parser.next('"')) {
    // { "": "value" } We treat an empty JSON key as invalid
    throw new Error("JSON key is empty.");
  }

  let key = "";
  let curr = "";
  // Continuously read the next character until we hit "
  while (((curr = parser.current()), curr)) {
    if (parser.next('"')) {
      parser.index += 1;
      return key;
    }
    key += curr;
    parser.index += 1;
  }
}
  • Expecting the character to be "
  • If the next character is " it indicates an empty string, which is an invalid key, so it throws an error.
  • Enter the while loop until the character " is read.

If you want to practice, you can refer to the Repo.

Writing a parser manually can be a bit cumbersome, and sometimes there might be minor issues that are overlooked. I recommend writing tests from the start, which can save a lot of manual testing time.

Conclusion

Although we've implemented a recursive descent parser, we did not go through the process of converting it into an AST (Abstract Syntax Tree); instead, we directly converted it into a JavaScript object. This is because the data structure of JSON is relatively simple and does not require this conversion.

Additionally, when implementing a parser, consider a few other perspectives:

  1. To make error messages more user-friendly, perhaps we can record information in a two-dimensional manner using row and column instead of one-dimensional.
  2. Implement more helpful error messages, for example, notifying developers when they forget to add a }, or clearly indicating where a syntax error occurs.

To maintain flexibility, you can also replace symbols with variables (Tokens).

In part 2, we will continue to implement other types, such as numbers, arrays, booleans, null, etc., to make it a more complete JSON parser. We will also begin implementing the two previously mentioned custom features: using @ as a separator and {} as a template.

If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨

Buy me a coffee