Kalan's Blog

Software Engineer / Taiwanese / Life in Fukuoka

Current Theme light

Recursive descent can be described as an intuitive and powerful parsing method.

Today, we will start with parsing JSON and explain how to build a JSON parser from scratch. JSON has a simple structure, making it suitable for practice. Although parsing syntax is not directly related to frontend or regular development, these techniques can come in handy when designing a DSL or creating a custom language for specific requirements.

Why Learn Recursive Descent?

Why learn recursive descent when we already have JSON.parse? By analyzing the syntax, we can not only parse JSON data but also implement our own syntax and apply these techniques elsewhere.

For example, a typical JSON looks like this:

{
  "name": "kalan",
  "age": 20
}

Suppose we want to introduce a new syntax called templates, where any string enclosed in {} will be automatically replaced with a variable, like this:

{
  "name": "kalan",
  "age": 20,
  "job": $job$
}

By using a customized parse function:

parse(json, { job: 'engineer' });

// {
//   "name": "kalan",
//   "age": 20,
//   "job": "engineer"
// }

Or suppose we want to use a different symbol:

{
  "name" @ "kalan"
  "age" @ 20
}

Today, we will write a recursive descent parser that can parse JSON and modify it to include the two aforementioned custom features.

What is Recursive Descent?

When talking about recursive descent, we often encounter various scary terms and symbols, such as LL, LR, top-down, non-terminal, etc. Let's try to discuss it in the most intuitive way. Let's look at the diagram:

json-tree-2

We represent the structure of JSON in the form of a diagram. We can see that JSON is mainly composed of key + : + value, and the value can be further divided into strings, booleans, numbers, null, objects, etc.

If you notice that when the value is an object, the same rules can be applied again, which means going back to the top of the diagram and continuing the parsing with the same rules.

This recursive looping is called recursion, and the process of matching rules from top to bottom is called top-down. Together, they are known as recursive descent.

For detailed definitions of each type and syntax, we can refer to the JSON specification:

Screenshot_2020-05-10 JSON

https://cdn.kalan.dev/images/4wcuSKDv3vaiaXRhgSnKgu.jpg

Let's start with a simple example: an object with only one key-value pair, where the value is a string.

{
  "name": "kalan"
}

json-simple

The following steps (refer to the diagram):

  • Encountering {, indicating the start of object parsing.
  • Entering key parsing, expecting a string.
  • Encountering ", starting string parsing.
  • Encountering n, a, m, e: treating "name" as the key.
  • Encountering ", ending string matching, expecting : to comply with the rules.
  • Encountering :, entering value parsing.
  • Encountering ", starting string parsing.
  • Encountering k, a, l, a, n: setting the value as "kalan".
  • Encountering ", ending string matching.
  • Encountering }, ending object matching.
  • Encountering an empty string (EOF), ending parsing.

From the process, we can observe that a parsing process resembles a series of state changes. Whenever we encounter a matching value, we follow the rules in the diagram to determine whether to proceed to the next box. Each box can accept different values. If we can successfully go from left to right, it indicates a successful match.

Next, let's try implementing it.

Implementation of a Simple Parser

A simple parser can be achieved using a pointer, position information, and the number of characters read.

We use the parse function as the entry point of the program, executing the corresponding functions based on the current state.

First, let's define the functions we will use:

class Parser {
  constructor(raw) {
    this.raw = raw;
    this.index = 0;
  }

  // Current character
  current() {
    return this.raw[this.index];
  }

  // Check if the matched string after this is str
  // e.g: next('{')
  // Check if the next character is '{'
  // If it is, move the index accordingly
  // and return true to indicate the state
  next(str) {
    if (this.raw.slice(this.index, this.index + str.length) === str) {
      this.index += str.length;
      return true;
    }

    return false;
  }

  // Skip whitespace, currently only considering " "
  // Tab and other functionalities are not implemented
  skip() {
    while (this.raw[this.index] <= " ") {
      this.index += 1;
    }
  }

  parse() {
    const value = object(this); // entry point
    return value;
  }
}

The main entry point is the parse function. We also define some helper functions, such as skip and next, to facilitate checking the current state. Now let's look at the implementation of the object function:

function object(parser) {
  const obj = Object.create(null);
  let key = "";
  // If it starts with '{', it means we need to parse using the object function
  if (parser.current() === "{") {
    parser.next("{");
    parser.skip();

    // If we immediately encounter '}', it represents an empty object {}
    if (parser.current() === "}") {
      parser.next("}");
      return obj;
    }

    // Read the current character and determine which function to use for parsing
    while (parser.current()) {
      key = string(parser); // The key must be a string, so we use the string parser

      parser.skip(); // Skip whitespace characters
      parser.next(":"); // Expecting a ':'
      obj[key] = string(parser); // Assuming the value is always a string for now
      parser.skip();
      if (parser.current() === "}") { // If '}' is encountered, it indicates the end of object parsing, so we return obj
        parser.next("}");
        return obj;
      }
      parser.next(","); // If there is a comma, it means there is another key-value pair, so continue parsing in the while loop
      parser.skip();
    }
  }
}
  • If the current character is {

    • If the next character is }, it represents an empty object, so we return obj and exit.
  • Enter the while loop and use string to parse the key (since JSON keys must be strings). We will explain the implementation of the string function below.

    • Skip whitespace characters (like in the following example)
    { "name"         : "value" }
    • Expect : to appear
    • Parse the value using the string function (for now, let's assume the value is always a string)
    • Skip whitespace characters
    • If } is encountered, it indicates the end of object parsing, so we return obj
    • If , is encountered, it means we need to continue parsing the next key-value pair
    • Skip whitespace characters

Implementation of string parsing:

function string(parser) {
  parser.next('"'); // If the next character is ", it means we need to parse as a string
  if (parser.next('"')) {
    // We consider { "": "value" } as an invalid JSON
    throw new Error("JSON key is empty.");
  }

  let key = "";
  let curr = "";
  // Continuously read the next character until we encounter "
  while (((curr = parser.current()), curr)) {
    if (parser.next('"')) {
      parser.index += 1;
      return key;
    }
    key += curr;
    parser.index += 1;
  }
}
  • Expect the character to be "
  • If the next character is " it represents an empty string, which is an invalid key, so we throw an error
  • Enter the while loop until we read the character "

If you want to practice, you can refer to the Repo.

Writing a parser by hand can be a bit tricky, and sometimes we may miss handling certain edge cases. I would suggest writing tests from the beginning if you want to practice, as it will save a lot of time spent on manual testing.

Conclusion

Although we have implemented a recursive descent parser, we haven't converted it into an Abstract Syntax Tree (AST) but directly into a JavaScript object. This is because the data structure of JSON is relatively simple and doesn't require further conversion.

Additionally, when implementing a parser, it is worth considering a few points:

  1. To make error messages more user-friendly, we can consider using row and column information instead of just a one-dimensional approach.
  2. Implement more user-friendly error messages, such as reminding developers when they forget to add } or indicating the exact location of syntax errors.

To maintain flexibility, you can also replace symbols with variables (Tokens).

In part 2, we will continue implementing other types, such as numbers, arrays, booleans, null, etc., to make it a more complete JSON parser. We will also start implementing the two custom features mentioned earlier: using @ as a separator and using {} as templates.

Prev

How to do smooth scroll with one line of CSS

Next

Write a JSON parser from scratch (2)

If you found this article helpful, please consider buy me a drink ☕️ It'll make my ordinary day shine✨

Buy me a coffee

作者

Kalan 頭像照片,在淡水拍攝,淺藍背景

愷開 | Kalan

Hi, I'm Kai. I'm Taiwanese and moved to Japan in 2019 for work. Currently settled in Fukuoka. In addition to being familiar with frontend development, I also have experience in IoT, app development, backend, and electronics. Recently, I started playing electric guitar! Feel free to contact me via email for consultations or collaborations or music! I hope to connect with more people through this blog.