Recursive descent can be described as an intuitive and powerful parsing method.
Today, we will start with parsing JSON and explain how to build a JSON parser from scratch. JSON has a simple structure, making it suitable for practice. Although parsing syntax is not directly related to frontend or regular development, these techniques can come in handy when designing a DSL or creating a custom language for specific requirements.
Why Learn Recursive Descent?
Why learn recursive descent when we already have JSON.parse
? By analyzing the syntax, we can not only parse JSON data but also implement our own syntax and apply these techniques elsewhere.
For example, a typical JSON looks like this:
{
"name": "kalan",
"age": 20
}
Suppose we want to introduce a new syntax called templates, where any string enclosed in {}
will be automatically replaced with a variable, like this:
{
"name": "kalan",
"age": 20,
"job": $job$
}
By using a customized parse
function:
parse(json, { job: 'engineer' });
// {
// "name": "kalan",
// "age": 20,
// "job": "engineer"
// }
Or suppose we want to use a different symbol:
{
"name" @ "kalan"
"age" @ 20
}
Today, we will write a recursive descent parser that can parse JSON and modify it to include the two aforementioned custom features.
What is Recursive Descent?
When talking about recursive descent, we often encounter various scary terms and symbols, such as LL, LR, top-down, non-terminal, etc. Let's try to discuss it in the most intuitive way. Let's look at the diagram:
We represent the structure of JSON in the form of a diagram. We can see that JSON is mainly composed of key
+ :
+ value
, and the value can be further divided into strings, booleans, numbers, null, objects, etc.
If you notice that when the value is an object, the same rules can be applied again, which means going back to the top of the diagram and continuing the parsing with the same rules.
This recursive looping is called recursion, and the process of matching rules from top to bottom is called top-down. Together, they are known as recursive descent.
For detailed definitions of each type and syntax, we can refer to the JSON specification:
Let's start with a simple example: an object with only one key-value pair, where the value is a string.
{
"name": "kalan"
}
The following steps (refer to the diagram):
- Encountering
{
, indicating the start of object parsing. - Entering
key
parsing, expecting a string. - Encountering
"
, starting string parsing. - Encountering n, a, m, e: treating "name" as the key.
- Encountering
"
, ending string matching, expecting:
to comply with the rules. - Encountering
:
, enteringvalue
parsing. - Encountering
"
, starting string parsing. - Encountering k, a, l, a, n: setting the value as "kalan".
- Encountering
"
, ending string matching. - Encountering
}
, ending object matching. - Encountering an empty string (EOF), ending parsing.
From the process, we can observe that a parsing process resembles a series of state changes. Whenever we encounter a matching value, we follow the rules in the diagram to determine whether to proceed to the next box. Each box can accept different values. If we can successfully go from left to right, it indicates a successful match.
Next, let's try implementing it.
Implementation of a Simple Parser
A simple parser can be achieved using a pointer, position information, and the number of characters read.
We use the parse
function as the entry point of the program, executing the corresponding functions based on the current state.
First, let's define the functions we will use:
class Parser {
constructor(raw) {
this.raw = raw;
this.index = 0;
}
// Current character
current() {
return this.raw[this.index];
}
// Check if the matched string after this is str
// e.g: next('{')
// Check if the next character is '{'
// If it is, move the index accordingly
// and return true to indicate the state
next(str) {
if (this.raw.slice(this.index, this.index + str.length) === str) {
this.index += str.length;
return true;
}
return false;
}
// Skip whitespace, currently only considering " "
// Tab and other functionalities are not implemented
skip() {
while (this.raw[this.index] <= " ") {
this.index += 1;
}
}
parse() {
const value = object(this); // entry point
return value;
}
}
The main entry point is the parse
function. We also define some helper functions, such as skip
and next
, to facilitate checking the current state. Now let's look at the implementation of the object
function:
function object(parser) {
const obj = Object.create(null);
let key = "";
// If it starts with '{', it means we need to parse using the object function
if (parser.current() === "{") {
parser.next("{");
parser.skip();
// If we immediately encounter '}', it represents an empty object {}
if (parser.current() === "}") {
parser.next("}");
return obj;
}
// Read the current character and determine which function to use for parsing
while (parser.current()) {
key = string(parser); // The key must be a string, so we use the string parser
parser.skip(); // Skip whitespace characters
parser.next(":"); // Expecting a ':'
obj[key] = string(parser); // Assuming the value is always a string for now
parser.skip();
if (parser.current() === "}") { // If '}' is encountered, it indicates the end of object parsing, so we return obj
parser.next("}");
return obj;
}
parser.next(","); // If there is a comma, it means there is another key-value pair, so continue parsing in the while loop
parser.skip();
}
}
}
-
If the current character is
{
- If the next character is
}
, it represents an empty object, so we return obj and exit.
- If the next character is
-
Enter the while loop and use
string
to parse the key (since JSON keys must be strings). We will explain the implementation of thestring
function below.- Skip whitespace characters (like in the following example)
{ "name" : "value" }
- Expect
:
to appear - Parse the value using the
string
function (for now, let's assume the value is always a string) - Skip whitespace characters
- If
}
is encountered, it indicates the end of object parsing, so we return obj - If
,
is encountered, it means we need to continue parsing the next key-value pair - Skip whitespace characters
Implementation of string parsing:
function string(parser) {
parser.next('"'); // If the next character is ", it means we need to parse as a string
if (parser.next('"')) {
// We consider { "": "value" } as an invalid JSON
throw new Error("JSON key is empty.");
}
let key = "";
let curr = "";
// Continuously read the next character until we encounter "
while (((curr = parser.current()), curr)) {
if (parser.next('"')) {
parser.index += 1;
return key;
}
key += curr;
parser.index += 1;
}
}
- Expect the character to be
"
- If the next character is
"
it represents an empty string, which is an invalid key, so we throw an error - Enter the while loop until we read the character
"
If you want to practice, you can refer to the Repo.
Writing a parser by hand can be a bit tricky, and sometimes we may miss handling certain edge cases. I would suggest writing tests from the beginning if you want to practice, as it will save a lot of time spent on manual testing.
Conclusion
Although we have implemented a recursive descent parser, we haven't converted it into an Abstract Syntax Tree (AST) but directly into a JavaScript object. This is because the data structure of JSON is relatively simple and doesn't require further conversion.
Additionally, when implementing a parser, it is worth considering a few points:
- To make error messages more user-friendly, we can consider using row and column information instead of just a one-dimensional approach.
- Implement more user-friendly error messages, such as reminding developers when they forget to add
}
or indicating the exact location of syntax errors.
To maintain flexibility, you can also replace symbols with variables (Tokens).
In part 2, we will continue implementing other types, such as numbers, arrays, booleans, null, etc., to make it a more complete JSON parser. We will also start implementing the two custom features mentioned earlier: using @
as a separator and using {}
as templates.