Kalan's Blog


Please note that most posts are currently translated automatically by AI and may contain confusing passages. I will gradually replace them with proper translations.

Explaining the regular expressions that add thousands separators to a number

When displaying currency values, we often need to present the raw number in a more human-readable form, such as:

  • 1234567 → 1,234,567
  • 10000 → 10,000

In frontend development, there are several ways to achieve this:

  • Using Intl.NumberFormat (may require a polyfill for older browsers)
  • Using regular expressions with .replace

There have been many discussions on this topic on StackOverflow, and one of the most popular ones is probably this post: How to print a number with commas as thousands separators in JavaScript

There are multiple solutions, but they generally fall into these two patterns:

const reg1 = /\B(?=(\d{3})+$)/
const reg2 = /(\d)(?=(\d{3})+$)/

This article will attempt to explain the differences between these two regular expressions and their execution. Finally, we will test their performance.

Introduction

Before we begin, there are a few important concepts that need to be understood: positive lookahead, negative lookahead, and word boundary. These concepts are not commonly encountered when learning regular expressions, but they are actually quite powerful.

Positive Lookahead and Negative Lookahead

In regular expressions, positive lookahead is written (?=...). For example, a(?=b) matches a only if it is followed by the letter b. Note that the lookahead itself does not become part of the match, so this regular expression matches only a.

[Figure: positive lookahead]

As shown in the above diagram, only a is matched by the regular expression.

A lookahead can contain any valid regular expression. For example, ,(?=(?:\d{3})+$) matches a comma only if it is followed by one or more groups of three digits that run to the end of the string.

[Figure: positive lookahead 2]

Negative lookahead, denoted by ?!, is the opposite of positive lookahead. For example, a(?!b) matches a only if it is not followed by the letter b.
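The two kinds of lookahead can be tried out directly. This is a minimal sketch; the variable names are only for illustration.

```javascript
// Positive lookahead: match "a" only when the next character is "b".
const positive = /a(?=b)/;
const positiveMatch = "ab".match(positive);
console.log(positiveMatch[0]);     // "a" (the "b" is checked but not consumed)
console.log(positive.test("ac"));  // false

// Negative lookahead: match "a" only when the next character is NOT "b".
const negative = /a(?!b)/;
console.log(negative.test("ac"));  // true
console.log(negative.test("ab"));  // false

// A lookahead can hold a longer pattern, such as the comma example above.
const commaBeforeGroups = /,(?=(?:\d{3})+$)/;
console.log(commaBeforeGroups.test("1,000")); // true
console.log(commaBeforeGroups.test("1,00"));  // false
```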

It is important to note that both positive lookahead and negative lookahead are zero-length expressions, meaning they do not consume any characters in the match. If you use (?=a) without adding any characters, the match will have a length of 0.

[Figure: zero-length match produced by (?=a)]

The match succeeds, but its length is 0: it is a position between characters rather than a character.
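A zero-length match is easiest to see by inspecting the match object. The sketch below uses the sample string "banana", which is just an illustrative choice.

```javascript
// (?=a) alone matches the empty position right before an "a".
const zeroWidth = /(?=a)/.exec("banana");
console.log(zeroWidth[0].length); // 0 (nothing is consumed)
console.log(zeroWidth.index);     // 1 (the position just before the first "a")

// Replacing a zero-length match inserts text without removing anything.
const marked = "banana".replace(/(?=a)/g, "|");
console.log(marked); // "b|an|an|a"
```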

The regular expressions /\B(?=(\d{3})+$)/ and /(?=(\d{3})+$)/ mean almost the same thing, with one subtle difference. To explain that difference, we first need to introduce \b and \B.
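To see how the two expressions differ in practice, here is a small check. The inputs are arbitrary examples; the difference only shows up when the number of digits is a multiple of 3.

```javascript
// With \B, the position at the very start of the string is excluded.
const withB = "123456".replace(/\B(?=(\d{3})+$)/g, ",");
console.log(withB); // "123,456"

// Without \B, the start of the string also qualifies, so a leading comma appears.
const withoutB = "123456".replace(/(?=(\d{3})+$)/g, ",");
console.log(withoutB); // ",123,456"

// When the digit count is not a multiple of 3, both give the same result.
const sevenDigits = "1234567".replace(/(?=(\d{3})+$)/g, ",");
console.log(sevenDigits); // "1,234,567"
```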

The Meaning of \b and \B

\b

In regular expressions, an uppercase escape usually means the opposite of its lowercase counterpart. For example, \d matches a digit, so \D matches a non-digit. Let's first look at how the MDN documentation defines \b:

A word boundary matches the position where a word character is not followed or preceded by another word character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero.

To understand the definition of a word character, we need to understand \w, which is defined as follows:

Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_].

Now that we know what \w means, let's understand the meaning of "a word character is not followed or preceded by another word character". \b can appear in the following situations, and for clarity, we will use the term "word character" to represent \w:

  • At the start of the string, if the first character is a word character
  • Between a word character and a non-word character
  • At the end of the string, if the last character is a word character

The following diagram illustrates these situations:

[Figure: word boundary positions]

In fact, you can think of word boundaries as the edges of words. It is worth emphasizing that \b on its own is a zero-width match: it consumes no characters. That does not mean nothing matched; the match is simply a position.

Do not confuse this with cases where there are characters, such as d\b. This means matching the letter d followed by a word boundary. In this case, the actual matched character is d:

[Figure: word boundary with character, d\b matching d]
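The positions where \b matches can be listed with matchAll. The sample strings here are illustrative only.

```javascript
// Collect the index of every word boundary in the sample text.
const boundaryText = "hello, world";
const boundaries = [...boundaryText.matchAll(/\b/g)].map((m) => m.index);
console.log(boundaries); // [0, 5, 7, 12]

// d\b consumes the letter d; the boundary after it stays zero-width.
const dBeforeBoundary = /d\b/.exec("dad");
console.log(dBeforeBoundary[0]);    // "d"
console.log(dBeforeBoundary.index); // 2 (the first d is followed by "a", so no boundary there)
```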

\B

\B represents the opposite of a word boundary, meaning it matches any position that is not a word boundary. In the diagram below, the arrows indicate non-word boundary positions.

[Figure: non-word boundary positions]
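The same technique lists the positions where \B matches. In a string of digits, that is every position except the two ends.

```javascript
// Every gap between two digits is a non-word boundary.
const insideDigits = [..."1000000".matchAll(/\B/g)].map((m) => m.index);
console.log(insideDigits); // [1, 2, 3, 4, 5, 6]

// With a space in the middle, only the gaps inside each word remain.
const insideWords = [..."ab cd".matchAll(/\B/g)].map((m) => m.index);
console.log(insideWords); // [1, 4]
```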

Understanding Regular Expressions Correctly

Understanding regular expressions requires experience. However, it is helpful to have some concepts for practical development. A regular expression can be seen as a state machine transition. For example, \d+ can be represented as follows:

[Figure: state machine for \d+]

A complete diagram usually also has an initial state (so that non-digit input never reaches state 0), but the main idea is what matters here: label each arrow with the input it accepts, and follow an arrow to the next state only when the input matches. If the machine ends in a final (accepting) state, the match succeeds.

[Figure: state machine with an initial state]
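The state machine idea can be sketched by hand. The function below is a hypothetical illustration that accepts the same strings as /^\d+$/; it is not how a JavaScript engine actually executes regular expressions, and the state names are made up for this example.

```javascript
// Two states: "start" (nothing read yet) and "digits" (at least one digit read).
// "digits" is the only accepting state.
function matchesDigits(input) {
  let state = "start";
  for (const ch of input) {
    const isDigit = ch >= "0" && ch <= "9";
    if (!isDigit) {
      return false; // no arrow accepts this input, so the machine rejects
    }
    state = "digits"; // both states move to "digits" on a digit
  }
  return state === "digits"; // accept only if we ended in the final state
}

console.log(matchesDigits("123")); // true
console.log(matchesDigits(""));    // false
console.log(matchesDigits("12a")); // false
```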

Analyzing the Expressions

Method 1: Matching using zero-length matches

After providing the background information and necessary knowledge, we can finally analyze the first expression: /\B(?=(\d{3})+$)/g.

The leading \B matches a position that is not a word boundary. Next comes the lookahead (?=...). Inside it, (\d{3})+ matches one or more consecutive groups of three digits, such as 333, 666123, and so on, and $ requires those groups to run to the end of the string. Taken together, the expression matches every non-word-boundary position that is followed by one or more complete groups of three digits extending to the end of the string.

The interesting part is (\d{3})+$. Inside the lookahead, it succeeds only when the number of digits between the current position and the end of the string is a multiple of 3. For example, starting from the beginning of 123456 it succeeds because six digits remain. Starting from the beginning of 12345 it fails: \d{3} matches 123 once, but the remaining 45 is neither another group of three nor the end of the string.
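The "multiple of 3 from the current position" rule can be checked one position at a time. This sketch uses the sticky flag (y) to pin the match to a given index, which imitates what the lookahead does; the helper name groupsOfThreeToEnd is hypothetical.

```javascript
// Returns true if the text from `position` to the end consists of
// one or more complete groups of three digits.
function groupsOfThreeToEnd(text, position) {
  const probe = /(?:\d{3})+$/y; // y: the match must start exactly at lastIndex
  probe.lastIndex = position;
  return probe.test(text);
}

console.log(groupsOfThreeToEnd("123456", 0));  // true  (6 digits remain)
console.log(groupsOfThreeToEnd("12345", 0));   // false (5 digits remain)
console.log(groupsOfThreeToEnd("12345", 2));   // true  (3 digits remain)
console.log(groupsOfThreeToEnd("1000000", 1)); // true  (6 digits remain)
console.log(groupsOfThreeToEnd("1000000", 2)); // false (5 digits remain)
```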

Combining this property with \B, which rules out the position at the very start of the string, the expression matches two positions in the number 1000000, as shown below:

[Figure: the two positions matched in 1000000]

Therefore, when calling .replace, you can write it like this:

"1000000".replace(/\B(?=(\d{3})+$)/g, ",");

Based on the matching positions shown in the diagram, it will insert , at these two positions, resulting in 1,000,000. This is why this regular expression does not require $1, because both \B and (?=) are zero-length matches, meaning they do not consume any characters in the match, resulting in a match length of 0.
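The matched positions can also be printed instead of read off a diagram. The helper name addCommas below is hypothetical and only wraps the expression discussed above.

```javascript
// List the zero-length match positions inside "1000000".
const zeroLengthPattern = /\B(?=(\d{3})+$)/g;
const matchPositions = [..."1000000".matchAll(zeroLengthPattern)].map((m) => m.index);
console.log(matchPositions); // [1, 4]

// Wrap the replace call in a small helper for integer strings.
function addCommas(integerString) {
  return integerString.replace(/\B(?=(\d{3})+$)/g, ",");
}

console.log(addCommas("1000000")); // "1,000,000"
console.log(addCommas("10000"));   // "10,000"
console.log(addCommas("100"));     // "100"
```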

The matching process can be observed in the following video. The number of matches shown in the video is only for reference and may vary depending on the language. Some steps are also omitted, but it roughly demonstrates the process:

Method 2: Matching the digits that should have commas

/(\d)(?=(\d{3})+$)/

This expression is very similar to the previous one, except that \B is replaced by (\d). The one real difference is that (\d) actually consumes a digit, so the match is no longer zero-length. The matches look like this:

[Figure: the digits matched by (\d) in 1000000]

(The image includes (?:) to indicate that the match result is not captured in a group, but the result is the same)

I personally prefer to use (?:) when the captured value is not used, as it makes it easier for others and future me to understand.

The overall process will be like this: (omitting the unsuccessful match steps)

So in JavaScript, you would write it like this:

"1000000".replace(/(\d)(?=(\d{3})+$)/g, "$1,"); // Note the $1 here

The $1 is important because we want to include the matched digit in the replacement. If we only use ,, it will result in ,00,000.
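The effect of $1 is easy to confirm side by side. This sketch uses the non-capturing (?:) form inside the lookahead, as preferred above, so the only capture group is the digit.

```javascript
const digitPattern = /(\d)(?=(?:\d{3})+$)/g;

// The pattern consumes the digit that sits just before each comma position.
const consumedDigits = "1000000".match(digitPattern);
console.log(consumedDigits); // ["1", "0"]

// "$1," puts the consumed digit back and then adds the comma.
const withCapture = "1000000".replace(digitPattern, "$1,");
console.log(withCapture); // "1,000,000"

// Without $1 the consumed digits are lost.
const withoutCapture = "1000000".replace(digitPattern, ",");
console.log(withoutCapture); // ",00,000"

// A replacer function is an equivalent way to write the first form.
const withFunction = "1000000".replace(digitPattern, (match, digit) => digit + ",");
console.log(withFunction); // "1,000,000"
```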

Other Considerations and Approaches

Both regular expressions above rely on the condition (?=(\d{3})+$). In practice, the number may contain a decimal point, as in 1000.12. The lookahead then fails for the integer part, because its digits no longer run uninterrupted to the end of the string, and no comma is inserted.

In such cases the expression has to be adjusted, for example by replacing $ with \b so that the groups of three may end at any word boundary, including the one just before the decimal point.
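Here is a sketch of the \b variant, together with one caveat worth knowing: \b also matches at the end of the fractional part, so four or more decimal digits get grouped as well. The second helper, formatDecimalString, is a hypothetical alternative that formats only the part before the decimal point.

```javascript
// Variant with \b in place of $.
function withWordBoundary(text) {
  return text.replace(/\B(?=(\d{3})+\b)/g, ",");
}

console.log(withWordBoundary("1000.12"));   // "1,000.12"
console.log(withWordBoundary("1000.1234")); // "1,000.1,234" (the fraction is grouped too)

// Alternative sketch: split on the decimal point and group the integer part only.
function formatDecimalString(text) {
  const [integerPart, fractionPart] = text.split(".");
  const grouped = integerPart.replace(/\B(?=(\d{3})+$)/g, ",");
  return fractionPart === undefined ? grouped : grouped + "." + fractionPart;
}

console.log(formatDecimalString("1000.1234")); // "1,000.1234"
console.log(formatDecimalString("1234567"));   // "1,234,567"
```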

Additionally, modern browsers provide the Intl.NumberFormat API, which can be used out of the box without additional configuration. You can refer to the MDN documentation for usage examples.

new Intl.NumberFormat('ja-JP', { style: 'currency', currency: 'JPY' }).format(number);
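For plain thousands separators without a currency symbol, a formatter created with only a locale is enough. This is a small sketch assuming the en-US locale; other locales may use different separators.

```javascript
// A formatter with default options groups digits and keeps up to 3 decimals.
const usFormatter = new Intl.NumberFormat("en-US");

console.log(usFormatter.format(1234567)); // "1,234,567"
console.log(usFormatter.format(10000));   // "10,000"
console.log(usFormatter.format(1000.12)); // "1,000.12"
```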

Performance and Other Considerations

Since the results are the same, two considerations remain: readability and performance.

In terms of readability and ease of use, Intl.NumberFormat is the best option. The MDN documentation provides clear instructions, and it is very convenient to use.

The only remaining concern is performance. In a test on jsbench, Intl.NumberFormat is almost twice as slow as the regular expressions. This may be due to loading the internationalization (i18n) data and converting numbers for different locales.

[Figure: performance comparison]

The zero-length version is faster than the version that matches \d, probably because of its zero-length nature. Note, however, that an expression like (\d{3})+ backtracks in order to match as much as possible, which can cause performance problems, so such expressions should be used with care.

In practice, we can use requestIdleCallback to delay the initialization of Intl.NumberFormat to avoid excessive performance impact. Alternatively, we can wrap the logic in a separate function and initialize it only when it is actually called by other files. This should help mitigate performance issues.
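Both ideas can be sketched together: create the formatter lazily on first use, and optionally warm it up when the browser is idle. The names getFormatter and formatNumber are hypothetical, and the en-US locale is an assumption for the example.

```javascript
// The formatter is created on first use and reused afterwards.
let cachedFormatter = null;

function getFormatter() {
  if (cachedFormatter === null) {
    cachedFormatter = new Intl.NumberFormat("en-US"); // locale is an assumption
  }
  return cachedFormatter;
}

function formatNumber(value) {
  return getFormatter().format(value);
}

// Optional warm-up: only where requestIdleCallback exists (it is a browser API).
if (typeof requestIdleCallback === "function") {
  requestIdleCallback(() => getFormatter());
}

console.log(formatNumber(1234567)); // "1,234,567"
```

Whether this helps depends on where the cost actually lies, so it is worth measuring in the target application before adopting it.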

Other Approaches

The regular expressions mentioned above are mainly based on lookahead. If we want to achieve the same result using loops, how can we do it? Here is an alternative implementation of /(\d)(?=(?:\d{3})+\b)/g:

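// Example input (any finite number works)
const number = 1234567.5;
let digits = number.toFixed(2); // toFixed already returns a string
const matcher = /(\d)(?=(?:\d{3})+\b)/g;

// With the g flag, test() resumes from matcher.lastIndex, which points just
// past the matched digit, i.e. exactly where the comma belongs.
while (matcher.test(digits)) {
  const first = digits.slice(0, matcher.lastIndex);
  const second = digits.slice(matcher.lastIndex);
  digits = first + "," + second;
}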

Here is a more intuitive approach that replaces one match at a time:

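// Example input (any finite number works)
const number = 1234567.5;
let digits = number.toFixed(2); // toFixed already returns a string
// No g flag: each replace() call inserts exactly one comma, starting from the right
const matcher = /(\d+)(\d{3})/;

while (matcher.test(digits)) {
  digits = digits.replace(matcher, "$1,$2");
}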
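As a sanity check, the approaches can be wrapped in functions and compared on a few inputs. This is only a sketch: the function names and sample values are made up, and each function expects a string produced by toFixed(2).

```javascript
function viaZeroLength(text) {
  return text.replace(/\B(?=(\d{3})+\b)/g, ",");
}

function viaCapturedDigit(text) {
  return text.replace(/(\d)(?=(?:\d{3})+\b)/g, "$1,");
}

function viaWhileLoop(text) {
  const matcher = /(\d)(?=(?:\d{3})+\b)/g;
  let digits = text;
  while (matcher.test(digits)) {
    // lastIndex points just past the matched digit
    digits = digits.slice(0, matcher.lastIndex) + "," + digits.slice(matcher.lastIndex);
  }
  return digits;
}

function viaSimpleLoop(text) {
  const matcher = /(\d+)(\d{3})/;
  let digits = text;
  while (matcher.test(digits)) {
    digits = digits.replace(matcher, "$1,$2");
  }
  return digits;
}

const approaches = [viaZeroLength, viaCapturedDigit, viaWhileLoop, viaSimpleLoop];
const samples = [0, 999, 1000, 1234567.5].map((n) => n.toFixed(2));

for (const sample of samples) {
  const results = approaches.map((fn) => fn(sample));
  const allAgree = results.every((r) => r === results[0]);
  console.log(sample, "->", results[0], allAgree ? "(all agree)" : "(mismatch)");
}
```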

Let's review the results again:

[Figure: screenshot of the benchmark results]

Name                                               Ops/s              Relative
Zero-length /\B(?=(\d{3})+\b)/g                    1778943 ops/s      fastest
Matching using zero-length matches (without \B)    1712701 ops/s      3.72% slower
While loop                                         1371453 ops/s      22.91% slower
Simple loop                                        597173.88 ops/s    66.43% slower
Intl.NumberFormat                                  25304.89 ops/s     98.55% slower

The fastest method is still the zero-length match, with the variant without \B close behind, followed by the while loop. The slowest method is still Intl.NumberFormat. If you are interested in the test results, you can run them yourself from the link.

Conclusion

There is a lot to learn about regular expressions, and concepts like lookahead and word boundary are not often mentioned. I have summarized them here. Many concepts are well-documented in the MDN documentation, and the website Regex101 provides a visual representation of regular expressions, making it convenient to understand. However, I still find regular expressions difficult to understand.
