
In Node.js, it is common to use the fs module to manipulate files. However, in high-throughput scenarios, any I/O operation should be handled with caution. For example, the following code is commonly used to read a file:

const fs = require('fs')

// readFile loads the entire file into memory as a single Buffer
fs.readFile('./text.txt', (err, data) => {
  if (!err) {
    console.log(data.toString())
  }
})

This approach is fine for reading small files, but it leads to a significant memory footprint for large files, because the entire content must fit into a single Buffer. In Node.js, the maximum buffer size is determined by the platform's pointer size:

const buffer = require('buffer')

console.log(buffer.constants.MAX_LENGTH)
// 4294967296 = 4GB (on a 64-bit platform)

This means that the readFile approach above will fail outright once the file size exceeds 4GB.
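To make that failure mode concrete, here is a minimal sketch (huge.bin is a hypothetical large file) that checks the file size against the Buffer limit before reading:

const fs = require('fs')
const { constants } = require('buffer')

// huge.bin is a hypothetical file larger than the Buffer limit
fs.stat('./huge.bin', (err, stats) => {
  if (err) throw err
  if (stats.size > constants.MAX_LENGTH) {
    // readFile would fail here: the content cannot fit in one Buffer
    console.error('File too large for a single Buffer; use a stream instead')
    return
  }
  fs.readFile('./huge.bin', (err, data) => {
    if (!err) console.log(data.length)
  })
})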

Working with streams

Experienced developers usually reach for fs.createReadStream to handle files and sidestep the size limit. The main difference between createReadStream and readFile is that a stream splits a large file into many smaller chunks (64KB each by default for fs streams). For web services, streaming data is a natural fit: it allows browsers to start rendering HTML as soon as partial data arrives, without waiting for the entire content to be transmitted, as the sketch below shows.
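To illustrate that last point, a minimal sketch of a hypothetical file server (index.html and port 3000 are assumptions) that streams a file to the browser instead of buffering it first:

const fs = require('fs')
const http = require('http')

// Each chunk is written to the response as soon as it is read,
// so the browser can start rendering before the whole file is read
http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' })
  fs.createReadStream('./index.html').pipe(res)
}).listen(3000)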

Our earlier file-reading code, rewritten with createReadStream, looks like this:

const fs = require('fs')
let data = ''

const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})

At first glance there doesn't seem to be anything wrong with this code, and it works well; it even covers most file-handling scenarios. However, if we examine data += chunk closely, something is suspicious. Some developers naturally treat chunk as a string, but the data emitted by the stream is actually a Buffer, so data += chunk is really concatenating the result of chunk.toString(). At this point, alarm bells should ring for some developers.
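The implicit conversion is easy to verify with the same text.txt:

const fs = require('fs')

const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  console.log(Buffer.isBuffer(chunk))          // true: chunks are Buffers
  console.log('' + chunk === chunk.toString()) // true: += calls toString()
})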

That's right! Encoding is crucial when working with strings. By default, buffer-to-string conversion uses UTF-8. Therefore, the data += chunk approach may produce garbled output, because UTF-8 represents a character with 1, 2, 3, or 4 bytes. For demonstration purposes, I have set the highWaterMark down to 5 bytes.

// text.txt contains: 這是一篇部落格PO文

const fs = require('fs')
let data = ''

const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})

The output will be:

這��一���部落��PO��

Since each chunk is capped at 5 bytes, a chunk can end in the middle of a character: every Chinese character in the sample text occupies 3 bytes in UTF-8, so a 5-byte chunk inevitably cuts one in half, and chunk.toString() turns the incomplete byte sequence into the replacement character �.
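The byte math is easy to check directly:

const buf = Buffer.from('這是')            // two characters, 3 bytes each
console.log(buf.length)                    // 6
console.log(buf.subarray(0, 5).toString()) // 這� – the second character is cut off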

Proper concatenation using Buffer.concat

To work with buffers correctly, it is best to concatenate them with the APIs Node.js provides and convert to a string only once, at the end, so no character is ever decoded in pieces:

const fs = require('fs')
const data = []
let size = 0

const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data.push(chunk)
  size += chunk.length
})

stream.on('end', () => {
  // concatenate the raw bytes first, then decode once
  console.log(Buffer.concat(data, size).toString())
})

This way, encoding issues are avoided. The approach can be cumbersome, though. If you only need to perform simple analysis, readFile or readFileSync is perfectly fine; it is large-file analysis and high-throughput scenarios that demand attention to these details. (Side note: at that point, you might also consider using another language.)
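As a final side note on the bug above: if you do want strings out of the stream, you don't have to decode chunks yourself. A minimal sketch: when you pass an encoding to createReadStream, Node.js decodes for you, and its internal string decoder holds on to incomplete multi-byte sequences until the next chunk arrives.

const fs = require('fs')
let data = ''

const stream = fs.createReadStream('./text.txt', {
  highWaterMark: 5,
  encoding: 'utf8', // chunks now arrive as correctly decoded strings
})

stream.on('data', chunk => {
  data += chunk // safe: no character is ever split across chunks
})

stream.on('end', () => {
  console.log(data) // 這是一篇部落格PO文
})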

Conclusion

When working with large files, avoid loading the entire file into memory at once; use streams instead. When passing data along, keeping it in Buffers improves throughput and avoids unnecessary conversions, and when you do convert to strings, make sure the encoding is handled correctly.



Author


愷開 | Kalan

Hi, I'm Kai. I'm Taiwanese and moved to Japan in 2019 for work; I'm currently settled in Fukuoka. Besides frontend development, I also have experience in IoT, app development, backend, and electronics. Recently I started playing electric guitar! Feel free to contact me via email about consulting, collaboration, or music! I hope to connect with more people through this blog.