
Some tricks about reading files in Node.js

In Node.js, it is common to use the fs module to manipulate files. However, in high-throughput scenarios, any I/O operation should be handled with caution. For example, the following code is commonly used to read a file:

const fs = require('fs')
fs.readFile('./text.txt', (err, data) => {
  if (!err) {
    console.log(data.toString())
  }
})

This approach is fine for reading small files, but it can create a significant memory footprint with large files, because the entire file is read into a single Buffer. In Node.js, the maximum buffer size is determined by the platform's pointer size.

const buffer = require('buffer')

console.log(buffer.constants.MAX_LENGTH)
// 4294967296 = 4GB (on a 64-bit platform; the exact value depends on the platform and Node.js version)

This means that the code above will fail if the file is larger than 4GB.
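
Before reaching for readFile on a file of unknown size, one option is to check the size first. The snippet below is only a rough sketch (the './text.txt' path is a placeholder, not from the original post): it compares the file size against the buffer limit and falls back to a stream when the file is too large.

const fs = require('fs')
const { constants } = require('buffer')

const path = './text.txt' // placeholder path
const { size } = fs.statSync(path)

if (size < constants.MAX_LENGTH) {
  // Small enough to fit in a single Buffer
  console.log(fs.readFileSync(path).toString())
} else {
  // Too large for a single Buffer: read it with a stream instead (see below)
  console.log('File exceeds the maximum Buffer size, use createReadStream')
}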

Working with streams

Experienced developers usually reach for fs.createReadStream to handle files and sidestep the size limit. The main difference between createReadStream and readFile is that a stream splits a large file into many small chunks (64 KB each by default) instead of loading everything at once. For web services, streaming data is a natural fit: the browser can start rendering HTML as soon as the first chunks arrive, without waiting for the entire response, as in the sketch below.
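
As a small illustration of that point, a read stream can be piped directly into an HTTP response. This is only a sketch under assumptions not in the original post (the './index.html' file and port 3000 are placeholders):

const fs = require('fs')
const http = require('http')

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' })
  // The browser starts receiving (and rendering) chunks before the file is fully read
  fs.createReadStream('./index.html').pipe(res)
}).listen(3000)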

The rewritten code using createReadStream would look like this:

const fs = require('fs')
let data = ''

const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})

At first glance, there doesn't seem to be anything wrong with this code, and it works well; it even covers most file-handling scenarios. But if we look closely at data += chunk, something suspicious appears. Some developers naturally treat chunk as a string, yet what the stream emits is actually a Buffer, so data += chunk is really concatenating the result of chunk.toString(). At this point, alarm bells should start ringing for some developers.

That's right! Encoding is crucial when working with strings. By default, buffer-to-string conversion uses UTF-8, and a UTF-8 character can occupy 1, 2, 3, or 4 bytes, so the data += chunk approach can produce garbled output. For demonstration purposes, I have lowered the highWaterMark to 5 bytes.

// text.txt
這是一篇部落格PO文

const fs = require('fs')
let data = ''

const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})

The output will be:

這��一���部落��PO��

Since each chunk is limited to 5 bytes, a multi-byte UTF-8 character can be split across two chunks; when that happens, chunk.toString() only sees part of the character's bytes and decodes them as replacement characters.
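
To see this at the byte level, here is a tiny demonstration (not from the original post) that slices the UTF-8 bytes of '這是' in the middle of a character:

const buf = Buffer.from('這是', 'utf8') // 6 bytes: each character takes 3 bytes
console.log(buf.subarray(0, 5).toString()) // '這' followed by replacement character(s) for the truncated bytes
console.log(buf.toString()) // '這是'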

Proper concatenation using Buffer.concat

To work with buffers correctly, it is best to concatenate them with the APIs Node.js provides and only convert the result to a string at the end, which avoids encoding issues.

const fs = require('fs')
const data = []
let size = 0

const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data.push(chunk)
  size += chunk.length
})

stream.on('end', () => {
  // Concatenate the raw bytes first, then decode once
  console.log(Buffer.concat(data, size).toString())
})

This way, encoding issues can be avoided. The approach is a bit more cumbersome, though. If you only need to do some simple analysis, readFile or readFileSync is perfectly fine; but for large-file analysis or high-throughput scenarios, these details start to matter. (Side note: at that point, you might also consider using another language.)
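
For line-oriented analysis of a large file, another common pattern is to feed the read stream into the built-in readline module, which reassembles lines across chunk boundaries for you. A rough sketch, assuming a placeholder './big.log' file:

const fs = require('fs')
const readline = require('readline')

const rl = readline.createInterface({
  input: fs.createReadStream('./big.log'), // placeholder file name
  crlfDelay: Infinity // treat \r\n as a single line break
})

let lines = 0
rl.on('line', () => {
  lines += 1 // process each line here without holding the whole file in memory
})
rl.on('close', () => {
  console.log(`total lines: ${lines}`)
})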

Conclusion

When working with large files, avoid loading the entire file into memory at once; use streams instead. When transmitting data, handling it as Buffers can improve throughput and avoid unnecessary conversions, but encoding still has to be handled correctly when the bytes are finally turned into text.
