This post is translated by ChatGPT and originally written in Mandarin, so there may be some inaccuracies or mistakes.
In Node.js, working with files through the fs module is common practice. However, in high-throughput scenarios any I/O operation should be approached with caution. For example, here’s a typical piece of code for reading a file:
const fs = require('fs')
fs.readFile('./text.txt', (err, data) => {
  if (!err) {
    console.log(data.toString())
  }
})
This approach works fine for reading small files, but if the file is too large, it can lead to a significant memory footprint. Node.js determines the maximum Buffer size based on the platform's integer pointer length.
const buffer = require('buffer')

console.log(buffer.constants.MAX_LENGTH)
// 4294967296 bytes = 4 GiB (on a 64-bit platform)
This means that if the file exceeds 4GB, the above code becomes problematic, and Node.js will throw an error.
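As a quick illustration (my own sketch, not from the original post), you could check the file size up front and only use readFile when the whole file fits in a single Buffer; the file name here is just a placeholder:

const fs = require('fs')
const { constants } = require('buffer')

// Hypothetical guard: read in one shot only when the file fits in a Buffer.
const { size } = fs.statSync('./text.txt')

if (size < constants.MAX_LENGTH) {
  fs.readFile('./text.txt', (err, data) => {
    if (!err) console.log(data.length)
  })
} else {
  console.log('File is too large for a single Buffer; use a stream instead')
}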
Working with Streams
Experienced developers often reach for fs.createReadStream to avoid these file-size issues. The main difference from readFile is that a stream breaks a large file into many chunks, each only a few dozen KB in size (64 KB by default). For web services, transmitting data via streams is quite natural; it even lets the browser start rendering HTML as soon as part of the content arrives, instead of waiting for the entire response, as the sketch below shows.
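Here is a minimal sketch of that idea (my own example, not from the original post): piping a file straight into an HTTP response so the browser receives chunks as they are read. The port and file name are placeholders.

const fs = require('fs')
const http = require('http')

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' })
  // Each chunk is written to the response as soon as it is read,
  // so the whole file never sits in memory at once.
  fs.createReadStream('./index.html').pipe(res)
}).listen(3000)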
Here’s how to rewrite the code using createReadStream:
const fs = require('fs')
let data = ''
const stream = fs.createReadStream('./text.txt')
stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
At first glance everything seems fine: the program runs well, and this pattern works for most file-processing tasks. However, a closer look at data += chunk reveals a potential issue. Some developers treat chunk as a string without realizing that what the stream emits is actually a Buffer, so data += chunk implicitly calls chunk.toString() behind the scenes. At this point, some developers may start to feel alarmed.
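You can confirm this with a one-line check (my own addition, not from the original post):

const fs = require('fs')

const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  // Without an explicit encoding, every chunk is a Buffer, not a string.
  console.log(Buffer.isBuffer(chunk)) // true
})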
That’s right! The most crucial aspect of working with strings is the encoding. By default, converting a Buffer to a string uses UTF-8, and UTF-8 represents a single character with 1, 2, 3, or 4 bytes. So the data += chunk pattern can produce incorrect results whenever a character’s bytes are split across two chunks. For demonstration purposes, I lowered the highWaterMark to 5 bytes.
// text.txt
這是一篇部落格PO文
const fs = require('fs')
let data = ''
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })
stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
The output will be:
這��一���部落��PO��
Since each chunk is at most 5 bytes, a multi-byte character can be cut in half at a chunk boundary, and chunk.toString() turns the incomplete bytes into replacement characters because not all of that character’s data has arrived yet.
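You can reproduce the boundary problem without a stream at all (an illustrative snippet of my own, not from the original post): decode only the first 5 bytes of the sample text and the character that straddles the cut comes out garbled.

const buf = Buffer.from('這是一篇部落格PO文', 'utf8')

// '這' and '是' each take 3 bytes in UTF-8, so byte 5 falls in the middle of '是'.
console.log(buf.subarray(0, 5).toString())
// Prints 這 followed by replacement characters (�)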
Correct Concatenation Method: Buffer.concat
To use Buffers correctly, collect the raw chunks, join them with Buffer.concat, and only convert the result to a string once everything has arrived; that way the encoding issue disappears.
const fs = require('fs')

const data = []
let size = 0
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  // Keep the raw Buffers; don't decode anything yet.
  data.push(chunk)
  size += chunk.length
})

stream.on('end', () => {
  // Join every chunk into a single Buffer, then decode it in one pass.
  console.log(Buffer.concat(data, size).toString())
})
This approach avoids the encoding problem, but it is admittedly more cumbersome to write. If you’re just doing simple analysis, using readFile or readFileSync isn’t a bad idea; but when you’re analyzing large files or handling high throughput, these details become critical. (A side note: at this point, you might just opt for a different programming language.)
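As a middle ground (my own addition, not something the original post covers), Node’s built-in string_decoder module holds on to a trailing partial character from one chunk and completes it with the next, so the incremental data += pattern can still be used safely:

const fs = require('fs')
const { StringDecoder } = require('string_decoder')

const decoder = new StringDecoder('utf8')
let data = ''
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  // decoder.write() buffers incomplete multi-byte sequences
  // instead of emitting replacement characters.
  data += decoder.write(chunk)
})

stream.on('end', () => {
  data += decoder.end()
  console.log(data)
})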
Conclusion
When working with large files, avoid loading them into memory all at once; use stream-based operations instead. When transmitting data, pass Buffers around to keep throughput high and avoid unnecessary encoding and decoding, and whenever you do convert to strings, stay mindful of how multi-byte characters fall across chunk boundaries.
If you found this article helpful, please consider buying me a coffee ☕ It'll make my ordinary day shine ✨