In Node.js, it is common to use the fs module to manipulate files. However, in high-throughput scenarios any I/O operation should be handled with caution. For example, the following code is a common way to read a file:
const fs = require('fs')

fs.readFile('./text.txt', (err, data) => {
  if (!err) {
    // data is a Buffer holding the entire file in memory
    console.log(data.toString())
  }
})
This approach is fine for small files, but it loads the whole file into memory, which leads to a significant memory footprint for large files. In Node.js, the maximum size of a single Buffer depends on the platform's pointer size (32-bit vs. 64-bit):

const { constants } = require('buffer')

console.log(constants.MAX_LENGTH)
// 4294967296 = 4GB (on a 64-bit platform; the exact limit varies with the Node.js version)

This means the readFile call above will fail outright once the file exceeds that limit.
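As a rough guard (an illustrative sketch, not something from the original example), you can check the file size up front and fall back to a stream when the file would not fit into a single Buffer:

const fs = require('fs')
const { constants } = require('buffer')

const { size } = fs.statSync('./text.txt')
if (size >= constants.MAX_LENGTH) {
  // readFile would fail here, so a stream is the only sensible option
  console.log('file too large for a single Buffer, use a stream instead')
} else {
  console.log(fs.readFileSync('./text.txt').toString())
}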
Working with streams
Experienced developers usually reach for fs.createReadStream to handle file operations and avoid file-size issues. The main difference between createReadStream and readFile is that a stream splits a large file into many small chunks (64 KB each by default) instead of reading everything at once. For web services, streaming data is a natural choice: it lets browsers start rendering HTML as soon as partial data arrives, without waiting for the entire content to be transmitted.
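For example, a minimal (purely illustrative) HTTP server can pipe a file stream straight into the response, so the client starts receiving bytes before the whole file has been read; the file name and port below are just placeholders:

const fs = require('fs')
const http = require('http')

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' })
  // each chunk is sent to the client as soon as it has been read from disk
  fs.createReadStream('./index.html').pipe(res)
}).listen(3000)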
Rewritten with createReadStream, the file-reading example from above looks like this:
const fs = require('fs')

let data = ''
const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
At first glance there don't seem to be any issues with this code, and it works well; it even covers most file-handling scenarios. However, if we look at data += chunk closely, something suspicious stands out. Some developers naturally treat chunk as a string, but what the stream emits is actually a Buffer, so data += chunk is really concatenating the result of chunk.toString(). At this point, an alarm bell should ring for some developers.
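A quick sanity check (not part of the original example) confirms the chunk type:

const fs = require('fs')

const stream = fs.createReadStream('./text.txt')
stream.on('data', chunk => {
  // without an explicit encoding, the stream emits Buffer objects
  console.log(Buffer.isBuffer(chunk)) // true
})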
That's right: encoding is crucial when working with strings. By default, Buffer-to-string conversion uses UTF-8, so the data += chunk approach may produce incorrect output, because UTF-8 encodes a single character with 1, 2, 3, or 4 bytes and a chunk boundary can land in the middle of one. For demonstration purposes, I have set the highWaterMark to 5 bytes.
// text.txt (the content means "This is a blog post")
這是一篇部落格PO文
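Every CJK character in this file takes 3 bytes in UTF-8 (P and O take 1 byte each), so 5-byte chunks cannot line up with character boundaries. A quick check, just for illustration:

console.log(Buffer.byteLength('這', 'utf8')) // 3
console.log(Buffer.byteLength('這是一篇部落格PO文', 'utf8')) // 26

Reading the file with the reduced highWaterMark then looks like this: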
const fs = require('fs')

let data = ''
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
The output will be:
這��一���部落��PO��
Since each chunk is capped at 5 bytes, a multi-byte character can be split across two chunks, and chunk.toString() decodes the incomplete bytes at each boundary into replacement characters.
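The same effect can be reproduced in isolation (a minimal illustration) by cutting a buffer in the middle of a character:

const buf = Buffer.from('這是') // 6 bytes, 3 per character
// cutting after 5 bytes leaves 是 incomplete, so both halves decode
// the broken bytes into replacement characters (�)
console.log(buf.subarray(0, 5).toString())
console.log(buf.subarray(5).toString())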
Proper concatenation using Buffer.concat
To work with buffers correctly, collect the raw chunks, join them with the Buffer APIs that Node.js provides, and only convert the complete result to a string, so encoding issues are avoided:
const fs = require('fs')

let data = []
let size = 0
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  // keep the raw Buffer chunks; don't decode them yet
  data.push(chunk)
  size += chunk.length
})

stream.on('end', () => {
  // concatenate every chunk first, then decode the complete buffer once
  console.log(Buffer.concat(data, size).toString())
})
This avoids the encoding problem, although it is more verbose. If you only need to do some simple analysis, readFile or readFileSync is perfectly fine; but for large files or high-throughput scenarios, these details start to matter. (Side note: at that point, you might also consider using another language.)
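For convenience, the same stream-and-concat pattern can be wrapped into a small promise-based helper; the name readLargeFile below is just an illustration, not an official API:

const fs = require('fs')

function readLargeFile (path) {
  return new Promise((resolve, reject) => {
    const chunks = []
    const stream = fs.createReadStream(path)
    stream.on('data', chunk => chunks.push(chunk))
    stream.on('error', reject)
    // decode only after the complete buffer has been assembled
    stream.on('end', () => resolve(Buffer.concat(chunks).toString()))
  })
}

readLargeFile('./text.txt').then(console.log)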
Conclusion
When working with large files, avoid loading the entire file into memory at once; use streams for file operations instead. When transmitting data, keeping it as a Buffer improves throughput and avoids unnecessary conversions, but whenever you do convert to a string, make sure the encoding is handled correctly, for example by concatenating complete buffers before decoding.