In Node.js, it is common to use the fs module to manipulate files. However, in high-throughput scenarios any I/O operation should be handled with caution. For example, the following code is a common way to read a file:
const fs = require('fs')

fs.readFile('./text.txt', (err, data) => {
  if (!err) {
    // data is a Buffer holding the entire file in memory
    console.log(data.toString())
  }
})
This approach is fine for small files, but it loads the whole file into memory, which leads to a significant memory footprint for large files. In Node.js, the maximum size of a single Buffer depends on the platform's pointer size (32-bit vs. 64-bit):

const { constants } = require('buffer')

console.log(constants.MAX_LENGTH)
// 4294967296 = 4GB (on a 64-bit platform; the exact limit varies with the Node.js version)

This means the readFile call above will fail outright once the file exceeds that limit.
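As a rough guard (an illustrative sketch, not something from the original example), you can check the file size up front and fall back to a stream when the file would not fit into a single Buffer:

const fs = require('fs')
const { constants } = require('buffer')

const { size } = fs.statSync('./text.txt')
if (size >= constants.MAX_LENGTH) {
  // readFile would fail here, so a stream is the only sensible option
  console.log('file too large for a single Buffer, use a stream instead')
} else {
  console.log(fs.readFileSync('./text.txt').toString())
}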
Working with streams
Experienced developers usually reach for fs.createReadStream to handle file operations and avoid file-size issues. The main difference between createReadStream and readFile is that a stream splits a large file into many small chunks (64 KB each by default) instead of reading everything at once. For web services, streaming data is a natural choice: it lets browsers start rendering HTML as soon as partial data arrives, without waiting for the entire content to be transmitted.
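For example, a minimal (purely illustrative) HTTP server can pipe a file stream straight into the response, so the client starts receiving bytes before the whole file has been read; the file name and port below are just placeholders:

const fs = require('fs')
const http = require('http')

http.createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' })
  // each chunk is sent to the client as soon as it has been read from disk
  fs.createReadStream('./index.html').pipe(res)
}).listen(3000)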
Rewritten with createReadStream, the file-reading example from above looks like this:
const fs = require('fs')

let data = ''
const stream = fs.createReadStream('./text.txt')

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
At first glance there don't seem to be any issues with this code, and it works well; it even covers most file-handling scenarios. However, if we look at data += chunk closely, something suspicious stands out. Some developers naturally treat chunk as a string, but what the stream emits is actually a Buffer, so data += chunk is really concatenating the result of chunk.toString(). At this point, an alarm bell should ring for some developers.
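A quick sanity check (not part of the original example) confirms the chunk type:

const fs = require('fs')

const stream = fs.createReadStream('./text.txt')
stream.on('data', chunk => {
  // without an explicit encoding, the stream emits Buffer objects
  console.log(Buffer.isBuffer(chunk)) // true
})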
That's right: encoding is crucial when working with strings. By default, Buffer-to-string conversion uses UTF-8, so the data += chunk approach may produce incorrect output, because UTF-8 encodes a single character with 1, 2, 3, or 4 bytes and a chunk boundary can land in the middle of one. For demonstration purposes, I have set the highWaterMark to 5 bytes.
// text.txt (the content means "This is a blog post")
這是一篇部落格PO文
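Every CJK character in this file takes 3 bytes in UTF-8 (P and O take 1 byte each), so 5-byte chunks cannot line up with character boundaries. A quick check, just for illustration:

console.log(Buffer.byteLength('這', 'utf8')) // 3
console.log(Buffer.byteLength('這是一篇部落格PO文', 'utf8')) // 26

Reading the file with the reduced highWaterMark then looks like this: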
const fs = require('fs')

let data = ''
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  data += chunk
})

stream.on('end', () => {
  console.log(data)
})
The output will be:
這��一���部落��PO��
Since each chunk is capped at 5 bytes, a multi-byte character can be split across two chunks, and chunk.toString() decodes the incomplete bytes at each boundary into replacement characters.
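The same effect can be reproduced in isolation (a minimal illustration) by cutting a buffer in the middle of a character:

const buf = Buffer.from('這是') // 6 bytes, 3 per character
// cutting after 5 bytes leaves 是 incomplete, so both halves decode
// the broken bytes into replacement characters (�)
console.log(buf.subarray(0, 5).toString())
console.log(buf.subarray(5).toString())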
Proper concatenation using Buffer.concat
To work with buffers correctly, collect the raw chunks, join them with the Buffer APIs that Node.js provides, and only convert the complete result to a string, so encoding issues are avoided:
const fs = require('fs')

let data = []
let size = 0
const stream = fs.createReadStream('./text.txt', { highWaterMark: 5 })

stream.on('data', chunk => {
  // keep the raw Buffer chunks; don't decode them yet
  data.push(chunk)
  size += chunk.length
})

stream.on('end', () => {
  // concatenate every chunk first, then decode the complete buffer once
  console.log(Buffer.concat(data, size).toString())
})
This avoids the encoding problem, although it is more verbose. If you only need to do some simple analysis, readFile or readFileSync is perfectly fine; but for large files or high-throughput scenarios, these details start to matter. (Side note: at that point, you might also consider using another language.)
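For convenience, the same stream-and-concat pattern can be wrapped into a small promise-based helper; the name readLargeFile below is just an illustration, not an official API:

const fs = require('fs')

function readLargeFile (path) {
  return new Promise((resolve, reject) => {
    const chunks = []
    const stream = fs.createReadStream(path)
    stream.on('data', chunk => chunks.push(chunk))
    stream.on('error', reject)
    // decode only after the complete buffer has been assembled
    stream.on('end', () => resolve(Buffer.concat(chunks).toString()))
  })
}

readLargeFile('./text.txt').then(console.log)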
Conclusion
When working with large files, avoid loading the entire file into memory at once; use streams for file operations instead. When transmitting data, keeping it as a Buffer improves throughput and avoids unnecessary conversions, but whenever you do convert to a string, make sure the encoding is handled correctly, for example by concatenating complete buffers before decoding.