@matey-jack
Last active January 26, 2019 18:24
Streaming in NodeJs: Processing a file much larger than available RAM

To try this out:

wget https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip
unzip indiv18.zip
node find-most-frequent-name.js itcont.txt

The interesting thing is that while other solutions to this problem require increasing NodeJs' memory limit, this one works even when we restrict the limit and use less memory than the default:

node --max-old-space-size=300 find-most-frequent-name.js itcont.txt

Also worth noting is that no JavaScript dependencies are needed to achieve this. We don't even need to copy or reimplement anything from a library – NodeJs simply has everything needed on board.
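
To make that point concrete, here is the core of the approach as a condensed sketch (the file name is hard-coded here for brevity; the full program below takes it from the command line):

const fs = require('fs');
const readline = require('readline');

// Stream the file line by line: only the current buffer chunk is held in
// memory, so the file can be far larger than the heap.
readline
    .createInterface({ input: fs.createReadStream('itcont.txt'), crlfDelay: Infinity })
    .on('line', (line) => {
        // process one line at a time here
    });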

I got the idea for writing this example after reading through the code in this blog post. The author notes that only one of her solutions (the one using the EventStream library) could handle the large file without running out of memory. The way she used Node's readline module seemed a bit suspicious to me (for example, she passes an output stream that is never used, and I had no idea what it was for), so I thought I would try to show that pure NodeJs can handle that data really well.
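
For context, readline's `output` option is meant for interactive use: the interface writes its prompts to that stream. A minimal sketch of my own (an illustration based on the Node docs, not code from the blog post):

const readline = require('readline');

// `output` is where the interface writes its prompts, e.g. for rl.question().
// When merely reading a file line by line, it serves no purpose and can be omitted.
const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
});

rl.question('Which file should I process? ', (answer) => {
    console.log(`You picked: ${answer}`);
    rl.close();
});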

As I gave the problem of finding the most common first name a fresh look, I decided to go straight for a solution that aggregates name counts as the data is read. This, of course, uses much less memory than the original program. (In her defense, the original problem statement actually required reading "all names into an array", so the bad performance is caused by an over-specification of the problem. Not so uncommon in the field, haha!)
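
In a nutshell, the difference looks like this (a toy sketch of the counting pattern used in the full program below):

const nameCounts = {};

// Aggregating while reading: memory grows with the number of *distinct*
// names, not with the number of lines in the file.
function countName(name) {
    nameCounts[name] = (nameCounts[name] || 0) + 1;
}

countName('LAURA');
countName('LAURA');
countName('JAMES');
console.log(nameCounts); // { LAURA: 2, JAMES: 1 }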

Also in defense of Node's readline: while her readline-based solution crashes even on a reduced data set of only 5 million entries, the EventStream-based solution also crashes on the full data set of 19 million entries, and in fact already on just 10 million.

So the takeaways are:

  • NodeJs has great built-in support for processing data far bigger than available RAM (which can translate into real money saved if you run in the cloud, where smaller RAM allocations cost less)
  • more important than the technology: rephrasing the problem can lead to a better solution (that, of course, requires an open discussion with whoever stated the problem, and depending on the circumstances the problem might not be rephrasable after all)

My fork of Paige's repository contains some hints on how to easily reduce the size of the data set, plus fixes so the programs can deal with the stripped-down data. Below is the complete program, find-most-frequent-name.js.

"use strict";
const fs = require('fs');
const readline = require('readline');
const { performance: perf } = require('perf_hooks');
/**
* Finds the largest value of all fields in `obj`
* and returns its field name (key).
*/
function maxEntry(obj) {
let maxValue = 0;
let maxKey = null;
for (let key of Object.keys(obj)) {
if (obj[key] > maxValue) {
maxKey = key;
maxValue = obj[key];
}
}
return maxKey;
}
function getFirstName(name) {
    // adapted from [Paige Niedringhaus](https://github.com/paigen11/file-read-challenge/blob/master/readFileStream.js)
    let firstHalfOfName = name.split(', ')[1];
    if (firstHalfOfName !== undefined) {
        // trim() returns a new string, so its result must be assigned to take effect
        firstHalfOfName = firstHalfOfName.trim();
        // filter out middle initials
        if (firstHalfOfName.includes(' ') && firstHalfOfName !== ' ') {
            let firstName = firstHalfOfName.split(' ')[0];
            return firstName.trim();
        } else {
            return firstHalfOfName;
        }
    }
}
function getFirstNameRegEx(name) {
    // idea for this taken from [Stuart Marks](https://stuartmarks.wordpress.com/2019/01/11/processing-large-files-in-java/)
    // the documentation assures me that a regex in this form is compiled at parse-time
    // and is indeed a constant in memory.
    const regex = /, (\S+)/;
    const match = regex.exec(name);
    return match && match[1];
}

const numberFormat4 = new Intl.NumberFormat('en-us', { maximumSignificantDigits: 4 });
const numberFormatFull = new Intl.NumberFormat('en-us');
function main(args) {
    console.log(`Opening file '${args[0]}'.`);
    const rl = readline.createInterface({
        input: fs.createReadStream(args[0]),
        crlfDelay: Infinity, // treat \r\n as a single line break
    });
    const timeStartReadLoop = perf.now();
    let timeInsideReadLoop = 0;
    const nameCounts = {};
    let lineCount = 0;
    rl.on('line', (line) => {
        const timeStartInside = perf.now();
        lineCount++;
        const fields = line.split('|');
        // fields[4] holds the transaction date, which this program doesn't use
        const name = getFirstNameRegEx(fields[7]); // to count full names, just use `fields[7];`
        nameCounts[name] = (nameCounts[name] || 0) + 1;
        timeInsideReadLoop += perf.now() - timeStartInside;
    });
    rl.on('close', function () {
        const totalTime = numberFormat4.format(perf.now() - timeStartReadLoop);
        const insideTime = numberFormat4.format(timeInsideReadLoop);
        console.log(`Total time for reading and processing file: ${totalTime}ms.`);
        console.log(`Thereof time spent in custom processing code: ${insideTime}ms.`);
        console.log(`Dataset has ${numberFormatFull.format(lineCount)} entries.`);
        const numUniqueNames = numberFormatFull.format(Object.keys(nameCounts).length);
        console.log(`${numUniqueNames} different names found in file.`);
        const name = maxEntry(nameCounts);
        const nameOccurrences = numberFormatFull.format(nameCounts[name]);
        console.log(`The most common name is '${name}' with ${nameOccurrences} occurrences.`);
    });
    console.log(`Getting started...`);
}
// runtime flags like --max-old-space-size end up in process.execArgv, not process.argv,
// so the program's own arguments start at index 2 (after the node binary and the script path).
main(process.argv.slice(2));