// data comes from here http://stat-computing.org/dataexpo/2009/the-data.html
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of leveldb's batch API (byte-stream does this)

var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')

var count = 0
var wbs = 1024 * 1024 * 16

var db = level('data.db', {writeBufferSize: wbs}, function(){
  var batcher = byteStream(wbs)
  fs.createReadStream('1994.csv')
    .pipe(split())
    .pipe(batcher)
    .on('data', function(lines) {
      var batch = db.batch()
      for (var i = 0; i < lines.length; i++) {
        batch.put(count, lines[i])
        count++
      }
      batch.write(batcher.next.bind(batcher))
    })
})
@polotek that is one optimization, yes, and another is parsing lines without ever creating JS objects (but instead keeping them as binary the entire time)
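As a rough illustration of that point, here is a minimal sketch (not part of the gist; it just reuses the 1994.csv filename from above): binary-split hands each line back as a raw Buffer slice, so nothing is ever decoded into a string or parsed into a JS object unless you choose to do it yourself.

var fs = require('fs')
var split = require('binary-split')

fs.createReadStream('1994.csv')   // any newline-delimited file will do
  .pipe(split())                  // splits the byte stream on '\n'
  .on('data', function (line) {
    // `line` is a Buffer, not a string: no decoding, no object allocation
    console.log(Buffer.isBuffer(line), line.length, 'bytes')
  })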
Silly optimization, but I bet you can squeeze more perf out of this by changing line 24 to: for (var i = 0, l = lines.length; i < l; i++) {
@joeybaker: V8 already does that optimisation for you.
nice one. thanks for sharing. didn't know about byte-stream or binary-split.
@aheckmann I wrote them this week :D
Well done Max!
OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5
time node gist.js
66.84 real 91.93 user 4.09 sys
I just did a bigger import, all of the 1990s data.
cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv
it results in a 52,694,400 line file (5.18GB of csv), and running the above script on it takes 11m4.321s and produces a 2.33GB leveldb folder
have you tested how this behaves in relation to key size? i'm going to test tomorrow but i was just wondering.
This is cool. I'm assuming the clever part is separating the lines to write into batches of 16 MB or less? That's what batcher does, right?
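For illustration, here is a hedged sketch of that batching step in isolation (not part of the gist; the rows written at the bottom are invented for the example), based on how the code above uses byte-stream: the limit is the same 16 MB write buffer size, incoming line Buffers are collected into arrays whose combined size stays under that limit, each array is emitted as a single 'data' event, and the next batch is held back until batcher.next() is called.

var byteStream = require('byte-stream')

var wbs = 1024 * 1024 * 16            // same 16 MB limit the gist passes to level
var batcher = byteStream(wbs)

batcher.on('data', function (lines) {
  // `lines` is an array of Buffers whose combined byte length is capped by wbs
  console.log('got a batch of', lines.length, 'lines')
  // pretend the batch was written somewhere async, then ask for the next one
  setImmediate(function () {
    batcher.next()
  })
})

// in the gist these Buffers come out of binary-split; here they are made up
for (var i = 0; i < 1000; i++) {
  batcher.write(Buffer.from('row,' + i + ',some,csv,fields'))
}
batcher.end()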