
@maxogden maxogden/index.js
Last active Mar 20, 2018

fast loading of a large dataset into leveldb
// data comes from here http://stat-computing.org/dataexpo/2009/the-data.html
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of leveldb's batch API (byte-stream does this)
var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')

var count = 0
var wbs = 1024 * 1024 * 16 // 16 MB write buffer, also used as the batch size
var db = level('data.db', {writeBufferSize: wbs}, function () {
  var batcher = byteStream(wbs)
  fs.createReadStream('1994.csv')
    .pipe(split())
    .pipe(batcher)
    .on('data', function (lines) {
      var batch = db.batch()
      for (var i = 0; i < lines.length; i++) {
        batch.put(count, lines[i])
        count++
      }
      // byte-stream waits after emitting a batch; calling .next() resumes the flow
      batch.write(batcher.next.bind(batcher))
    })
})
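Not part of the original gist, but a minimal sketch for sanity-checking the import: it streams the first few entries back out of the data.db produced above (keys are the sequential counters, values are the raw CSV lines).

var level = require('level')

var db = level('data.db', function () {
  db.createReadStream({ limit: 5 })
    .on('data', function (entry) {
      // keys are the sequential counters, values are raw CSV lines
      console.log(entry.key.toString(), entry.value.toString())
    })
    .on('end', function () {
      db.close()
    })
})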
@polotek commented Sep 13, 2013

This is cool. I'm assuming the clever part is separating the lines to write into batches of 16 MB or less? That's what batcher does, right?

@maxogden (author) commented Sep 13, 2013

@polotek that is one optimization, yes, and another is parsing lines without ever creating JS objects (but instead keeping them as binary the entire time)
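A minimal sketch of that second point, assuming binary-split's documented behavior of emitting each line as a Buffer rather than a decoded string:

var split = require('binary-split')
var fs = require('fs')

fs.createReadStream('1994.csv')
  .pipe(split())
  .on('data', function (line) {
    // each line arrives as raw bytes, never decoded into a JS string
    console.log(Buffer.isBuffer(line)) // true
  })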

@joeybaker commented Sep 13, 2013

Silly optimization, but I bet you can squeeze more perf out of this by changing line 24 to: for (var i = 0, l = lines.length; i < l; i++) {

@baudehlo commented Sep 13, 2013

@joeybaker: V8 already does that optimisation for you.
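A quick-and-dirty way to check that claim yourself (illustrative only; timings will vary by V8 version):

// build a large array, then time the loop with .length in the condition
// versus the hoisted variant from the comment above
var lines = []
for (var n = 0; n < 1e7; n++) lines.push('x')

console.time('length in condition')
for (var i = 0; i < lines.length; i++) {}
console.timeEnd('length in condition')

console.time('length hoisted')
for (var j = 0, l = lines.length; j < l; j++) {}
console.timeEnd('length hoisted')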

@aheckmann commented Sep 13, 2013

nice one. thanks for sharing. didn't know about byte-stream or binary-split.

@maxogden (author) commented Sep 13, 2013

@aheckmann I wrote them this week :D

@eugeneware commented Sep 14, 2013

Well done Max!

@heapwolf commented Sep 14, 2013

OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5

time node gist.js 
       66.84 real        91.93 user         4.09 sys
@maxogden (author) commented Sep 14, 2013

I just did a bigger import, all of the 1990s data.

cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv

It results in a 52,694,400-line file (a 5.18GB CSV) and takes 11m4.321s to run the above script, which produces a 2.33GB leveldb folder.

@soldair commented Sep 19, 2013

Have you tested how this behaves in relation to key size? I'm going to test tomorrow, but I was just wondering.
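One way to probe that before running real data: import the same values under short and long keys and compare timings. Everything here (the db locations, counts, and the timedImport helper) is hypothetical, not from the gist:

var level = require('level')

// write identical values under keys produced by makeKey, time the batch
// write, and hand the elapsed milliseconds to done
function timedImport (location, makeKey, done) {
  var db = level(location, function () {
    var batch = db.batch()
    for (var i = 0; i < 100000; i++) {
      batch.put(makeKey(i), 'some,csv,like,row,of,data')
    }
    var start = Date.now()
    batch.write(function () {
      db.close(function () {
        done(Date.now() - start)
      })
    })
  })
}

timedImport('short-keys.db', String, function (ms) {
  console.log('short keys:', ms + 'ms')
  timedImport('long-keys.db', function (i) {
    // pad keys out to roughly 64 bytes to exaggerate any key-size effect
    return 'key-' + i + new Array(60).join('x')
  }, function (ms) {
    console.log('long keys:', ms + 'ms')
  })
})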
