@max-mapper
Last active May 9, 2021 02:20
fast loading of a large dataset into leveldb
// data comes from here http://stat-computing.org/dataexpo/2009/the-data.html
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of leveldb's batch API (byte-stream does this)
var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')
var count = 0
var wbs = 1024 * 1024 * 16
var db = level('data.db', { writeBufferSize: wbs }, function () {
  var batcher = byteStream(wbs)
  fs.createReadStream('1994.csv')
    .pipe(split())   // split the stream into lines, leaving them as binary
    .pipe(batcher)   // group lines into ~16 MB chunks
    .on('data', function (lines) {
      var batch = db.batch()
      for (var i = 0; i < lines.length; i++) {
        batch.put(count, lines[i])
        count++
      }
      // when the batch has been written, ask the batcher for the next chunk
      batch.write(batcher.next.bind(batcher))
    })
})
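(Not part of the gist: to sanity-check the import afterwards, a minimal read-back sketch, assuming level's standard createReadStream API and the data.db created above.)

var level = require('level')
var db = level('data.db')

// keys were written as the numeric counter, which level stores as strings,
// so they come back in lexicographic order: '0', '1', '10', '100', ...
db.createReadStream({ limit: 5 })
  .on('data', function (row) {
    console.log(row.key, '->', row.value)
  })
  .on('end', function () {
    db.close()
  })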
@polotek

polotek commented Sep 13, 2013

This is cool. I'm assuming the clever part is separating the lines to write into batches of 16 MB or less? That's what batcher does, right?

@max-mapper

@polotek that is one optimization, yes, and another is parsing lines without ever creating JS objects (instead keeping them as binary the entire time)
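(For contrast, a sketch of the naive version those two optimizations replace: decode every line to a JS string and issue one put per row, with no batching and no backpressure. The split module here is an illustrative stand-in, not something from the gist.)

var level = require('level')
var split = require('split')   // string-based line splitter, for illustration only
var fs = require('fs')

var db = level('slow.db')
var count = 0

fs.createReadStream('1994.csv')
  .pipe(split())                 // every line becomes a JS string
  .on('data', function (line) {
    // one write per row, no batching, no backpressure
    db.put(count++, line, function (err) {
      if (err) throw err
    })
  })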

@joeybaker

Silly optimization, but I bet you can squeeze more perf out of this by changing line 24 to: for (var i = 0, l = lines.length; i < l; i++) {

@baudehlo

@joeybaker: V8 already does that optimisation for you.

@aheckmann

nice one. thanks for sharing. didn't know about byte-stream or binary-split.

@max-mapper

@aheckmann I wrote them this week :D

@eugeneware

Well done Max!

@heapwolf

OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5

time node gist.js 
       66.84 real        91.93 user         4.09 sys

@max-mapper

I just did a bigger import, all of the 1990s data.

cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv

it results in a 52,694,400-line file (5.18GB csv); the above script takes 11m4.321s to import it and produces a 2.33GB leveldb folder

@soldair

soldair commented Sep 19, 2013

have you tested how this behaves in relation to key size? i'm going to test tomorrow but i was just wondering.
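(One way to poke at that, as a sketch rather than anything from this thread: zero-pad the counter to a fixed key width, swap it into the batch.put call above, and compare wall-clock time and db size across a few widths.)

// zero-pad the counter so every key is `width` bytes long
function makeKey (n, width) {
  var s = String(n)
  while (s.length < width) s = '0' + s
  return s
}

// makeKey(42, 16) === '0000000000000042'
// e.g. replace batch.put(count, lines[i]) with batch.put(makeKey(count, 64), lines[i])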
