Last active May 9, 2021 02:20
fast loading of a large dataset into leveldb
// data comes from here
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of leveldb's batch API (byte-stream does this)
var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')
var count = 0
var wbs = 1024 * 1024 * 16
var db = level('data.db', {writeBufferSize: wbs}, function () {
  var batcher = byteStream(wbs)
  batcher.on('data', function (lines) {
    // pause the incoming stream while this batch is written out
    batcher.pause()
    var batch = db.batch()
    for (var i = 0; i < lines.length; i++) {
      batch.put(count, lines[i])
      count++
    }
    batch.write(function () {
      batcher.resume()
    })
  })
  fs.createReadStream('1994.csv').pipe(split()).pipe(batcher)
})
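The byte-limit grouping that byte-stream performs can be sketched with a plain function (a hypothetical helper for illustration, not the module's API): accumulate Buffers until the next one would push the group past the limit, then start a new group.

```javascript
// Hypothetical helper illustrating byte-limit batching; the real
// byte-stream module does this incrementally as a transform stream.
function groupByBytes(buffers, limit) {
  var batches = []
  var current = []
  var size = 0
  buffers.forEach(function (buf) {
    if (size + buf.length > limit && current.length > 0) {
      batches.push(current) // current group is full, start a new one
      current = []
      size = 0
    }
    current.push(buf)
    size += buf.length
  })
  if (current.length > 0) batches.push(current)
  return batches
}

var rows = [Buffer.from('a,1'), Buffer.from('b,2'), Buffer.from('c,3')]
var batches = groupByBytes(rows, 7) // 3-byte rows, 7-byte limit
// first two rows fit in one group, the third starts a new one
```

In the gist the limit is the same 16 MB as leveldb's write buffer, so each flushed group maps onto one `db.batch()` write.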

polotek commented Sep 13, 2013

This is cool. I'm assuming the clever part is separating the lines to write into batches of 16MB or less? That's what batcher does, right?


@polotek that is one optimization, yes, and another is parsing lines without ever creating JS objects (but instead keeping them as binary the entire time)
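The "stay binary" idea can be seen with Node core alone (a rough sketch; binary-split does the same thing incrementally across stream chunks): split a Buffer on newlines using slices, so every row stays a Buffer and no per-row strings are allocated.

```javascript
// Sketch: split a Buffer into line Buffers without making strings.
function splitLines(buf) {
  var lines = []
  var start = 0
  var idx
  while ((idx = buf.indexOf(10, start)) !== -1) { // 10 = '\n'
    lines.push(buf.slice(start, idx)) // slices share memory, no copies
    start = idx + 1
  }
  if (start < buf.length) lines.push(buf.slice(start))
  return lines
}

var lines = splitLines(Buffer.from('1994,1,7\n1994,1,8\n1994,1,9'))
// every element is still a Buffer; no JS strings were created per row
```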


Silly optimization, but I bet you can squeeze more perf out of this by changing line 24 to: for (var i = 0, l = lines.length; i < l; i++) {


@joeybaker: V8 already does that optimisation for you.


nice one. thanks for sharing. didn't know about byte-stream or binary-split.


@aheckmann I wrote them this week :D


Well done Max!


OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5

time node gist.js 
       66.84 real        91.93 user         4.09 sys


I just did a bigger import, all of the 1990s data.

cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv

it results in a 52,694,400 line file (5.18GB csv) and takes 11m4.321s to run the above script, which results in a 2.33GB leveldb folder
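Back-of-the-envelope throughput from those numbers (my arithmetic, not stated in the gist):

```javascript
// 52,694,400 rows imported in 11m4.321s
var rows = 52694400
var seconds = 11 * 60 + 4.321 // 664.321 s
var rowsPerSecond = rows / seconds // roughly 79k rows/s
// and ~8 MB/s of csv input: 5.18 GB over the same 664 s
```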


soldair commented Sep 19, 2013

have you tested how this behaves in relation to key size? i'm going to test tomorrow but i was just wondering.
