@max-mapper
Last active May 9, 2021 02:20
fast loading of a large dataset into leveldb
// data comes from here http://stat-computing.org/dataexpo/2009/the-data.html
// download 1994.csv.bz2 and unpack by running: cat 1994.csv.bz2 | bzip2 -d > 1994.csv
// 1994.csv should be ~5.2 million lines and 500MB
// importing all rows into leveldb took ~50 seconds on my machine
// there are two main techniques at work here:
// 1: never create JS objects, leave the data as binary the entire time (binary-split does this)
// 2: group lines into 16 MB batches, to take advantage of LevelDB's batch API (byte-stream does this)
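var level = require('level')
var byteStream = require('byte-stream')
var split = require('binary-split')
var fs = require('fs')

var count = 0 // running row number, used as the key
var wbs = 1024 * 1024 * 16 // 16 MB, used for both the write buffer and the batch size

var db = level('data.db', {writeBufferSize: wbs}, function(err) {
  if (err) throw err
  var batcher = byteStream(wbs)
  fs.createReadStream('1994.csv')
    .pipe(split()) // one Buffer per line, never converted to a string
    .pipe(batcher) // groups lines into 16 MB batches
    .on('data', function(lines) {
      var batch = db.batch()
      for (var i = 0; i < lines.length; i++) {
        batch.put(count, lines[i])
        count++
      }
      // batcher.next is called once this batch has been written
      batch.write(batcher.next.bind(batcher))
    })
})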
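
The two techniques described above can be sketched without any dependencies. This is only an illustration under stated assumptions: `splitLines` and `batchByBytes` are hypothetical helpers written for this example, not the actual binary-split or byte-stream APIs, and they work on an in-memory Buffer rather than a stream.

```javascript
// Hypothetical helpers for illustration; not the binary-split or byte-stream API.

// Technique 1: split a Buffer on newline bytes without decoding it to a string.
function splitLines(buf) {
  var lines = []
  var start = 0
  var idx
  while ((idx = buf.indexOf(10, start)) !== -1) { // 10 is the byte value of '\n'
    lines.push(buf.slice(start, idx))
    start = idx + 1
  }
  if (start < buf.length) lines.push(buf.slice(start)) // last line with no trailing newline
  return lines
}

// Technique 2: group Buffers into batches by total byte length.
// A batch only exceeds `limit` when a single item is larger than `limit`.
function batchByBytes(items, limit) {
  var batches = []
  var current = []
  var size = 0
  items.forEach(function (item) {
    if (current.length > 0 && size + item.length > limit) {
      batches.push(current)
      current = []
      size = 0
    }
    current.push(item)
    size += item.length
  })
  if (current.length > 0) batches.push(current)
  return batches
}

var lines = splitLines(Buffer.from('aaaa\nbbbb\ncc\ndddddd\n'))
var batches = batchByBytes(lines, 8) // tiny limit for the demo; the gist uses 16 MB
```

Each inner array in `batches` would correspond to one `db.batch()` in the script above, so the number of write calls depends on the total bytes rather than on the number of rows.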
@eugeneware

Well done Max!

@heapwolf

OS: Darwin 10.9
Memory: 4 GB 1600 MHz DDR3
Processor: 1.8 GHz Intel Core i5

time node gist.js 
       66.84 real        91.93 user         4.09 sys

@max-mapper
Author

I just did a bigger import, all of the 1990s data.

cat 1990.csv.bz2 1991.csv.bz2 1992.csv.bz2 1993.csv.bz2 1994.csv.bz2 1995.csv.bz2 1996.csv.bz2 1997.csv.bz2 1998.csv.bz2 1999.csv.bz2 > 1990s.csv.bz2
cat 1990s.csv.bz2 | bzip2 -d > 1990s.csv

It results in a 52,694,400-line file (5.18 GB CSV) and takes 11m4.321s to run the above script, which produces a 2.33 GB leveldb folder.

@soldair

soldair commented Sep 19, 2013

Have you tested how this behaves in relation to key size? I'm going to test tomorrow, but I was just wondering.