Create a gist now

Instantly share code, notes, and snippets.

load githubarchive data into sqlite3 database
# $> ruby load.rb
require 'yajl'
require 'zlib'
require 'sqlite3'
require 'open-uri'
input = ARGV.shift
# create the SQLite table schema
@schema = open('')
@schema = Yajl::Parser.parse(
@keys = {|r| r['name']}
# map GitHub JSON schema to flat CSV space based
# on provided Big Query column schema
def flatmap(h, e, prefix = '')
e.each do |k,v|
if v.is_a?(Hash)
flatmap(h, v, prefix+k+"_")
key = prefix+k
next if !@keys.include? key
case v
when TrueClass then h[key] = 1
when FalseClass then h[key] = 0
next if v.nil?
h[key] = v unless v.is_a? Array
# Create table schema
create_table = "create table if not exists events ( \n"
@schema.each do |column|
create_table += case column['type']
"#{column['name']} integer, \n"
when 'STRING'
"#{column['name']} text, \n"
create_table = create_table.chomp(", \n") + ");"
# load the data
db ="github.sqlite")
gz = open(input)
js =
Yajl::Parser.parse(js) do |event|
row = flatmap({}, event)
keys, values = row.keys, row.values
db.execute "INSERT INTO events(#{keys.join(',')}) VALUES (#{(['?'] * keys.size).join(',')})", *values
# Run a query...
p db.execute("select count(*) from events")

Does this script still work for the latest json schema? I think it's broken/outdated. I've been trying to figure out what's going on, but I don't know ruby. Wish py-yajl supported streams… :(


Ahh there are newlines between the stream elements now, this is new and great…!


How can I construct a mysql insert statement for this:

db.execute "INSERT INTO events(#{keys.join(',')}) VALUES (#{(['?'] * keys.size).join(',')})", *values

the mysql-ruby module seems cannot do the same.


i run this script but it doesn't load the payload_commit info of pushevent. any suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment