Skip to content

Instantly share code, notes, and snippets.

@igrigorik
Created April 20, 2012 06:34
Show Gist options
  • Star 11 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save igrigorik/2426614 to your computer and use it in GitHub Desktop.
Save igrigorik/2426614 to your computer and use it in GitHub Desktop.
load githubarchive data into sqlite3 database
#
# $> ruby load.rb http://data.githubarchive.org/2012-04-01-15.json.gz
#
require 'yajl'
require 'zlib'
require 'sqlite3'
require 'open-uri'
input = ARGV.shift
# create the SQLite table schema
@schema = open('https://raw.github.com/igrigorik/githubarchive.org/master/bigquery/schema.js')
@schema = Yajl::Parser.parse(@schema.read)
@keys = @schema.map {|r| r['name']}
# map GitHub JSON schema to flat CSV space based
# on provided Big Query column schema
def flatmap(h, e, prefix = '')
e.each do |k,v|
if v.is_a?(Hash)
flatmap(h, v, prefix+k+"_")
else
key = prefix+k
next if !@keys.include? key
case v
when TrueClass then h[key] = 1
when FalseClass then h[key] = 0
else
next if v.nil?
h[key] = v unless v.is_a? Array
end
end
end
h
end
# Create table schema
create_table = "create table if not exists events ( \n"
@schema.each do |column|
create_table += case column['type']
when 'INTEGER', 'BOOLEAN'
"#{column['name']} integer, \n"
when 'STRING'
"#{column['name']} text, \n"
end
end
create_table = create_table.chomp(", \n") + ");"
# load the data
db = SQLite3::Database.open("github.sqlite")
db.execute(create_table)
gz = open(input)
js = Zlib::GzipReader.new(gz).read
Yajl::Parser.parse(js) do |event|
row = flatmap({}, event)
keys, values = row.keys, row.values
db.execute "INSERT INTO events(#{keys.join(',')}) VALUES (#{(['?'] * keys.size).join(',')})", *values
end
# Run a query...
p db.execute("select count(*) from events")
@bagrow
Copy link

bagrow commented Dec 6, 2012

Does this script still work for the latest json schema? I think it's broken/outdated. I've been trying to figure out what's going on, but I don't know ruby. Wish py-yajl supported streams… :(

@bagrow
Copy link

bagrow commented Dec 6, 2012

Ahh there are newlines between the stream elements now, this is new and great…!

@chaconnewu
Copy link

How can I construct a mysql insert statement for this:

db.execute "INSERT INTO events(#{keys.join(',')}) VALUES (#{(['?'] * keys.size).join(',')})", *values

the mysql-ruby module seems cannot do the same.

@ricardorlg
Copy link

i run this script but it doesn't load the payload_commit info of pushevent. any suggestion?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment