@copiousfreetime
Created January 18, 2019 23:00
Using Yajl to stream parse a gzipped file
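#!/usr/bin/env ruby
# Stream-parse a gzipped file of JSON documents with Yajl and time the run with Hitimes.
require 'yajl'
require 'zlib'
require 'hitimes'

metric      = ::Hitimes::TimedValueMetric.new('yajl')
infile      = ARGV.shift
count       = 0
buffer_size = 1024 * 64 # decompressed bytes handed to the parser per read

abort "Usage: #{$0} FILE.gz" unless infile

puts "Parsing JSON from #{infile}"
::Zlib::GzipReader.open(infile) do |gz|
  metric.start

  # Reuse a single preallocated buffer instead of allocating a new String per read.
  buffer = String.new("", capacity: buffer_size)
  parser = Yajl::Parser.new

  # Called once for every complete JSON document found in the stream.
  parser.on_parse_complete = ->(obj) {
    count += 1
    # plus do something else with the obj
  }

  # Feed decompressed chunks to the parser until the gzip stream is exhausted.
  until gz.eof?
    parser << gz.readpartial(buffer_size, buffer)
  end

  metric.stop(count)
end

puts "Extracted #{metric.unit_count} records in #{metric.duration} seconds at #{metric.rate} rps"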
__END__
# Example usage
% ruby yajl-gzip-stream-parse.rb tmp/d/collected/year\=2019/month\=01/day\=15/hour\=23/2019-01-15-2323_0_1542343534-1542344980.gz
Parsing JSON from tmp/d/collected/year=2019/month=01/day=15/hour=23/2019-01-15-2323_0_1542343534-1542344980.gz
Extracted 497012.0 records in 12.178072577 seconds at 40812.04122060144 rps
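
# Generating sample input (sketch)
The script above needs a gzipped file containing a sequence of JSON documents. The sketch below is not part of the original gist. It uses only the Ruby standard library to write such a file and then reads it back in fixed-size chunks, the same way the script feeds data to Yajl. The helper names `write_sample_gzip` and `read_in_chunks`, the record fields, and the one-document-per-line layout are all assumptions made for illustration.

```ruby
require 'json'
require 'zlib'

# Hypothetical helper (not in the original gist): writes `record_count` small
# JSON objects, one per line, into a gzipped file at `path`.
def write_sample_gzip(path, record_count)
  Zlib::GzipWriter.open(path) do |gz|
    record_count.times do |i|
      gz.write(JSON.generate('id' => i, 'name' => "record-#{i}"))
      gz.write("\n")
    end
  end
  path
end

# Hypothetical helper: reads the file back in chunks of at most `buffer_size`
# decompressed bytes, reusing one buffer, and returns [chunk_count, total_bytes].
# In the script above, each chunk would be passed to `parser <<` instead.
def read_in_chunks(path, buffer_size)
  chunks = 0
  bytes  = 0
  Zlib::GzipReader.open(path) do |gz|
    buffer = String.new("", capacity: buffer_size)
    until gz.eof?
      gz.readpartial(buffer_size, buffer)
      chunks += 1
      bytes  += buffer.bytesize
    end
  end
  [chunks, bytes]
end
```

Calling `write_sample_gzip('sample.json.gz', 1000)` and then running `ruby yajl-gzip-stream-parse.rb sample.json.gz` should give the parser a small input to work on. Chunk boundaries will usually fall in the middle of a document, which is why the script relies on Yajl's `on_parse_complete` callback rather than parsing each chunk on its own.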