@aisrael
Last active June 19, 2018 07:21
require "csv"
require "gzip"
MINIMUM_TIME = Time.parse_rfc3339("2017-11-02T00:00:00.000Z")
# process_file streams a gzip-compressed CSV file and counts its rows, returning the total
# and the number of rows whose timestamp (fourth column) is later than MINIMUM_TIME.
def process_file(filename : String) : NamedTuple(total: Int32, matched: Int32)
  puts "Processing: #{filename}"
  total = 0
  matched = 0
  File.open(filename) do |file|
    Gzip::Reader.open(file) do |gzip|
      CSV.each_row(gzip) do |row|
        time = Time.parse_rfc3339(row[3])
        matched += 1 if time > MINIMUM_TIME
        total += 1
      end
    end
  end
  {total: total, matched: matched}
end
START_TIME = Time.now
total = 0
matched = 0
Dir.glob("./testdata/*.csv.gz") do |filename|
  result = process_file(filename)
  total += result[:total]
  matched += result[:matched]
end
END_TIME = Time.now
TOTAL_TIME = END_TIME - START_TIME
printf "Total: %d, Matched: %d, Ratio: %0.2f%%\n", total, matched, (matched.to_f * 100.0 / total.to_f)
puts "Time: #{TOTAL_TIME}"