@gouravtiwari
Last active December 22, 2015 10:48
Split CSVs without parsing and then parse them using Ruby threads, a continuation of http://grosser.it/2011/08/31/splitting-1-big-csv-file-into-multiple-smaller-without-parsing-it/
# http://grosser.it/2011/08/31/splitting-1-big-csv-file-into-multiple-smaller-without-parsing-it/
require 'rubygems'
require 'rake' # provides `sh`

# Split a giga-CSV into file_count smaller files without parsing it.
# Each generated file starts with a copy of the header row.
def self.split_csv(original, file_count)
  header_lines = 1
  lines = `wc -l < #{original}`.to_i - header_lines
  # Round up so the remainder rows land in a file instead of being dropped.
  lines_per_file = (lines.to_f / file_count).ceil
  header = `head -n #{header_lines} #{original}`
  start = header_lines # number of lines to skip before the next chunk
  generated_files = []
  file_count.times do |i|
    file = "#{original}-#{i}.csv"
    File.open(file, 'w') { |f| f.write header }
    # tail -n +N prints from line N onward; head then takes one chunk of rows.
    sh "tail -n +#{start + 1} #{original} | head -n #{lines_per_file} >> #{file}"
    start += lines_per_file
    generated_files << file
  end
  generated_files
end
# split and parse
files = split_csv('test.csv', 10)
threads = files.map do |file|
  # Pass the file name in as a thread argument so each thread gets its own copy.
  Thread.new(file) do |path|
    rows = 0
    IO.foreach(path) { |_line| rows += 1 }
    Thread.current['rows_count'] = rows
  end
end
# Wait for each thread, then print its line count (header line included).
threads.each { |t| t.join; print "#{t['rows_count']}, " }
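
The threads above only count raw lines. As a rough sketch of the "parse" step, the variant below runs Ruby's standard `CSV` library over each split file in its own thread and collects the per-file row counts with `Thread#value`. The helper name `count_rows_in_threads` is hypothetical and not part of the original gist. It assumes every file starts with one header row, as the files written by `split_csv` do.

```ruby
require 'csv'

# Hypothetical helper (not from the original gist): parse each file in its
# own thread and return the number of data rows found in each one.
def count_rows_in_threads(files)
  threads = files.map do |file|
    Thread.new(file) do |path|
      rows = 0
      # headers: true treats the first line as the header row,
      # so only data rows are yielded to the block.
      CSV.foreach(path, headers: true) { |_row| rows += 1 }
      rows # the block's last value becomes the thread's value
    end
  end
  # Thread#value joins the thread and returns the block's result.
  threads.map(&:value)
end
```

Calling `count_rows_in_threads(files)` with the array returned by `split_csv` yields one count per file. Two caveats: a line-based split can cut a quoted field that contains a newline in half, so this only parses cleanly when no field spans lines; and on MRI the global interpreter lock means the threads mostly overlap file I/O rather than CPU-bound parsing.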