@aaronbbrown
Created November 19, 2012 21:13
Fixing broken CSV data
#!/usr/bin/env ruby
require 'pp'
require 'csv'
require 'logger'
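
# Serialize an array of fields as one CSV row terminated by CRLF.
# Fields are quoted only where the CSV library considers it necessary.
def join_and_quote(a)
  a.to_csv(:row_sep => "\r\n", :quote_char => '"')
end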
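
# Parse one logical record into an array of fields. Line breaks are swapped
# for a placeholder before parsing and restored afterwards. Returns [] when
# the text cannot be parsed, e.g. while a quoted field is still unclosed.
def parse_csv(line)
  replstring = '!!!$$$!!!'
  begin
    replaced = line.gsub("\r\n", replstring).gsub("\n", replstring)
    a = CSV.parse_line(replaced, :row_sep => "\r\n", :col_sep => ',', :quote_char => '"') || []
    a.map { |x| x.is_a?(String) ? x.gsub(replstring, "\n") : x }
  rescue
    []
  end
end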

# Drop byte sequences that are invalid in UTF-8 by round-tripping the
# line through UTF-16 and back.
def fix_encoding(line)
  encoded = line.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
  encoded.encode('UTF-8', 'UTF-16')
end
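
num_fields  = nil     # expected field count, taken from the header line
this_record = []      # fields of the record currently being assembled
record_str  = nil     # raw text accumulated for the current record
badfn       = "bad.csv"
badfile     = File.open(badfn, "w")
lines       = 100000  # rotate the output file every this many source lines
file_num    = 0
base_fn     = "transactions.csv"
f           = nil     # current output file handle
logger      = Logger.new $stdout
logger.level = Logger::DEBUG
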
ARGF.each_with_index("\r\n") do |line, i|
  encoded = fix_encoding(line).rstrip

  # Start a new numbered output file every `lines` source lines.
  if (i % lines) == 0
    file_num += 1
    f.close if f
    fn = sprintf("%s.%06d", base_fn, file_num)
    logger.debug("Loaded #{i} lines. Rotating to #{fn}")
    f = File.open(fn, "w")
  end
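
  begin
    # The first line is the header: it only sets the expected field count.
    if i == 0
      num_fields = parse_csv(encoded).size
      next
    end

    # Append this line to any partial record carried over from earlier lines.
    if record_str
      record_str = record_str + "\n" + encoded
    else
      record_str = encoded
    end
    this_record = parse_csv(record_str)

    # Too many fields: log the line to the bad file. The record is still
    # written to the output below.
    if this_record.size > num_fields
      logger.warn("Found invalid record on line #{i} of source file. Logging to #{badfn}")
      badfile.print this_record.size
      badfile.print "\t"
      badfile.print encoded
      badfile.print "\r\n"
    end

    # Too few fields: keep the accumulated text and read another line.
    # The buffer is reset only after a write or an error. An `ensure` block
    # here would also run on `next` and discard the partial record.
    next unless this_record.size >= num_fields

    f.print join_and_quote(this_record)
    this_record = []
    record_str = nil
  rescue StandardError => e
    $stderr.puts e
    $stderr.puts encoded
    this_record = []
    record_str = nil
  end
end
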
f.close if f
badfile.close
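
The core idea of the script, joining physical lines until they form a complete CSV record, can be tried in isolation on a few in-memory lines. The following is a minimal sketch, not the script's own code: `try_parse` and `reassemble` are hypothetical helper names, the sample data is made up, and the sketch relies on the CSV library accepting line breaks inside quoted fields rather than on the placeholder substitution used above.

```ruby
require 'csv'

# Hypothetical helper: returns the parsed fields, or [] when the text is not
# valid CSV yet (for example, a quoted field that has not been closed).
def try_parse(text)
  CSV.parse_line(text) || []
rescue CSV::MalformedCSVError
  []
end

# Hypothetical helper: joins consecutive physical lines until the buffer
# parses into at least `num_fields` fields, then emits it as one record.
def reassemble(lines, num_fields)
  records = []
  buffer = nil
  lines.each do |line|
    buffer = buffer ? "#{buffer}\n#{line}" : line
    fields = try_parse(buffer)
    next if fields.size < num_fields  # incomplete: keep accumulating
    records << fields
    buffer = nil                      # complete: start a fresh record
  end
  records
end

# Made-up input: the first record is split across two physical lines
# because its quoted field contains a line break.
lines = ['1,"first line', 'second line",ok', '2,plain,ok']
records = reassemble(lines, 3)
records.each { |r| p r }
```

One caveat of this approach, under the assumptions above, is that a line which never completes keeps absorbing every line after it, so a production version would probably want an upper bound on how many lines may be joined.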