@hyoshiok
Created February 14, 2012 06:53
This script guesses the encoding of a csv file.
# encoding: utf-8
=begin :rdoc
= code_detect.rb
This script guesses the encoding of a csv file.
The csv file has the following columns:
0:id, 1:replied_to_id, 2:thread_id, 3:timestamp, 4:group, 5:recipients, 6:private, 7:user_id, 8:user, 9:email_address, 10:body, 11:url, 12:attachment_ids
= usage
ruby code_detect.rb <csv-file>
= output
[+group+] group is created
[+joined+] member joined
[+ASCII+] message is written in ASCII
[+UTF8+] message is written in UTF8. It is likely to be written in Japanese
Author :: hirotaka.yoshioka@mail.rakuten.com
Date :: 4/7/2011 created
= history
4/11/2011 E/J ratio
4/08/2011 add comments.
=end
require 'nkf'
require 'csv'
CODES = {
  NKF::UNKNOWN => "UNKNOWN(ASCII)",
  NKF::JIS => "JIS",
  NKF::EUC => "EUC",
  NKF::SJIS => "SJIS",
  NKF::BINARY => "BINARY",
  NKF::ASCII => "ASCII",
  NKF::UTF8 => "UTF8"
}
hash = Hash.new(0)
email = Hash.new(0)
company = Hash.new(0)
begin
  i = 0
  line = CSV.open(ARGV[0], 'r')
  line.shift # the first line is the header, so skip it
  line.each do |row|
    body = row[10]
    case body
    when /\[Group/
      hash["group"] += 1
    when /\[Tag.*joined\]/
      hash["joined"] += 1
    else
      # Fall back to "OTHER" so a guess that is missing from CODES does not raise KeyError.
      code = CODES.fetch(NKF.guess(body), "OTHER") if body != nil
      hash[code] += 1 if row[4] != "freetalkja"
    end
    address = row[9]
    email[address] += 1
    # Take the domain after "@"; to_s keeps a nil address from reusing a stale $2.
    cname = address.to_s[/(.+?)@(.+)$/, 2]
    cname = "nil" if cname == nil
    cname = "mail.rakuten.com" if cname =~ /mail\.rakuten\.co\.jp/
    company[cname] += 1
    i += 1
  end
rescue StandardError => e
  warn "error: #{e.message}" # report the failure instead of silently swallowing it
ensure
  # The +1 in the denominator avoids division by zero when no messages were counted.
  print "English/All=", "\t", hash["ASCII"] * 100.0 / (hash["UTF8"] + hash["ASCII"] + 1), "\n"
  print "Posts ", i, "\n"
  print "Posters ", email.size, "\n\n"
  hash.each do |key, value|
    print key, "\t", value, "\n"
  end
  print "\n\nCompany \n"
  company.sort.each do |key, value|
    print sprintf('%20s %5d', key, value), "\n"
  end
end
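
The company tally in the script can be tried on its own, without a csv file. The following is a minimal sketch under stated assumptions: `company_of` is a hypothetical helper name that does not appear in the original script, and the sample addresses are made up for illustration. It applies the same rules as the loop in the script: take the domain after "@", use "nil" when there is none, and fold mail.rakuten.co.jp into mail.rakuten.com.

```ruby
# Hypothetical helper (not part of the original script): maps an email
# address to the key used in the company tally.
def company_of(address)
  # to_s lets a nil address fall through to the "nil" bucket.
  domain = address.to_s[/(.+?)@(.+)$/, 2]
  return "nil" if domain.nil?
  # Treat the .co.jp mail domain as the same company as mail.rakuten.com.
  return "mail.rakuten.com" if domain =~ /mail\.rakuten\.co\.jp/
  domain
end

# Made-up sample addresses, for illustration only.
addresses = ["a@example.com", "b@mail.rakuten.co.jp", "c@mail.rakuten.com", nil]

company = Hash.new(0)
addresses.each { |address| company[company_of(address)] += 1 }

# Same report layout as the script: right-aligned name, then the count.
company.sort.each do |name, count|
  puts format('%20s %5d', name, count)
end
```

Keeping the rule in one small method makes it easy to check edge cases such as a missing address before running the script against a full export.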