@satoryu
Forked from hyoshiok/count_utf8.rb
Created April 16, 2012 15:11
This script guesses the encoding of each message body in a CSV file; if a body is not ASCII, it counts the UTF-8 (non-ASCII) characters in the body and computes the ratio utf8 / size.
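The ratio described above can be sketched for a single string as follows. This is a simplified illustration, not the script itself: `non_ascii_ratio` is a hypothetical helper name, it counts only code points above 127 (the script also counts characters below 33), and it skips the NKF encoding guess entirely.

```ruby
# Hypothetical helper (not part of count_utf8.rb): percentage of characters
# in +text+ whose code point lies outside the 7-bit ASCII range.
def non_ascii_ratio(text)
  return 0.0 if text.nil? || text.empty?

  # Assumption: "utf8" characters are approximated as code points above 127.
  non_ascii = text.each_char.count { |c| c.ord > 127 }
  non_ascii * 100.0 / text.size
end

puts non_ascii_ratio("hello")            # ASCII only, prints 0.0
puts non_ascii_ratio("\u65E5\u672Cab")   # two Japanese characters plus "ab", prints 50.0
```

Counting characters rather than bytes keeps the numerator and the denominator (`text.size`) in the same unit on Ruby 1.9 and later.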
# encoding: utf-8
=begin :rdoc
= count_utf8.rb
This script guesses the encoding of each message body in a CSV file and prints the ratio of UTF-8 characters to the size of the message.
The format of the CSV file is the following:
id, replied_to_id, thread_id, timestamp, group, recipients, private, user_id, user, email_address, body, url, attachment_ids
= usage
ruby count_utf8.rb <csv-file>
= output
[+group+] group is created
[+joined+] member joined
[+ASCII+] message is written in ASCII
[+UTF8+] message is written in UTF8. It is likely to be written in Japanese
Author :: hirotaka.yoshioka@mail.rakuten.com
Date :: 4/16/2012 created
= history
4/16/2012 forked from code_utf8.rb
=end
require 'nkf'
require 'csv'
CODES = {
  NKF::UNKNOWN => "UNKNOWN(ASCII)",
  NKF::JIS => "JIS",
  NKF::EUC => "EUC",
  NKF::SJIS => "SJIS",
  NKF::BINARY => "BINARY",
  NKF::ASCII => "ASCII",
  NKF::UTF8 => "UTF8"
}
hash = Hash.new(0) # counts per category: "group", "joined", or an encoding name
begin
  rows = CSV.open(ARGV[0], 'r')
  rows.shift # the first line is header, so it is omitted
  rows.each do |row|
    body = row[10]
    next if body.nil? # skip rows without a message body
    case body
    when /\[Group/
      hash["group"] += 1
    when /\[Tag.*joined\]/
      hash["joined"] += 1
    else
      code = CODES.fetch(NKF.guess(body))
      hash[code] += 1
      if code != "ASCII"
        # count characters outside the range 33..127
        # (non-ASCII characters, plus spaces and control characters)
        u = body.each_char.count { |c| c.ord < 33 || c.ord > 127 }
        puts " utf8.size, body.size, ratio #{u}, #{body.size}, #{u * 100.0 / body.size}"
      end
    end
  end
rescue => e
  # report the error instead of silently ignoring it
  warn "#{e.class}: #{e.message}"
ensure
  rows.close if rows
  # the +1 in the denominator guards against an empty result set
  puts "English/All = #{hash["ASCII"] * 100.0 / (hash["UTF8"] + hash["ASCII"] + 1)}"
  hash.each do |key, value|
    puts "#{key}\t#{value}"
  end
end