@nebuta
Created November 8, 2011 07:48
Make a dictionary for ASCII
require 'rubygems'
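# Byte-bigram counts: index = first_byte * 256 + second_byte.
$vectorascii = Array.new(65536, 0)
# Directory the script was started from; the output file is written there.
$pwd = ""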
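# Scale the count vector so that its Euclidean length equals norm.
def normalize
  norm = 65536
  sqsum = $vectorascii.inject(0) { |sum, e| sum + e * e }
  factor = Math.sqrt(sqsum)
  return if factor.zero? # nothing was counted; avoid dividing by zero
  $vectorascii.map! { |e| e.to_f * norm / factor }
end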
# Write the vector as a 256 x 256 tab-separated table, one row per first byte.
def print_vector
  Dir.chdir($pwd)
  File.open("vector_ascii.txt", 'w') { |out|
    for i in 0..255
      start = i * 256
      row = $vectorascii[start, 256]
      out.puts(row.map { |e| "%.5f" % e }.join("\t"))
    end
  }
end
$filecount = 0
# Count every pair of adjacent bytes in text.
def analyze(text)
  $filecount += 1
  b = text.unpack('C*')
  for i in 0..(b.length - 2)
    $vectorascii[b[i] * 256 + b[i + 1]] += 1
  end
end
# Read one RFC file as raw bytes and add its bigrams to the counts.
def read_rfc(file)
  text = File.binread(file)
  analyze(text)
end
# Count bigrams over every rfc*.txt in ./rfcdatabase, then normalize and save.
def main
  $pwd = Dir.pwd
  Dir.chdir("rfcdatabase")
  Dir.glob("rfc*.txt").each { |file|
    $stderr.puts File.basename(file)
    read_rfc(file)
  }
  puts "Total count: " + $vectorascii.inject(0) { |sum, e| sum + e }.to_s
  normalize
  print_vector
end

main
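
The counting and normalization steps in the script above can be tried on a single string without the `rfcdatabase` directory. The following is a minimal sketch, not part of the original gist: `bigram_vector` is a hypothetical helper name, and it returns a fresh array instead of updating the `$vectorascii` global. It assumes the same layout as the script (index = first byte * 256 + second byte) and the same target length of 65536.

```ruby
# Hypothetical helper (not in the original gist): count adjacent byte pairs
# in text and return a 65536-element vector scaled to Euclidean length norm.
def bigram_vector(text, norm = 65536)
  counts = Array.new(65536, 0)
  # each_cons(2) yields every pair of adjacent bytes [a, b].
  text.bytes.each_cons(2) { |a, b| counts[a * 256 + b] += 1 }
  length = Math.sqrt(counts.inject(0) { |sum, c| sum + c * c })
  # With fewer than two bytes there are no bigrams; return all zeros.
  return counts.map(&:to_f) if length.zero?
  counts.map { |c| c * norm / length }
end

vec = bigram_vector("abab")
index = "a".ord * 256 + "b".ord # position of the bigram "ab"
puts vec[index]
```

Returning a new array keeps the helper free of shared state, which makes it easier to compare the vectors of two different texts side by side.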