@jewel12
Created October 19, 2010 13:38
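require 'tokyocabinet'
include TokyoCabinet

# Maximum n-gram length; counts are collected for 1-grams through N-grams.
N = 5

# Counts n-gram frequencies in an on-memory Tokyo Cabinet abstract database.
class NgramCounter
  # f_name is accepted but unused: the database lives in memory only.
  def initialize(f_name)
    @tcdb = TokyoCabinet::ADB::new
    @tcdb.open("*") # on-memory hash database
  end

  # Increment the count stored under key by one.
  def inc( key )
    @tcdb.addint(key, 1)
  end

  # Write every "key : count" pair to the IO object f.
  def print_all_freq( f )
    keys = @tcdb.keys
    keys.each do |key|
      # addint stores the count as a native int, so decode it with 'i'.
      f.puts "#{key} : #{@tcdb[key].unpack('i').first}"
    end
  end

  def close_db
    @tcdb.close
  end
end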
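if __FILE__ == $0
  abort "usage: #{$0} FILE" if ARGV.empty?
  f_name = ARGV[0].chomp

  # ngram_counter[n] holds the counts for (n+1)-grams.
  ngram_counter = []
  N.times{ |i| ngram_counter << NgramCounter.new(i.to_s) }

  # Read line by line; File.foreach closes the file when it finishes.
  File.foreach( f_name ) do |line|
    data = line.split("\s") # "\s" is a single space, so this splits on runs of whitespace
    data.size.times do |i|
      N.times do |n|
        # Skip windows that would run past the end of the line.
        ngram_counter[n].inc( data[i..(i+n)].join("\s") ) unless (i+n) >= data.size
      end
    end
  end

  # Write one output file per n-gram length, then release the database.
  N.times do |n|
    open("#{f_name}.ngram.#{n+1}", 'w') do |f|
      ngram_counter[n].print_all_freq( f )
    end
    ngram_counter[n].close_db
  end
end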
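
The sliding-window counting in the script above can also be sketched with plain Ruby Hashes, which may help readers who do not have Tokyo Cabinet installed. This is only an illustration under stated assumptions, not the gist author's method: `count_ngrams` is a hypothetical helper name, and it keeps one in-process Hash per n-gram length in place of an ADB database.

```ruby
# Hypothetical helper (not part of the gist): count 1..max_n-grams of
# whitespace-separated tokens using plain Hashes instead of Tokyo Cabinet.
def count_ngrams(lines, max_n)
  # counts[n] maps each (n+1)-gram string to its frequency.
  counts = Array.new(max_n) { Hash.new(0) }
  lines.each do |line|
    tokens = line.split(" ")
    tokens.each_index do |i|
      max_n.times do |n|
        next if i + n >= tokens.size # window would run past the end of the line
        counts[n][tokens[i..(i + n)].join(" ")] += 1
      end
    end
  end
  counts
end

# Usage: print unigram and bigram counts for one sample line.
counts = count_ngrams(["a b a b"], 2)
counts.each do |table|
  table.each { |gram, freq| puts "#{gram} : #{freq}" }
end
```

A Hash keeps everything in the Ruby heap, so this sketch trades the database's compact storage for having no external dependency.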