@stephenmac7
Last active September 11, 2015 00:58
Lemma Frequency
# Gem Depends: ve, docopt
# System Depends: mecab, mecab-ipadic-utf-8
require 'csv'
require 've'
require 'docopt'

doc = <<DOCOPT
Lemma Frequency Report.

Usage:
  #{__FILE__} [options] FILE ...
  #{__FILE__} -h | --help
  #{__FILE__} --version

Options:
  -h --help      Show this screen.
  -m --morpheme  Target morphemes, instead of lexemes.
  --version      Show version.

DOCOPT
def main(opt)
  # Input from args, UTF-8 required
  contents = ''
  opt['FILE'].each do |f|
    f = '/dev/stdin' if f == '-'
    contents << File.read(f)
  end

  # Pre-processing. We need to give mecab bite-sized pieces, because pipes
  # can't handle big sizes and ve uses pipes.
  lines = remove_rubies(contents).split

  # Process the text and count lemmas; this might take a while
  freq = calculate_frequency(lines, opt['--morpheme'])

  # Show count
  show_count(freq)
end
def calculate_frequency(lines, morpheme)
  # Creates a hash with the frequency for all the lines
  lines.reduce(Hash.new(0)) do |freq, line|
    ve_line = filter_blacklisted(Ve.in(:ja).words(line))
    get_frequency_hash(ve_line, morpheme, freq)
  end
end
def remove_rubies(text)
  # For Aozora Bunko text as input, rubies need to be removed.
  # Non-greedy match, so several rubies on one line don't swallow the text
  # between them.
  text.gsub(/《.*?》/, '')
end
# For morpheme operations, it would be much faster to use mecab directly
def get_frequency_hash(words, morpheme, freq = Hash.new(0))
  words.each do |word|
    unless word.lemma == '*' # if the lemma could not be found, don't count it
      if morpheme
        word.tokens.each do |token|
          index = [token[:lemma], token[:pos]]
          freq[index] += 1
        end
      else
        index = [word.lemma, word.part_of_speech.name]
        freq[index] += 1
      end
    end
  end
  freq
end
def filter_blacklisted(words)
  pos_blacklist = [Ve::PartOfSpeech::Symbol, Ve::PartOfSpeech::ProperNoun]
  words.reject { |word| pos_blacklist.include?(word.part_of_speech) }
end
def show_count(counts)
  counts.sort_by { |_, count| count }.reverse.each do |ind, count|
    print [count, ind.first, ind.last].to_csv
  end
end
if __FILE__ == $0
  begin
    main Docopt::docopt(doc, version: '0.0.1')
  rescue Docopt::Exit => e
    puts e.message
  end
end

fasiha commented Jun 26, 2015

Awesome! I put 坊ちゃん through it. Well, I'm trying to; it's been doing something for a few minutes. CPU and memory are all nominal, and MeCab itself can chew through the entire file in about a quarter-second, so why do you think freq.rb takes so much longer? Spawning a process for each line? Slow Ve logic? Expensive histograms?
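One way to narrow that down might be a quick timing sketch along these lines (a sketch only, assuming the ve gem and a mecab binary on PATH; the sentence and iteration count are just placeholders):

    require 'benchmark'
    require 've'

    line = '吾輩は猫である。' # placeholder sentence

    Benchmark.bm(6) do |x|
      # ve's full parse, including its own call out to mecab
      x.report('ve')    { 100.times { Ve.in(:ja).words(line) } }
      # raw mecab, spawning a fresh process each time, to gauge spawn overhead
      x.report('mecab') { 100.times { IO.popen('mecab', 'r+') { |io| io.puts(line); io.close_write; io.read } } }
    end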

Besides speed, I'd love to be able to store not just the lemmas but the part of speech as well, at least until we establish how many distinct parts of speech the same lemma can have. I'd also like histograms from raw MeCab as well as from Ve: for some applications it's better to have morpheme-level granularity, even if many of the morphemes are, say, used in conjugations.

On a larger scale, if you wanted to run this on each file in a whole directory structure, in order to later generate composite histograms of subsets of files, what would be the ideal way to store the per-file data? Would it still be TSV, as it is now? Would there be any benefit in storing the results in a database, even something light like SQLite? I ask because I could generate a histogram report for each file in a directory tree, and then load subsets of them into node or Python for combining—if I wanted to do that many times, with different subsets of files, I wonder if a more heavyweight solution is appropriate. Well, one way to find out: let's go download a bunch of text files and measure some histograms!


fasiha commented Jun 26, 2015

I should have noted that Ve includes the MeCab results for each Ve lemma in a tokens member, so we do have easy access to prettified MeCab output.
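For example, something like this should print both the Ve lemma and the underlying MeCab fields for each word (a sketch, using only the accessors the script above already relies on: lemma, part_of_speech, and the token hashes' :lemma and :pos keys):

    require 've'

    Ve.in(:ja).words('何を言っているのですか。').each do |word|
      puts "#{word.lemma} (#{word.part_of_speech.name})"
      word.tokens.each do |token|
        # each token hash carries the raw MeCab output for that morpheme
        puts "  #{token[:lemma]} / #{token[:pos]}"
      end
    end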

stephenmac7 (Author) commented

So, I did some profiling, and ve is the bottleneck for speed. However, the issue you ran into was that the file was larger than what a pipe can handle (ve uses a pipe to feed text to mecab). For the moment, I've fixed the issue by feeding mecab line by line, which doesn't seem to add much overhead. I've also added an option to use the mecab lemmas and parts of speech.

Also, it no longer combines identical lemmas that have different parts of speech. This did in fact change some of the counts; for example, with your file, the の lemma lost a few hundred from its count.

The output is now CSV, both for easier processing and because, once part of speech was added, I realized it would be impractical to line up double-width Unicode columns on a terminal screen. Maybe we can add some sort of pretty printing later.

So, you should be able to run the program on that book in a few seconds (up to 10, I would say). Any longer, and something has gone wrong.

Also, as you may notice in the comment in the file, if we're only counting morphemes, it would be better to just directly access mecab and completely skip ve. If that's going to be a commonly used feature, I'll make sure that happens. However, I would like to ask the author of ve if he can figure out how to fix the piping issue first.
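For reference, a rough sketch of that direct route might look like the following. It assumes the mecab binary is on PATH and ipadic's default output, where each line is the surface form, a tab, and a comma-separated feature list whose seventh field is the lemma; writing the text to a temporary file also sidesteps the pipe-size problem entirely.

    require 'tempfile'

    # Count [lemma, pos] pairs straight from mecab, skipping ve.
    def mecab_frequency(text, freq = Hash.new(0))
      Tempfile.create('freq-mecab') do |tmp|
        tmp.write(text)
        tmp.flush
        IO.popen(['mecab', tmp.path]) do |io|
          io.each_line do |out|
            out = out.chomp
            next if out == 'EOS'
            _surface, feature = out.split("\t", 2)
            next unless feature
            fields = feature.split(',')
            pos    = fields[0]
            lemma  = fields[6]
            freq[[lemma, pos]] += 1 unless lemma.nil? || lemma == '*'
          end
        end
      end
      freq
    end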

To store the data, redirect the output and save it as CSV; just about any programming language has easy facilities for reading that untyped data. For now, this should be fine. At the moment, I don't see any significant benefit to storing the results in a database. If you have a good reason, I'd be glad to implement it.
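For example, merging several per-file reports into one composite histogram could be just a few lines (a sketch; the file names are hypothetical, and the rows are the count,lemma,pos columns this script prints):

    require 'csv'

    composite = Hash.new(0)
    %w[botchan.csv kokoro.csv].each do |path| # hypothetical report files
      CSV.foreach(path) do |count, lemma, pos|
        composite[[lemma, pos]] += count.to_i
      end
    end

    composite.sort_by { |_, count| -count }.each do |(lemma, pos), count|
      print [count, lemma, pos].to_csv
    end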

Concerning overall speed, if your data is going to be huge, we might need to get ve optimized (and fixed), or rewrite it in a more efficient language or a more efficient way. However, if the current speed is fine, then I would suggest against redoing work that has already been done :).
