@mattkanwisher
Last active September 9, 2017 14:17
Extract a word frequency list from a viki.com subtitle file and output it as CSV
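# This program takes WebVTT subtitle files from sites like Viki.com, counts how
# often each word occurs, and prints the counts as CSV so they can later be
# made into an Anki deck.
require "webvtt"
require "ffi-icu"

$words = []

# Keep a token only if it contains at least one character that is not
# punctuation or whitespace. This also drops line breaks inside a cue.
def add_word(x)
  $words << x unless x =~ /\A[[:punct:][:space:]]*\z/
end

# Split a line into words with ICU's word break iterator for the Thai locale.
def split_words(line)
  iterator = ICU::BreakIterator.new(:word, "th_TH")
  iterator.text = line
  iterator.each_substring { |x| add_word(x) }
end

def extract
  abort "usage: pass the path to a .vtt subtitle file" if ARGV.empty?
  webvtt = WebVTT.read(ARGV[0])
  webvtt.cues.each do |cue|
    # Strip italic markup before splitting the cue into words.
    text = cue.text.gsub("<i>", "").gsub("</i>", "")
    split_words(text)
  end
  # Count occurrences, then print the most frequent words first.
  res = $words.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
  res.sort_by { |_word, count| -count }.each do |word, val|
    puts "#{word}, #{val}"
  end
end

extract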
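
The counting and sorting step inside `extract` can be tried on its own, without the `webvtt` or `ffi-icu` gems. The following is a minimal sketch that assumes the tokens have already been split. `word_frequencies` is a hypothetical helper name, and the alphabetical tie-break is an addition that the script does not have; it only makes the output order repeatable between runs.

```ruby
# Hypothetical helper, not part of the original script: counts tokens and
# returns [word, count] pairs, most frequent first. Words with equal counts
# are ordered alphabetically (an assumption added for stable output).
def word_frequencies(tokens)
  counts = tokens.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }
  counts.sort_by { |word, count| [-count, word] }
end

if __FILE__ == $PROGRAM_NAME
  # Sample tokens standing in for the output of the word splitter.
  sample = %w[กิน ข้าว กิน น้ำ กิน ข้าว]
  word_frequencies(sample).each do |word, count|
    puts "#{word}, #{count}"
  end
end
```

Returning the pairs instead of printing them directly keeps the count separate from the output format, which may help if the CSV layout has to change for an Anki import.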