@masao
Last active February 8, 2019 11:54
Extract terms and their frequencies from titles with publication years.
#!/usr/bin/env ruby
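require 'lingua/stemmer'
# kendall.rb is a local helper under ~/.ruby; it is expected to define `spearman`.
$:.push "#{ENV["HOME"]}/.ruby"
require "kendall.rb"
total = {} # term => total count over all years
hash = {}  # year => { term => count }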
STOPWORDS = [
  "a", "an", "and", "are", "as", "at", "be", "but", "by",
  "for", "if", "in", "into", "is", "it",
  "no", "not", "of", "on", "or", "such",
  "that", "the", "their", "then", "there", "these",
  "they", "this", "to", "was", "will", "with",
  # SIGIR-specific
  "tois"
]
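ARGF.each do |line|
  # Assumes tab-separated input: year, title, then any further fields (ignored).
  # Splitting on a single space would keep only the first word of the title.
  year, title, = line.strip.split(/\t/)
  next if year.nil? # skip blank lines so no nil year key is created
  hash[year] ||= {}
  next if title.nil?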
  # Replace punctuation with spaces before tokenizing.
  title = title.gsub(/[:\.\&\(\)\[\]\?;,]/, " ")
  title.split.each do |term|
    next if STOPWORDS.include? term.downcase
    # Skip tokens made only of digits, "-" and "+".
    next if term =~ /\A[0-9\-\+]+\Z/
    stem_term = Lingua.stemmer(term)
    stem_term = stem_term.capitalize
    total[stem_term] ||= 0
    total[stem_term] += 1
    hash[year][stem_term] ||= 0
    hash[year][stem_term] += 1
  end
end
# Sort years numerically, matching the order of the per-year counts printed below.
years = hash.keys.sort_by{|y| y.to_i }
case_year = {}
years.each_with_index do |y, index|
  case_year[index] = y
end
# Print one tab-separated row per term, most frequent first:
# term, total count, per-year counts, Spearman correlation with year.
total.keys.sort_by{|k| - total[k] }.each do |k|
  case_term = {}
  years.each_with_index do |y, index|
    case_term[index] = hash[y][k].to_i
  end
  puts [
    k,
    total[k],
    hash.keys.sort_by{|e| e.to_i }.map{|y| hash[y][k] }.join(", "),
    spearman(case_year, case_term)
  ].join("\t")
end
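
The script calls `spearman` from the author's local `kendall.rb`, which is not included here. To make the final column easier to interpret, the following is a minimal sketch of how such a function could work. It assumes both arguments are hashes sharing the same keys (year index => value), as `case_year` and `case_term` do above. The names `average_ranks` and `spearman_sketch` are hypothetical and are not the author's implementation.

```ruby
# Hypothetical stand-in for `spearman`; the real kendall.rb is not shown.

# Assign 1-based ranks; tied values get the average of their positions.
def average_ranks(values)
  sorted = values.each_with_index.sort_by { |value, _| value }
  ranks = Array.new(values.size)
  i = 0
  while i < sorted.size
    j = i
    # Extend j over the run of values equal to sorted[i].
    j += 1 while j + 1 < sorted.size && sorted[j + 1][0] == sorted[i][0]
    rank = (i + j) / 2.0 + 1
    (i..j).each { |k| ranks[sorted[k][1]] = rank }
    i = j + 1
  end
  ranks
end

# Spearman's rho computed as the Pearson correlation of the two rank vectors.
def spearman_sketch(x, y)
  keys = x.keys.sort
  rx = average_ranks(keys.map { |k| x[k] })
  ry = average_ranks(keys.map { |k| y[k] })
  n = keys.size.to_f
  mean_x = rx.sum / n
  mean_y = ry.sum / n
  cov = rx.zip(ry).sum { |a, b| (a - mean_x) * (b - mean_y) }
  var_x = rx.sum { |a| (a - mean_x)**2 }
  var_y = ry.sum { |b| (b - mean_y)**2 }
  # Rho is undefined when one side is constant; this sketch returns 0.0.
  return 0.0 if var_x.zero? || var_y.zero?
  cov / Math.sqrt(var_x * var_y)
end
```

Under this sketch, a term whose count rises every year scores close to 1, a term that falls every year scores close to -1, and a term with a flat count gets 0.0 by convention.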