@igaiga
Last active July 9, 2017 04:03
## Analyze Wikipedia access data
# Data format
# https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews
# Files
# https://dumps.wikimedia.org/other/pageviews/
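# Only lines that start with "ja " (i.e. Japanese Wikipedia only) are counted.
# See the Data format link above for what the ja.X domain codes mean.
## Results (top 10 station names in pageviews-20170101-000000)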
# 八丁堀駅 12
# 品川駅 9
# 赤岩駅 8
# 池袋駅 8
# 摂津市駅 7
# 下灘駅 7
# 東陽町駅 7
# 長野駅 6
# 葛西駅 5
# 高幡不動駅 5
#file_name = "sample.txt"
file_name = "pageviews-20170101-000000"
# https://dumps.wikimedia.org/other/pageviews/2017/2017-01/
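selected_data = []
File.open(file_name) do |f|
  f.each_line do |line|
    # Each line is whitespace-separated: domain_code page_title count_views total_response_size
    data = line.split
    handy_data = {domain: data[0], title: data[1], access_count: data[2]}
    # Keep only Japanese Wikipedia ("ja") pages whose title ends with 駅 (station)
    if handy_data[:domain] == "ja" && handy_data[:title] =~ /駅\z/
      selected_data << handy_data
    end
  end
end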
# Sort by access_count (ascending; counts are read as strings, so convert with to_i)
sorted_data = selected_data.sort_by do |x|
  x[:access_count].to_i
end
# Print the top 10 (highest access_count first)
sorted_data.reverse.first(10).each do |x|
  puts "#{x[:title]} #{x[:access_count]}"
end
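## Alternative sketch (not part of the original script)
# A minimal sketch of the same filter-and-rank steps, wrapped in a method so it can be
# tried on a few lines without downloading a dump. The method name top_station_pages,
# its limit: keyword, and the sample lines are made up for illustration; the sample
# counts are not real pageview data. It uses Enumerable#max_by(n) in place of
# sort_by + reverse + first, which is an assumption about what reads better here,
# not a change in the result being computed.

```ruby
# Returns the top `limit` [title, views] pairs for Japanese Wikipedia ("ja")
# station pages found in `lines` (any object that yields pageviews lines via #each).
def top_station_pages(lines, limit: 10)
  rows = []
  lines.each do |line|
    domain, title, views = line.split
    # Skip other domain codes and titles that do not end with 駅 (station)
    next unless domain == "ja" && title&.end_with?("駅")
    rows << [title, views.to_i]
  end
  # max_by(n) returns the n largest elements, largest first
  rows.max_by(limit) { |_title, views| views }
end

# Hypothetical sample lines in the pageviews format (not real data)
sample = [
  "ja 東京駅 12 0",
  "ja 大阪駅 30 0",
  "ja.b 東京駅 99 0",        # different domain code, so excluded
  "en Tokyo_Station 50 0",   # not "ja", so excluded
  "ja 駅伝 40 0",            # 駅 is not at the end of the title, so excluded
  "ja 京都駅 7 0",
]

top_station_pages(sample, limit: 2).each do |title, views|
  puts "#{title} #{views}"
end
```

# With the sample above this prints "大阪駅 30" and then "東京駅 12".
# For a real dump, File.foreach(file_name) can be passed as `lines`; it reads the
# file line by line instead of loading it all at once.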