@hakanai
Created April 3, 2010 11:13
#!/usr/bin/env ruby
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
require 'csv'
SOURCES = [
  "http://tokyotosho.info/rss.php?filter=1",
  "http://tokyotosho.info/rss.php?filter=4,11",
  "http://www.animesuki.com/rss.php"
]
CSV_URL = 'http://spreadsheets.google.com/pub?key=tGprshqJ6POOL2t81LPC-8Q&output=csv'
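# Fetch every feed and collect the items into one flat list.
# URI.open is used because Kernel#open stopped handling URLs in Ruby 3.0.
items = SOURCES.map do |source|
  feed = URI.open(source) { |s| RSS::Parser.parse(s.read, false) }
  feed.items
end.flatten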
keep_titles = []
discard_titles = []
title_column = nil
action_column = nil
# The first CSV row is the header; use it to locate the columns we need.
# CSV.parse replaces CSV::Reader.parse, which was removed in Ruby 1.9.
CSV.parse(URI.open(CSV_URL).read) do |row|
  if !title_column
    title_column = row.index('Title')
    action_column = row.index('Action')
    next
  end
  case row[action_column]
  when 'Keep'
    keep_titles << row[title_column]
  when 'Discard'
    discard_titles << row[title_column]
  else
    raise RuntimeError, "Unknown action #{row[action_column]}"
  end
end
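# Titles are escaped so that punctuation in them is matched literally
# (this assumes the spreadsheet holds plain titles, not regex fragments).
keep_regex = Regexp.new(keep_titles.map { |s| Regexp.escape(s) }.join('|'), Regexp::IGNORECASE)
discard_regex = Regexp.new(discard_titles.map { |s| Regexp.escape(s) }.join('|'), Regexp::IGNORECASE)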
# fix badly-behaved sites which link to a listing page
items.each do |item|
  item.link = item.enclosure.url if item.enclosure && item.enclosure.url
end
# de-dup by link - don't want the same file twice because it will mess up dedup later on
items = items.uniq { |item| item.link }
# sort into the order we want
items = items.sort_by { |item| item.pubDate }.reverse
do_want, tmp = items.partition { |item| item.title =~ keep_regex }
do_not_want, not_sure_if_want = tmp.partition { |item| item.title =~ discard_regex }