Created
November 11, 2014 20:18
apparatus twitter usernames
require 'nokogiri'
require 'open-uri'
require 'csv'

URL = 'http://apparatus.si/oddaja/pogovor/page/%d/'

# Fetch and parse a page. URI.open is needed on Ruby 3.0+, where
# Kernel#open no longer accepts URLs.
def fetch(url)
  Nokogiri::HTML(URI.open(url))
end

# Number of archive pages, read from the second-to-last pagination link.
def last_page
  pagination = fetch(URL % 1).css('.archive-pagination a')
  pagination[-2].attr('href').gsub(/\D/, '').to_i
end

# All shows, oldest first, as { no:, person:, link: } hashes.
def shows
  1.upto(last_page).flat_map { |page|
    fetch(URL % page).css('h1.entry-title a').to_a
  }.map { |show|
    match = show.text.match(/^(.*): (.*)$/)
    next unless match # skip titles that are not "<number>: <person>"
    { no: match[1].to_i, person: match[2], link: show.attr('href') }
  }.compact.reverse
end

# Twitter handle from a link node: the URL path without its leading slash.
def get_user(link)
  URI(link.attr('href')).path.sub('/', '')
end

# Twitter links on a page, excluding share buttons and the hosts' accounts.
def get_twitter_links(html)
  html.css('a').select { |link|
    link.attr('href') =~ /twitter\.com/
  }.reject { |link|
    link.attr('href') =~ /share/ || %w(anzet apparatus_si).include?(get_user(link).downcase)
  }
end

twitter = shows.map { |show|
  links = get_twitter_links(fetch(show[:link]))
  # Always set :twitter (nil when no link is found) so every row has the same columns.
  show.merge(twitter: links.first && get_user(links.first))
}

CSV.open('twitter.csv', 'w') do |csv|
  csv << %i[no person link twitter]
  twitter.each { |row| csv << row.values }
end
Get the Twitter usernames of everyone who has been on the Apparatus podcast/Storming Mortal, for fun and profit. The script takes the first Twitter link on each show page that is not an anzet, apparatus_si, or share link. As with every 80/20 approach, it works for the majority but fails miserably for a minority, which also includes me: I get parishilton 😆
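
To see why the "first Twitter link wins" rule misfires, here is a minimal standard-library sketch of the same heuristic applied to a plain list of hrefs. The helper names `twitter_handle` and `first_guest_handle`, the sample list, and the `someguest` handle are made up for illustration and are not part of the script. Unlike the script, this sketch keeps only the first path segment and checks the URL host rather than the whole href.

```ruby
require 'uri'

# Handles that are always skipped (the same two the script filters out).
IGNORED = %w[anzet apparatus_si].freeze

# Hypothetical helper: the Twitter handle for an href, or nil when the
# href is not a usable Twitter profile link.
def twitter_handle(href)
  href = href.to_s
  uri = URI.parse(href)
  return nil unless uri.host.to_s =~ /(\A|\.)twitter\.com\z/i
  return nil if href =~ /share/ # share buttons are not profiles
  handle = uri.path.split('/').reject(&:empty?).first
  return nil if handle.nil? || IGNORED.include?(handle.downcase)
  handle
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: the first usable handle in page order, or nil.
def first_guest_handle(hrefs)
  hrefs.each do |href|
    handle = twitter_handle(href)
    return handle if handle
  end
  nil
end

# Assumed sample page: the guest's link comes after an unrelated mention.
SAMPLE_HREFS = [
  'http://apparatus.si/',
  'https://twitter.com/anzet',
  'https://twitter.com/share?text=hi',
  'https://twitter.com/parishilton',
  'https://twitter.com/someguest'
].freeze

puts first_guest_handle(SAMPLE_HREFS) # prints "parishilton"
```

The heuristic has no way to tell a guest's profile from any other account mentioned earlier on the page, so page order alone decides the result.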