@rafapolo
Last active August 15, 2020 08:42
Lists the filmography of the Cinemateca Nacional (Hey, Bolsonaro, go fuck yourself!)
#!/usr/bin/ruby
# extrapolo.com
require 'selenium-webdriver'
require "nokogiri"

# Start a headless Firefox session.
def driver
  Selenium::WebDriver.for :firefox, options: Selenium::WebDriver::Firefox::Options.new(
    args: ['-headless']
  )
end
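
# Convert one item's innerHTML to plain text, terminated by an </item> marker.
def clean_filmografia_item(html)
  # Turn tag boundaries into newlines before the markup is stripped.
  html = html.gsub("<br>", "\n").gsub("</div>", "\n").gsub("</b>", "\n")
  text = Nokogiri::HTML(html).text
  # Use gsub throughout: gsub! returns nil when nothing matches, which would break the chain.
  text = text.gsub(/\R+/, "\n").gsub("\n ", "\n").gsub(":\n", ": ").gsub(": \n", ": ")
  text + "\n</item>\n\n"
end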

# Walk every result page and append each film record to filmografia.txt.
def get_filmografia
  browser = driver
  browser.navigate.to('http://bases.cinemateca.gov.br/cgi-bin/wxis.exe/iah/?IsisScript=iah/iah.xis&base=FILMOGRAFIA&lang=p')
  # Submit the search form without entering a query.
  browser.find_element(name: 'search').submit
  c = 0 # items saved so far; element ids keep counting across pages
  (1..5058).each do |page|
    puts "page #{page}"
    browser.find_element(name: "Page#{page}").click if page > 1
    # Poll until the first item of this page is present.
    sleep 1 until browser.find_elements(id: "filme-#{c + 1}").any?
    (1..10).each do |i|
      el = browser.find_element(id: "filme-#{i + c}")
      html = el.attribute("innerHTML")
      data = clean_filmografia_item(html)
      save("filmografia.txt", data)
    end
    c += 10
  end
ensure
  # Close Firefox even if a page fails to load.
  browser&.quit
end

# Append text to file, creating the file if needed.
def save(file, text)
  File.open(file, 'a') do |f|
    f.puts text
  end
end

puts "starting..."
get_filmografia
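
The polling loop in get_filmografia keeps waiting forever if an item never shows up, for instance when the site stops responding. A minimal sketch of a bounded wait follows. The helper name wait_until and its timeout and interval defaults are assumptions made for illustration, not part of the original script.

```ruby
# Hypothetical helper (not in the original script): call the block repeatedly
# until it returns a truthy value, or raise once `timeout` seconds have passed.
def wait_until(timeout: 30, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "condition not met within #{timeout}s"
    end
    sleep interval
  end
end

# Possible use inside get_filmografia, in place of the polling loop:
#   wait_until { browser.find_elements(id: "filme-#{c + 1}").any? }
```

selenium-webdriver also ships Selenium::WebDriver::Wait, which serves the same purpose; the plain-Ruby version above only shows the idea without extra dependencies.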
@rafapolo

=> 47MB covering 50608 films at http://extrapolo.com/projeto/cinemateca/
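
Since the script ends every record with a line containing only </item>, a dump in that format can be read back one record at a time without loading the whole file into memory. The sketch below assumes the file follows exactly that layout. The method name each_record is hypothetical, and any trailing text without a closing </item> is ignored.

```ruby
# Hypothetical reader (not in the original script): yields each record of an IO
# as one string, using the "</item>" lines written by the scraper as separators.
def each_record(io)
  buffer = []
  io.each_line do |line|
    line = line.chomp
    if line == "</item>"
      yield buffer.join("\n") unless buffer.empty?
      buffer = []
    elsif !line.strip.empty?
      buffer << line # keep non-blank lines of the current record
    end
  end
end

# Possible use, counting the records in the file the script writes:
#   count = 0
#   File.open("filmografia.txt") { |f| each_record(f) { count += 1 } }
#   puts count
```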
