@gberger
Created January 22, 2014 15:31
Download all SISU selected-candidates pages, then parse them.
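# Usage:
#   ruby get.rb lower upper output-dir
# Example:
#   ruby get.rb 65000 85000 out
require 'open-uri'

# ANSI color helpers for console output.
class String
  def red;   "\033[31m#{self}\033[0m" end
  def green; "\033[32m#{self}\033[0m" end
end

# Arguments arrive as strings; convert them so the range is numeric.
(Integer(ARGV[0])..Integer(ARGV[1])).each do |n|
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
  source = URI.open("http://sisu.mec.gov.br/selecionados?co_oferta=#{n}").read
  # The matched text means "The page you tried to access is unavailable".
  if source.include? 'A página que você tentou acessar está indisponível'
    puts "Skipping #{n}".red
  else
    puts "Saving #{n}".green
    # The block form closes the file even if the write raises.
    File.open("#{ARGV[2]}/#{n}.html", "w") { |file| file.puts source }
  end
end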
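The usage comment in get.rb expects three positional arguments. Below is a minimal sketch of validating them before any request is made, assuming the bounds are meant to be integers with lower not greater than upper. `parse_args` is a hypothetical helper and not part of the original script.

```ruby
# Hypothetical helper: validate the three arguments from the usage line.
# Returns [range_of_offer_ids, output_dir] or raises ArgumentError.
def parse_args(argv)
  raise ArgumentError, 'usage: ruby get.rb lower upper output-dir' unless argv.length == 3

  # Integer() raises ArgumentError on non-numeric input such as "abc".
  lower = Integer(argv[0])
  upper = Integer(argv[1])
  raise ArgumentError, 'lower must not exceed upper' if lower > upper

  [lower..upper, argv[2]]
end

range, dir = parse_args(%w[65000 85000 out])
puts "Would fetch #{range.size} offers into #{dir}/"
```

Failing early this way avoids starting a long download run with a mistyped bound or a missing output directory argument.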
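require 'nokogiri'
require 'titleize'

class String
  # Trim the string and collapse runs of two or more whitespace
  # characters into a single space.
  def clean
    strip.gsub(/\s{2,}/, ' ')
  end
end

Dir.glob('out/*').each do |filename|
  # The block form closes the file handle once the page is parsed.
  page = File.open(filename) { |f| Nokogiri::HTML(f) }
  ies = page.css('.nome_ies_p').text.clean        # institution
  campus = page.css('.nome_campus_p').text.clean  # campus
  curso = page.css('.nome_curso_p').text.clean    # course
  turno = page.css('.grau_turno_p').text.clean    # degree and shift
  # [1..-1] drops the first .no_candidato match and keeps the rest.
  candidatos = page.css('.no_candidato').map { |cand| cand.text.clean }[1..-1]
  # Now do something with this info! Save to DB, write to a file...
end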
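One way to act on the closing comment ("Save to DB, write to a file...") is to flatten each page into CSV rows, one per candidate. This is a minimal sketch using only Ruby's `csv` library. The `candidate_rows` and `to_csv` helpers, the column names, and the sample values are all hypothetical and not part of the original gist.

```ruby
require 'csv'

# Hypothetical column names; rename to suit your data store.
HEADERS = %w[ies campus curso turno candidato].freeze

# Build one row per candidate, repeating the offer details on each row.
def candidate_rows(ies, campus, curso, turno, candidatos)
  # Array() turns nil into [], so a page with no matches yields no rows
  # (slicing an empty array with [1..-1] returns nil).
  Array(candidatos).map { |nome| [ies, campus, curso, turno, nome] }
end

# Serialize rows to a CSV string with a header line.
def to_csv(rows)
  CSV.generate do |csv|
    csv << HEADERS
    rows.each { |row| csv << row }
  end
end

# Made-up sample values standing in for what the parsing loop extracts.
rows = candidate_rows('Universidade Exemplo', 'Campus Central',
                      'Engenharia', 'Bacharelado - Integral',
                      ['Ana Silva', 'Bruno Souza'])
puts to_csv(rows)
```

Inside the parsing loop, you could call `candidate_rows(ies, campus, curso, turno, candidatos)` for each page and append the rows to a single file opened with `CSV.open`.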