@malev
Created May 13, 2014 16:45
Scraper for the list of names in BA
# encoding: UTF-8
require 'open-uri'
require 'nokogiri'
require 'csv'
# Build the results-page URL for a given pagination offset.
def gen_url(offset=0)
  if offset == 0
    "http://www.buenosaires.gob.ar/areas/registrocivil/nombres/busqueda/buscador_nombres.php?&menu_id=16082"
  else
    "http://www.buenosaires.gob.ar/areas/registrocivil/nombres/busqueda/buscador_nombres.php?offset=#{offset}&menu_id=16082"
  end
end
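# Download every results page (offsets 0, 150, ..., 9600) into html/,
# fetching the pages of each group of 10 offsets concurrently.
def scrap
  Dir.mkdir("html") unless Dir.exist?("html")
  (0..9600).step(150).each_slice(10) do |group|
    # Start one thread per offset, then wait for the whole group to finish.
    threads = group.map do |offset|
      Thread.new do
        puts "Working on #{offset}"
        # Save under html/ so the parsing step below can find the file.
        File.open("html/#{offset}.html", "w") do |file|
          file.write(URI.open(gen_url(offset)).read)
        end
      end
    end
    threads.each(&:join)
    puts "Finished group"
  end
end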
# CSV column names; only the name and prob.gender columns are filled in below.
headers = ["Name","years.appearing","count.male","count.female","prob.gender","obs.male","est.male","upper","lower"]
# Map the site's gender code ("M" or "F") to a label.
def get_gender(a)
  if a == "M"
    'male'
  elsif a == "F"
    'female'
  else
    'unknown'
  end
end
# Parse each downloaded page once and write one CSV row per name.
CSV.open("output.csv", "w:iso-8859-1") do |csv|
  csv << headers
  (0..9600).step(150).each do |num|
    filename = "html/#{num}.html"
    next unless File.exist?(filename)
    html = File.read(filename, encoding: "iso-8859-1")
    doc = Nokogiri::HTML(html)
    doc.css('table.contenido tbody tr').each do |trow|
      begin
        # children[0] holds the name; children[1] holds the gender code.
        csv << [trow.children[0].text, nil, nil, nil, get_gender(trow.children[1].text)]
      rescue => e
        puts e, trow.children[0].text
      end
    end
  end
end