@skateman
Last active February 10, 2017 09:49
List of Czechoslovak agents operating between 1975 and 1989 in Western Slovakia
require 'json'
require 'nokogiri'
require 'open-uri'

BASE = 'http://www.upn.gov.sk'.freeze
url = "#{BASE}/utvary-stb-a-ps-na-slovensku/zoznam-osob.php?pismeno=".freeze
agents = []

# Separate page for each letter
('A'..'Z').each do |letter|
  begin
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
    doc = Nokogiri::HTML(URI.open([url, letter].join))
  rescue Net::OpenTimeout, Net::ReadTimeout
    retry # keeps retrying until the request succeeds
  end

  # Multiple pages per letter
  loop do
    # Each table row describes one person
    doc.css('.zoznam-vysledkov tbody > tr').each do |row|
      data = row.css('td').map(&:text)
      # The last link in the row points to the detail page
      detail = row.css('td a').last.attr('href').to_s
      agent = {
        name: [data[1], data[0]].join(' '),
        first: data[1],
        last: data[0],
        born: data[2],
        id_1: data[3],
        id_2: data[4],
        code: data[5],
        url: "#{BASE}/utvary-stb-a-ps-na-slovensku/#{detail}"
      }
      agents << agent
    end

    # Jump to next page ("nasledujúca" is Slovak for "next")
    page = doc.css('a:contains("nasledujúca")')
    break if page.empty?

    begin
      doc = Nokogiri::HTML(URI.open("#{BASE}#{page.first['href']}"))
    rescue Net::OpenTimeout, Net::ReadTimeout
      retry # keeps retrying until the request succeeds
    end
  end
end

puts agents.to_json
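
The script retries a timed-out request with a bare `retry`, so it would loop forever if the server stayed unreachable. Below is a minimal sketch of a bounded alternative. The helper name `with_retries` and its `attempts` and `wait` parameters are hypothetical and not part of the original script.

```ruby
require 'net/http' # loads Net::OpenTimeout and Net::ReadTimeout

# Hypothetical helper (an assumption, not from the original script):
# runs the given block and retries it when a network timeout is raised,
# re-raising the timeout once `attempts` tries have been used up.
def with_retries(attempts: 3, wait: 0)
  tries = 0
  begin
    tries += 1
    yield
  rescue Net::OpenTimeout, Net::ReadTimeout
    raise if tries >= attempts # give up: re-raise the last timeout
    sleep(wait)                # optional pause (seconds) between tries
    retry
  end
end
```

With such a helper, a page fetch could be written as `doc = Nokogiri::HTML(with_retries { URI.open(address) })`, where `address` stands for whichever URL is being requested. A persistent outage would then surface as an exception instead of hanging the script.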