@yoshikischmitz
Last active December 28, 2015 22:58
# Written against Ruby 2.0.0; probably works on 1.9.x as well, but may not work on 1.8.x.
require 'mechanize'
mech = Mechanize.new
page = mech.get("http://www.europarl.europa.eu/meps/en/full-list.html?filter=all&leg=")
mep_links = []
# A Mechanize::Page object gives us every link on the downloaded page.
page.links.each do |x|
  link = x.href
  # Skip duplicates: each MEP is linked twice, once from the name and once from the photo.
  next if mep_links.include?(link)
  # An MEP page link has the form /meps/de/34234238/...
  mep_links << link if link =~ %r{/meps/\w+/\d+/}
end
mep_links.each_with_index do |url, index| # visit every URL in our list
  page = mech.get(url)
  # We use XPath expressions to extract the exact text we need from each page.
  # Mechanize exposes them through search(), which delegates to Nokogiri's
  # search on the parsed document, so we don't have to instantiate a Nokogiri
  # object ourselves.
  #
  # The name is held in the li tag with the class "mep_name": text()[1] is the
  # text before the line break and text()[2] is the text after it. Technically
  # mep_name should be an id rather than a class, since it is unique and
  # non-repeating.
  first_name = page.search("//li[@class='mep_name']/text()[1]").text
  last_name = page.search("//li[@class='mep_name']/text()[2]").text
  # I think this is the hometown?
  hometown = page.search("//*[@id='zone_before_content_global']/div/div[1]/ul/span[2]").text
  # Everything after the comma is the hometown (nil when there is no comma).
  hometown = hometown[/(?<=, ).+/]
  # Separating fields with tabs lets us be lazy and paste the console output straight into Excel.
  puts "#{first_name}\t#{last_name}\t#{hometown}"
  break if index == 2 # stop after the first three MEPs instead of fetching all 766 results
end
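
# The link filter, the hometown regex, and the tab-separated output can be
# tried offline, without Mechanize or a network connection. The following is a
# minimal sketch under stated assumptions: the sample hrefs, the sample
# "born ..., place" string, and the helper names mep_profile_links,
# hometown_from, and tsv_row are hypothetical and are not part of the script
# above; only the two regular expressions are taken from it.

```ruby
# Hypothetical hrefs shaped like the ones the script collects from the list page.
SAMPLE_HREFS = [
  "/meps/en/12345/JANE_DOE_home.html",
  "/meps/en/12345/JANE_DOE_home.html", # duplicate: name link plus photo link
  "/meps/en/full-list.html",           # not an MEP profile, no numeric id
  nil,                                 # anchors without an href yield nil
  "/meps/de/67890/JOHN_ROE_home.html"
]

# Same pattern as the script: /meps/<language>/<numeric id>/
MEP_LINK = %r{/meps/\w+/\d+/}

# Keep only MEP profile links, dropping nils and duplicates.
def mep_profile_links(hrefs)
  hrefs.compact.uniq.grep(MEP_LINK)
end

# Return the text after the first ", ", or nil when there is no comma.
def hometown_from(text)
  text[/(?<=, ).+/]
end

# Join the fields into one tab-separated row.
def tsv_row(*fields)
  fields.join("\t")
end

links = mep_profile_links(SAMPLE_HREFS)
puts links
puts tsv_row("Jane", "DOE", hometown_from("Born on 1 January 1970, Exampleville"))
```

# compact and uniq stand in for the include? check in the loop above; grep
# then keeps only the hrefs that match the MEP pattern.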