Last active December 28, 2015 22:58
# Written with Ruby 2.0.0; probably works on 1.9.x as well, may not work on 1.8.x
require 'mechanize'
mech = Mechanize.new
page = mech.get("")
mep_links = []
page.links.each do |x| # a Mechanize::Page object gives us every link on the downloaded page
  link = x.href
  # avoid duplicates: each MEP is linked once from the name and once from the photo
  unless mep_links.include? link
    # a MEP page link looks like /meps/de/34234238/..etc..
    mep_links << link if link =~ /\/meps\/\w+\/\d+\//
  end
end
mep_links.each_with_index do |url, index| # visit every URL in our list
  page = mech.get(url)
  # We'll use XPath expressions to extract the exact text we need from each page.
  # Mechanize exposes them through search(), which delegates to Nokogiri's search
  # on the parsed document, so we don't have to instantiate a Nokogiri object ourselves.

  # The name is held in the li tag with the class "mep_name": text()[1] gives us the
  # text before the line break and text()[2] the text after it. Note that technically
  # mep_name should be an id instead of a class, since it's unique and non-repeating.
  first_name = page.search("//li[@class='mep_name']/text()[1]").text
  last_name = page.search("//li[@class='mep_name']/text()[2]").text

  hometown = page.search("//*[@id='zone_before_content_global']/div/div[1]/ul/span[2]").text # I think this is the hometown?
  hometown = hometown.match(/(?<=, ).+/) # everything after the comma is the hometown

  # separating with tabs lets us be lazy and copy the results from the console into Excel
  puts "#{first_name}\t#{last_name}\t#{hometown}"
  break if index == 2 # stop after three pages instead of running on all 766 results
end
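
The two regular expressions above can be tried without hitting the network. The snippet below is only a sketch on made-up data: the helper names `mep_links_from` and `text_after_comma`, the sample hrefs, and the "date, town" string are all hypothetical and not taken from the real site. It uses `grep` plus `uniq` in place of the manual `include?` check, which should give the same list of links.

```ruby
# Same pattern as in the script: /meps/<language>/<numeric id>/...
MEP_LINK = /\/meps\/\w+\/\d+\//

# Keep only MEP page links and drop the name/photo duplicate.
# compact removes nil hrefs (anchors without an href attribute).
def mep_links_from(hrefs)
  hrefs.compact.grep(MEP_LINK).uniq
end

# Return everything after the first ", " or nil when there is no comma.
# String#[] with a regex returns the matched substring directly.
def text_after_comma(text)
  text[/(?<=, ).+/]
end

# Hypothetical sample data, not real links from the site.
sample_hrefs = [
  "/meps/en/12345/JANE_DOE_home.html", # link on the name
  "/meps/en/12345/JANE_DOE_home.html", # same link on the photo
  "/news/en/headlines",
  nil
]

p mep_links_from(sample_hrefs)                      # => ["/meps/en/12345/JANE_DOE_home.html"]
p text_after_comma("01 January 1970, Exampletown")  # => "Exampletown"
p text_after_comma("no comma here")                 # => nil
```

Returning nil for a missing comma is harmless in the script's `puts` line, because interpolating nil into a string yields an empty field.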