Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@SabretWoW
Created December 9, 2013 22:44
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save SabretWoW/7882406 to your computer and use it in GitHub Desktop.
Save SabretWoW/7882406 to your computer and use it in GitHub Desktop.
A Ruby web scraping script that visits a GitHub trending repos page, scrapes the data for the 25 repos, loads them into a CSV, and then reads from the CSV & creates a hash with each repo's data.
require 'mechanize'
require 'csv'
# Load up the trending Ruby repos on GitHub from the last month.
url_to_scrape = "https://github.com/trending?l=ruby&since=monthly"
# Snag the website with Mechanize & parse it into an XML document we can query.
page = Mechanize.new.get(url_to_scrape)
# Set the name of the CSV we'll create & load from.
file = "repo_data.csv"
# Array for the repo data we'll generate.
repos = []
# Create a CSV file & et it know to write a header row.
CSV.open(file,'w',
write_headers: true,
headers: ["rank", "author", "repo_name", "description"]) do |csv|
puts "Opened #{file} for writing...\n\n"
# Parse out each of the 25 small <div> elements that contain the repo's data.
page.search("ol li.repo-leaderboard-list-item").each do |repo, i|
# Create a temporary variable for each piece of repo data we want.
rank = repo.search("a.leaderboard-list-rank").inner_text.to_s.to_i
author = repo.search("span.owner-name").inner_text
repo_name = repo.search("a.repository-name strong").inner_text
description = repo.search("p.repo-leaderboard-description").inner_text
# Build an array of the repo data for the current row.
repo_data = [rank, author, repo_name, description]
# Push the row onto the CSV
csv << repo_data
end
end
puts "Done writing to #{file} file...\n\n"
# Load a CSV file & loop through each row. Let it know there is a header row & each key is split by commas. That gives you convenience methods that make the CSV data almost behave like a Hash.
CSV.foreach(file, headers: true, col_sep: ',') do |row|
# I'm just creating a hash with each repo's information then pushing it to the repos array. You could do whatever you wanted to here: write data to a database, fire off API calls from the data bits, just output to the log, whatever!
repo = Hash.new()
repo[:rank] = row['rank']
repo[:author] = row['author']
repo[:repo_name] = row['repo_name']
repo[:description] = row['description']
# Print the hash just for shits & giggles.
puts "#{repo}\n\n"
# Push the newly-created hash into the array of repos.
repos << repo
end
# Success messages rock!
puts "Done!\n\n"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment