@semperos
Created December 27, 2010 20:42
JRuby script that logs you into GitHub, scrapes the "source" of your wiki pages (i.e., what you typed), and saves it to a file.
require 'rubygems'
require 'celerity'
require 'hpricot'
require 'htmlentities'
# You obviously need all of the above gems installed before proceeding
user, password, project = ARGV # 'tobi', 'my_password', 'liquid'
raise(ArgumentError, "jruby scrape-github-wiki <username> <password> <projectname>") unless user and project and password
# Constants
WIKI_PAGES_URL = "https://github.com/#{user}/#{project}/wiki/_pages"
BASE_URL = "https://github.com"
# Start browser
puts "Starting headless browser and scraping wiki pages..."
@b = Celerity::Browser.new
@b.goto "https://github.com/login"
# For decoding
@ent = HTMLEntities.new
begin
  @b.text_field(:name => 'login').set user
  @b.text_field(:name => 'password').set password
  @b.button(:value => "Log in").click
  @b.goto WIKI_PAGES_URL
  toc_links = @b.elements_by_xpath("//*[@id='guides']/div/div[contains(@class, 'wikistyle')]/ul/li/strong/a")
  wiki_links = []
  # We have to get the hrefs up front, because Celerity
  # won't find the elements in its cache once we navigate away
  toc_links.each do |l|
    wiki_links << BASE_URL + l.href
  end
  wiki_text = ''
  wiki_links.each do |l|
    @b.goto l
    @b.link(:class => /btn-edit/).click
    wiki_page_title = @b.text_field(:id => "wiki_name").text
    puts "Scraping wiki page with title: #{wiki_page_title}"
    wiki_text << "Wiki Page Title: #{wiki_page_title}\n"
    wiki_text << @ent.decode(Hpricot(@b.html).at("#wiki_body").inner_html)
    wiki_text << "\n\n" + ("#" * 80) + "\n\n"
  end
rescue StandardError => e
  # String + Exception raises a TypeError, so interpolate the message instead
  puts "An error occurred: #{e.message}"
ensure
  @b.close
end
puts "Saving wiki page source to 'github_wiki_pages.txt'..."
File.open('github_wiki_pages.txt', 'w') { |f| f.write(wiki_text) }
puts "\nDone\n"
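
The extract-and-decode step in the loop above can be tried in isolation, without a browser. The following is only a minimal sketch under stated assumptions: it swaps Hpricot and HTMLEntities for a regular expression and the standard library's `CGI.unescapeHTML`, and both `SAMPLE_HTML` and the helper `textarea_source` are hypothetical names made up for illustration. It shows why decoding is needed: the wiki source sits inside a textarea, so the page HTML carries it in entity-escaped form.

```ruby
require 'cgi'

# Hypothetical stand-in for the HTML of a wiki edit page.
SAMPLE_HTML = "<form><textarea id=\"wiki_body\">h1. Title &amp; intro\n\n" \
              "Use &lt;code&gt; tags &quot;sparingly&quot;.</textarea></form>"

# Return the decoded contents of the textarea with the given id,
# or nil when no such textarea is found.
# Assumption: a regex is good enough for this tiny sample; the script
# above uses a real HTML parser (Hpricot) for the same job.
def textarea_source(html, id)
  pattern = /<textarea[^>]*\bid=["']#{Regexp.escape(id)}["'][^>]*>(.*?)<\/textarea>/m
  match = html.match(pattern)
  match && CGI.unescapeHTML(match[1])
end

puts textarea_source(SAMPLE_HTML, 'wiki_body')
```

A regex is fragile against real-world markup, so for anything beyond a sample string a proper parser remains the safer choice.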
@atmos

atmos commented Dec 27, 2010

You can also clone your wiki like a normal GitHub repo; it's a lot simpler. :)
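
For reference, a minimal sketch of that approach in Ruby. GitHub exposes each wiki as a Git repository at the project URL with a `.wiki.git` suffix; the helper names `wiki_clone_url` and `clone_wiki`, and the default target directory, are assumptions made for illustration. It also assumes `git` is on the PATH and that any credentials a private wiki needs are already configured.

```ruby
# Build the clone URL for a project's wiki repository.
def wiki_clone_url(user, project)
  "https://github.com/#{user}/#{project}.wiki.git"
end

# Clone the wiki into target_dir (hypothetical default: "<project>-wiki").
# Returns true when git exits successfully.
def clone_wiki(user, project, target_dir = "#{project}-wiki")
  system('git', 'clone', wiki_clone_url(user, project), target_dir)
end

# Only run the clone when invoked directly with two arguments, e.g.:
#   ruby clone-github-wiki.rb tobi liquid
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  user, project = ARGV
  abort 'git clone failed' unless clone_wiki(user, project)
end
```

The cloned directory holds one file per wiki page in its original markup, so no login automation or HTML decoding is required.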

@maddox

maddox commented Dec 27, 2010

check out my fork

@semperos
Author

Absolutely true :) There I go forgetting that they're repos. Still, it makes a fine "hello world" for the Watir/Celerity API and some super-simple Hpricot.