@kentbye
Last active December 29, 2015 03:39
This script removes the sidebar and navigation information from the HTML files of the Puppet Labs documentation in the "puppetdocs-latest" folder. It uses Nokogiri to select all of the content in the div with a "primary-content" class, strips out the last "Back to top" link at the bottom, and then writes the result to a separate folder.
#!/usr/bin/env ruby
require 'nokogiri'
# This script grabs the main content out of the Puppet Labs documentation
# and writes the cleaned HTML files to a separate directory.
#
# To use, first download http://docs.puppetlabs.com/puppetdocs-latest.tar.gz
# Create a TEMP folder at the top level of a working directory.
# Extract puppetdocs-latest at the top level, then make a copy of it inside the TEMP directory.
#
# puppetdocs-latest # Top-level directory whose files the ruby script overwrites
# TEMP
# |--- puppetdocs-latest # Directory with original data, read by the script
# |--- extract-content.rb
# |--- html-input-files.txt # A pruned list of HTML files
#
# To create html-input-files.txt, run this command from inside TEMP:
# $ find puppetdocs-latest -type f -name "*.html" > html-input-files.txt
# Then run extract-content.rb from inside TEMP as well, because all paths are relative.
filename = 'html-input-files.txt'
File.foreach(filename) do |line|
  # Strip the trailing newline to get the path of the HTML file.
  # chomp leaves the path intact when the last line has no newline.
  path = line.chomp
  next if path.empty?
  puts path
  # Parse the file named on this line of the input list; the block form closes it afterwards
  doc = File.open(path) { |f| Nokogiri::XML(f) }
  # Select all of the div content that has a class of primary-content
  primarycontent = doc.css('.primary-content')
  # Remove the "Back to top" link at the bottom of the page.
  # The leading "." keeps the search inside primary-content instead of the whole document.
  links = primarycontent.xpath('.//blockquote/p/a')
  if !links.empty? && links.last.inner_html == "↑ Back to top"
    links.last.remove
  end
  # Write the cleaned content to the same relative path one directory up,
  # overwriting the top-level copy of puppetdocs-latest.
  # Only the first instance of primarycontent is written. The second instance is erroneous
  File.open("../" + path, "w") { |out| out.write(primarycontent[0]) }
  puts path + " FINISHED"
end
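
The header comment builds html-input-files.txt with `find`. As a rough alternative, the same list could be produced from Ruby itself. The sketch below is only an illustration: the helper names `html_files_under` and `write_file_list` are hypothetical and not part of the original script, and unlike `find`, `Dir.glob` skips hidden files and directories by default.

```ruby
# Hypothetical helper (not in the original script): returns the paths of all
# regular files ending in .html under root, sorted for a stable order.
def html_files_under(root)
  Dir.glob(File.join(root, '**', '*.html'))
     .select { |path| File.file?(path) } # mirrors find's "-type f"
     .sort
end

# Hypothetical helper: writes one path per line, which is the format
# the extraction script reads from html-input-files.txt.
def write_file_list(root, list_path)
  File.write(list_path, html_files_under(root).map { |path| path + "\n" }.join)
end

# Example call, assuming it is run from inside TEMP:
# write_file_list('puppetdocs-latest', 'html-input-files.txt')
```

Sorting is not required by the extraction script; it only makes the generated list reproducible between runs.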