@kentbye
Last active January 3, 2016 09:19
This is a Nokogiri script that extracts the contents of a specific class from an HTML page -- in this case the '.node-blog-post' class, which is how Drupal labels a blog post. You'll need to scrape all of the HTML files into a folder, and then set up the directory structure as described in the notes.
#!/usr/bin/env ruby
require 'nokogiri'
# This script will grab the main content out of a Drupal blog post with class of '.node-blog-post',
# and then write the cleaned HTML files to a new directory. The header, sidebar, and footer
# will all be removed.
#
# To use, first download the set of blog post nodes from your site.
# For example, I created a view of the node ids (nid) for all of the blog posts.
# Then you can chain the downloads into one space-delimited shell command built from entries like
# 'wget http://puppetlabs.com/node/2924;'
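# For example (assuming a hypothetical nids.txt with one nid per line, and your
# own domain in place of puppetlabs.com), the downloads could be scripted as:
# $for nid in $(cat nids.txt); do wget http://puppetlabs.com/node/$nid; done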
# Create an EXTRACTHTML directory containing the downloadedhtml folder and a TEMP directory.
# Remove the non-html *.sh files from the downloadedhtml folder.
# Make a copy of downloadedhtml inside the TEMP directory.
#
# There will be a couple of input files that you'll need to create, but the final
# directory should look like this:
#
# EXTRACTHTML
# |--- downloadedhtml                      # Copied directory and files so that the ruby script can overwrite them
# |--- TEMP
#      |--- downloadedhtml                 # Directory with the original data
#      |--- extract-drupal-blog-content.rb # That is this file
#      |--- input-files.txt                # A pruned list of HTML files
#      |--- single-file.txt                # A single file to use to debug the primary content selector
#
# To create the list of files to scrape, run this command:
# $find downloadedhtml -type f > input-files.txt;
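# If any unwanted entries remain in the list (for example leftover *.sh files),
# prune them with something like:
# $grep -v '\.sh$' input-files.txt > pruned.txt && mv pruned.txt input-files.txt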
#
# Each entry's trailing newline is stripped with String#chomp. (Naive slicing with
# line[0..-2] would truncate the last entry when the file lacks a final newline,
# causing an error like:
# ./extract-drupal-blog-content.rb:24:in `initialize': No such file or directory - downloadedhtml/88 (Errno::ENOENT))
#
# You should be ready to either debug the primarycontent SELECTOR or to scrape the input files.
# cd into the TEMP directory and run ./extract-drupal-blog-content.rb to execute this script.
# Open up the list of files to iterate through
filename = 'input-files.txt' # Comment this line out if debugging the primarycontent SELECTOR
# If you're altering the primarycontent selector, then first create single-file.txt:
# cp input-files.txt single-file.txt;
# Open single-file.txt and delete all but the first line.
# Open up the single-file list to iterate through
# filename = 'single-file.txt' # Uncomment this line for debugging the primarycontent SELECTOR
File.open(filename, 'r').each_line do |line|
  # Strip the trailing newline to get the relative file path
  path = line.chomp
  # Provide feedback as to which file is actively being parsed
  puts path
  # Open up the file named on the current line of the input list
  f = File.open(path)
  doc = Nokogiri::HTML(f)
  # SELECTOR: Select the main content of each blog post
  primarycontent = doc.css('.node-blog-post')
  # puts primarycontent # Uncomment this line to debug the primarycontent selection
  # Close out the original file
  f.close
  # Open the copy of this file in the parent directory (EXTRACTHTML/downloadedhtml)
  # for writing, so the cleaned HTML overwrites it
  new = File.open("../" + path, "w")
  # Write only the first matched node; any second instance is erroneous
  new.write(primarycontent[0])
  # Close the output file so the contents are flushed to disk
  new.close
  # Indicate on the command line that this file is finished, to help debug whether a file crashes
  puts path + " FINISHED"
end
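For a quick sanity check of the SELECTOR, here is a minimal sketch (the HTML below is made up for illustration) showing what doc.css('.node-blog-post') keeps and what it drops:

    require 'nokogiri'

    html = <<-SAMPLE
    <html><body>
      <div id="header">Site header and nav</div>
      <div class="node-blog-post"><h2>Post title</h2><p>Body text</p></div>
      <div id="footer">Footer</div>
    </body></html>
    SAMPLE

    doc = Nokogiri::HTML(html)
    primarycontent = doc.css('.node-blog-post')
    # Prints only the div with class 'node-blog-post'; the header and footer are dropped
    puts primarycontent[0]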
kentbye commented Jan 16, 2014

If you're trying to get list item elements with Nokogiri, then use this on line 56:

primarycontent = doc.search('li')
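Note that doc.search accepts either CSS or XPath expressions and returns a NodeSet of every matching node, so this grabs all of the list items on the page, not just the first.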

kentbye commented Jan 16, 2014

If you want to write out each li item on a new line, and also delete the <li> and </li> tags, then use this code snippet on line 66:

    for i in (0..primarycontent.length-1)
      new.puts(primarycontent[i].inner_html)
    end
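An equivalent, more idiomatic form iterates over the NodeSet directly:

    primarycontent.each { |li| new.puts(li.inner_html) }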
    

kentbye commented Jan 16, 2014

To prepend the filename to all of the lines of scraped HTML, first cd into the final directory with the newly written files and run the following two commands.

First, append the file name to the end of each line and concatenate everything into all_files.txt:

    for f in * ; do cat $f | sed 's/$/ '$f'/'  >> all_files.txt; done
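For a file named 2924 (as in the wget example above), a hypothetical line '<h2>Post title</h2>' becomes '<h2>Post title</h2> 2924' in all_files.txt.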
    

Some sed magic via binford2k to move the filename of form "_####" from the end to the beginning and delimit it with a pipe:

    sed -E 's/^(.*) ([0-9]*)$/\2|\1/' all_files.txt > files.txt
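This rewrites the example line '<h2>Post title</h2> 2924' to '2924|<h2>Post title</h2>'.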
    
