@egardner
Last active August 29, 2015 14:15
Simple Ruby web scraper built with Nokogiri, designed for use with the pages on the Getty virtual library website.
require 'nokogiri'
require 'open-uri'
require 'json'

# Holds the metadata scraped from a single book page.
class Book
  attr_accessor :isbn, :title, :author, :keywords, :type, :year,
                :page_count, :description, :imprint

  def initialize(url)
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer fetches URLs.
    page = Nokogiri::HTML(URI.open(url))

    # Most fields come straight from the page's <meta> tags.
    @isbn = page.css('meta[name="isbn"]').attribute("content").text
    @title = page.css('meta[name="title"]').attribute("content").text
    @author = page.css('meta[name="author"]').attribute("content").text
    @keywords = page.css('meta[name="keywords"]').attribute("content").text
    @imprint = page.css('meta[name="imprint"]').attribute("content").text
    @type = page.css('meta[name="type"]').attribute("content").text
    @year = page.css('meta[name="resultyear"]').attribute("content").text

    # These two are only available in the page body.
    @page_count = page.css('#item-info p')[2].text
    @description = page.css('#desc-content p')
                       .to_s.force_encoding("ISO-8859-1").encode("UTF-8")
                       .delete("\n")
                       .gsub("\u0097", "—") # U+0097 is a mis-decoded em dash
  end

  # Prints every field, one per line.
  def display
    instance_variables.each do |var|
      puts instance_variable_get(var)
    end
  end

  # Returns the fields as a hash keyed by name, without the leading "@".
  def to_hash
    hash = {}
    instance_variables.each do |var|
      hash[var.to_s.delete("@")] = instance_variable_get(var)
    end
    hash
  end
end
@egardner

TODO:
- retain formatting (currently this is stripping out things like italics)
- output results as JSON
- accept an array of arguments or read a list of ISBNs from a separate file, logging each book as it goes
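
A rough sketch of how the last two items might fit together, not a finished implementation. The helper names `read_isbns` and `export_books` are hypothetical, and so is the one-ISBN-per-line file format. How an ISBN maps to a page URL is not shown here, so the caller supplies that step as a block.

```ruby
require 'json'

# Hypothetical helper: read one ISBN per line, ignoring blank lines
# and surrounding whitespace.
def read_isbns(path)
  File.readlines(path, chomp: true).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: build one record per ISBN with the caller's block,
# log each ISBN as it is processed, and return the records as JSON.
def export_books(isbns, log: $stderr)
  records = isbns.map do |isbn|
    log.puts "Scraping #{isbn}"
    yield(isbn)
  end
  JSON.pretty_generate(records)
end
```

With the `Book` class above, the block would be something like `{ |isbn| Book.new(url_for(isbn)).to_hash }`, where `url_for` stands in for whatever turns an ISBN into a page URL.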

@egardner

Right now the initialize method manually gsubs out weirdly encoded characters. I don't know whether each such character will have to be added one by one or whether there is a better way to handle this.
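
One possible alternative, assuming the pages are really Windows-1252 rather than ISO-8859-1: in Windows-1252 the byte 0x97 is the em dash, so transcoding from that encoding would cover the whole set at once, with no per-character gsub. The helper name `cp1252_to_utf8` is hypothetical, and it expects the raw page bytes, not a string that Nokogiri has already decoded.

```ruby
# Hypothetical helper. Assumption: `raw` holds Windows-1252 bytes.
# Bytes with no Windows-1252 meaning are replaced instead of raising.
def cp1252_to_utf8(raw)
  raw.dup
     .force_encoding(Encoding::Windows_1252)
     .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Byte 0x97 comes out as U+2014 (em dash), not the control character U+0097.
sample = "Getty \x97 Los Angeles".b
puts cp1252_to_utf8(sample)
```

If the site turns out to use a different encoding, the same pattern applies with that encoding's constant swapped in.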

Also, wherever possible, extract info from <meta> tags, not from actual text.
