Last active
August 29, 2015 14:15
Simple Ruby web scraper built with Nokogiri. Designed for use with the pages on the Getty virtual library website.
require 'nokogiri'
require 'open-uri'
require 'json'

class Book
  attr_accessor :isbn, :title, :author, :keywords, :type,
                :year, :page_count, :description, :imprint

  def initialize(url)
    # Kernel#open stopped handling URLs in Ruby 3.0; URI.open works on Ruby 2.5 and later.
    page = Nokogiri::HTML(URI.open(url))

    # Metadata comes from the page's <meta> tags.
    @isbn = page.css('meta[name="isbn"]').attribute("content").text
    @title = page.css('meta[name="title"]').attribute("content").text
    @author = page.css('meta[name="author"]').attribute("content").text
    @keywords = page.css('meta[name="keywords"]').attribute("content").text
    @imprint = page.css('meta[name="imprint"]').attribute("content").text
    @type = page.css('meta[name="type"]').attribute("content").text
    @year = page.css('meta[name="resultyear"]').attribute("content").text

    # The page count is read from the third paragraph of the item info block.
    @page_count = page.css('#item-info p')[2].text

    # Re-encode the description markup, drop newlines, and replace the stray
    # U+0097 control character with an em dash (U+2014).
    @description = page.css('#desc-content p')
                       .to_s.force_encoding("ISO-8859-1").encode("UTF-8")
                       .delete("\n")
                       .gsub("\u0097", "\u2014")
  end

  # Print every attribute value, one per line.
  def display
    instance_variables.each do |var|
      puts instance_variable_get(var)
    end
  end

  # Return the attributes as a hash keyed by name, without the leading "@".
  def to_hash
    hash = {}
    instance_variables.each do |var|
      hash[var.to_s.delete("@")] = instance_variable_get(var)
    end
    hash
  end
end
Right now the initialize method is manually gsub-ing out weirdly encoded characters. I don't know whether each one will have to be added individually or whether there is a better way to handle this.
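One possible alternative to adding replacements one by one, sketched under an assumption: if the stray characters are Windows-1252 bytes that were decoded as ISO-8859-1, they all land in the C1 control range (U+0080 to U+009F) and can be remapped in a single pass. The helper name `fix_cp1252` is hypothetical and not part of the gist, and whether the Getty pages actually follow this pattern has not been verified here.

```ruby
# Hypothetical helper: remap C1 control characters (U+0080..U+009F) to the
# characters Windows-1252 assigns to the same byte values.
# Assumes the input is a UTF-8 string.
def fix_cp1252(text)
  text.gsub(/[\u0080-\u009F]/) do |char|
    begin
      # Recover the single byte, relabel it as Windows-1252, then convert to UTF-8.
      char.encode("ISO-8859-1").force_encoding("Windows-1252").encode("UTF-8")
    rescue EncodingError
      # A few byte values have no Windows-1252 mapping; leave those untouched.
      char
    end
  end
end

puts fix_cp1252("Paintings\u0097a survey")
```

With this approach the `"\u0097"` case handled by the current `gsub` is covered together with curly quotes and similar punctuation from the same range.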
Also, wherever possible, extract info from <meta> tags, not from actual text.
TODO:
- retain formatting (currently this is stripping out things like italics)
- output results as JSON
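The JSON item could be approached with the `json` library the script already requires. This is a minimal sketch, assuming the input is a hash shaped like the one `to_hash` returns. The helper name `book_json` and the sample values are placeholders, not part of the gist.

```ruby
require 'json'

# Hypothetical helper: serialize a hash of book attributes as indented JSON.
def book_json(attributes)
  JSON.pretty_generate(attributes)
end

# Placeholder data standing in for the result of Book#to_hash.
sample = { "title" => "Placeholder Title", "year" => "1999" }
puts book_json(sample)
```

With a real `Book` instance the call would be `book_json(book.to_hash)`.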