@Jberczel
Last active August 29, 2015 14:11
Extract scrape logic out of models
class AgfScraper < ForumScraper
  attr_reader :forum

  def post_initialize(args)
    @forum = args[:forum] || default_forum
  end

  def page_count
    @page_count ||=
      content.css('td.vbmenu_control').text.split(' ').last.to_i
  end

  private

  def parse_table(page_num)
    data = page_data(page_url(page_num))
    rows = table_rows(data)
    remove_sticky_posts(rows, page_num)
  end

  def page_url(page_num)
    "#{url}&page=#{page_num}"
  end

  def page_data(url)
    Nokogiri::HTML(open(url))
  end

  def table_rows(data)
    data.css("#threadbits_forum_#{forum} tr")
  end

  def remove_sticky_posts(rows, page_num)
    page_num == 1 ? rows.drop(sticky_posts) : rows
  end

  def parse_single_post(row)
    title     = row.css('td')[2].css('div')[0].text.strip.gsub(/\s{2,}/, ' ')
    author    = row.css('td')[2].css('div')[1].text.strip.gsub(/\s{2,}/, ' ')
    last_post = row.css('td')[3].text.strip.gsub(/\s{2,}/, ' ')
    replies   = row.css('td')[4].text
    views     = row.css('td')[5].text
    link      = row.at_css('td a')['href']
    posts << OpenStruct.new(title: title, link: link, author: author,
                            last_post: last_post, replies: replies, views: views)
  end
end
require 'open-uri'
require 'nokogiri'
require 'ostruct'
class ForumScraper
  attr_reader :url, :sticky_posts, :posts, :content

  def initialize(args = {})
    @url = args[:url] || default_url
    @sticky_posts = args[:sticky] || default_sticky_posts
    @posts = []
    post_initialize(args)
  end

  def default_url
    "http://www.google.com"
  end

  def default_sticky_posts
    6
  end

  # subclasses may override
  def post_initialize(args)
    nil
  end

  def posts?
    !posts.empty?
  end

  def content
    @content ||= Nokogiri::HTML(open(url))
  end

  # subclasses must implement
  def page_count
    raise NotImplementedError
  end

  def parse_pages
    return "Scrape Failed" unless scrapable?
    1.upto(page_count) do |i|
      puts "\tparsing page #{i}..."
      retry_parse(3) { parse_single_page(i) }
    end
    self
  end

  def create_posts(model)
    return "No posts to create" unless posts?
    model.clear_db
    model.create(posts)
    self
  end

  private

  def scrapable?
    content && page_count
  end

  def parse_single_page(page_num)
    data_table = parse_table(page_num)
    parse_posts(data_table)
  end

  # parse html table; page_num is a url parameter
  def parse_table(page_num)
    raise NotImplementedError
  end

  # parse a single row in the html table
  def parse_single_post(row)
    raise NotImplementedError
  end

  def parse_posts(rows)
    rows.each do |r|
      parse_single_post(r)
    end
  end

  # retry the block up to n more times; the block must be forwarded
  # explicitly -- a bare recursive call drops it and yield raises LocalJumpError
  def retry_parse(n, &block)
    yield
  rescue StandardError => e
    puts "Error: #{e}\nWas not able to parse page."
    raise e if n == 0
    retry_parse(n - 1, &block)
  end
end
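To illustrate the retry behavior in isolation, here is a standalone sketch that mirrors `retry_parse`, with a deliberately flaky block standing in for real page parsing (the counter and the "flaky" error are illustrative only):

```ruby
# Mirrors ForumScraper#retry_parse: run the block, and on error
# recurse with the same block until n retries are exhausted.
def retry_parse(n, &block)
  yield
rescue StandardError => e
  puts "Error: #{e}"
  raise e if n == 0
  retry_parse(n - 1, &block)
end

attempts = 0
result = retry_parse(3) do
  attempts += 1
  raise "flaky" if attempts < 3 # fail twice, succeed on the third attempt
  "parsed"
end
puts "#{result} after #{attempts} attempts"
```

With `retry_parse(3)` the block may run up to four times in total (the initial attempt plus three retries); here it succeeds on the third.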
class GuitarScraper < AgfScraper
  def default_url
    "http://www.acousticguitarforum.com/forums/forumdisplay."\
    "php?f=17&pp=200&sort=lastpost&order=desc&daysprune=200"
  end

  def default_forum
    17
  end

  def default_sticky_posts
    6
  end
end
class Post < ActiveRecord::Base
  extend PostUtils
  default_scope { order(:id) }
end
module PostUtils
  def create(posts)
    posts.each do |p|
      create!(title: p.title, link: p.link, author: p.author,
              last_post: p.last_post, replies: p.replies, views: p.views)
    end
  end

  def clear_db
    destroy_all
    reset_pk_sequence # restart the primary key sequence so ids begin at 1
  end
end
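Because `PostUtils` only assumes the extending class responds to `create!`, `destroy_all`, and `reset_pk_sequence`, it can be exercised against a plain-Ruby stand-in; `FakeModel` and its stubbed methods below are illustrative only, not part of the app:

```ruby
require 'ostruct'

# PostUtils as defined above.
module PostUtils
  def create(posts)
    posts.each do |p|
      create!(title: p.title, link: p.link, author: p.author,
              last_post: p.last_post, replies: p.replies, views: p.views)
    end
  end

  def clear_db
    destroy_all
    reset_pk_sequence
  end
end

# Stand-in for an ActiveRecord model; the real Post class gets
# create!/destroy_all from ActiveRecord::Base.
class FakeModel
  extend PostUtils

  @rows = []
  class << self
    attr_reader :rows
    def create!(attrs); @rows << attrs; end
    def destroy_all; @rows.clear; end
    def reset_pk_sequence; end # no-op in this stub
  end
end

posts = [OpenStruct.new(title: 'OM-28', link: '/t/1', author: 'jay',
                        last_post: 'today', replies: '4', views: '120')]
FakeModel.clear_db
FakeModel.create(posts)
puts FakeModel.rows.first[:title]
```

This duck-typed seam is what lets `create_posts(model)` accept any model class that the rake tasks hand it.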
namespace :scrape do
  desc "Scrape AGF For Sale forum"
  task agf: :environment do
    puts "scraping AGF pages..."
    scraper = GuitarScraper.new
    scraper.parse_pages.create_posts(Post)
    puts "scraping complete."
  end

  desc "Scrape AGF Gear forum"
  task agf_gear: :environment do
    puts "scraping AGF GEAR pages..."
    scraper = GearScraper.new
    scraper.parse_pages.create_posts(Gear)
    puts "scraping complete."
  end

  desc "Scrape Larrivee forum"
  task larrivee: :environment do
    puts "scraping Larrivee pages..."
    scraper = LarriveeScraper.new
    scraper.parse_pages.create_posts(Larrivee)
    puts "scraping complete."
  end

  desc "Scrape Martin forum"
  task martin: :environment do
    puts "scraping Martin pages..."
    scraper = MartinScraper.new
    scraper.parse_pages.create_posts(Martin)
    puts "scraping complete."
  end
end
@Jberczel (Author)
Pulled out parsing logic into ForumScraper. Creation/deletion also extracted into PostUtils. Reuse 'ForumScraper' and PostUtil in the multiple forum scrapes: guitars, gear, larrivee, martin.
