Last active
August 29, 2015 14:11
Extract scrape logic out of models
# Parsing shared by the AGF scrapers; subclasses supply default_url and default_forum.
class AgfScraper < ForumScraper
  attr_reader :forum

  def post_initialize(args)
    @forum = args[:forum] || default_forum
  end

  # Page count: the last whitespace-separated token of the td.vbmenu_control text.
  def page_count
    @page_count ||=
      content.css('td.vbmenu_control').text.split(' ').last.to_i
  end

  private

  def parse_table(page_num)
    data = page_data(page_url(page_num))
    rows = table_rows(data)
    remove_sticky_posts(rows, page_num)
  end

  def page_url(page_num)
    "#{url}&page=#{page_num}"
  end

  def page_data(url)
    # URI.open, because Kernel#open stopped accepting URLs in Ruby 3.0
    Nokogiri::HTML(URI.open(url))
  end

  def table_rows(data)
    data.css("#threadbits_forum_#{forum} tr")
  end

  # Drop the leading sticky rows; they are skipped on page 1 only.
  def remove_sticky_posts(rows, page_num)
    page_num == 1 ? rows.drop(sticky_posts) : rows
  end

  # Build one post from a table row and append it to posts.
  def parse_single_post(row)
    cells     = row.css('td')
    title     = clean(cells[2].css('div')[0].text)
    author    = clean(cells[2].css('div')[1].text)
    last_post = clean(cells[3].text)
    replies   = cells[4].text
    views     = cells[5].text
    link      = row.at_css('td a')['href']
    posts << OpenStruct.new(title: title, link: link, author: author,
                            last_post: last_post, replies: replies, views: views)
  end

  # Trim the text and collapse runs of whitespace into a single space.
  def clean(text)
    text.strip.gsub(/\s{2,}/, ' ')
  end
end
# Base scraper: owns the page loop and retry logic; subclasses implement the parsing hooks.
require 'nokogiri'
require 'open-uri'
require 'ostruct'

class ForumScraper
  attr_reader :url, :sticky_posts, :posts

  def initialize(args={})
    @url          = args[:url]    || default_url
    @sticky_posts = args[:sticky] || default_sticky_posts
    @posts        = []
    post_initialize(args)
  end

  def default_url
    "http://www.google.com"
  end

  def default_sticky_posts
    6
  end

  # subclasses may override
  def post_initialize(args)
    nil
  end

  def posts?
    !posts.empty?
  end

  def content
    # URI.open, because Kernel#open stopped accepting URLs in Ruby 3.0
    @content ||= Nokogiri::HTML(URI.open(url))
  end

  # subclasses must implement
  def page_count
    raise NotImplementedError
  end

  def parse_pages
    return "Scrape Failed" unless scrapable?
    1.upto(page_count) do |i|
      puts "\tparsing page #{i}..."
      retry_parse(3) { parse_single_page(i) }
    end
    self
  end

  def create_posts(model)
    return "No posts to create" unless posts?
    model.clear_db
    model.create(posts)
    self
  end

  private

  def scrapable?
    content && page_count
  end

  def parse_single_page(page_num)
    data_table = parse_table(page_num)
    parse_posts(data_table)
  end

  # parse html table, page_num is url parameter
  def parse_table(page_num)
    raise NotImplementedError
  end

  # parse single row in html table
  def parse_single_post(row)
    raise NotImplementedError
  end

  def parse_posts(rows)
    rows.each do |r|
      parse_single_post(r)
    end
  end

  # Run the block, retrying up to n more times before re-raising the last error.
  # The block is forwarded explicitly; without it the recursive call has nothing to yield to.
  def retry_parse(n, &block)
    yield
  rescue StandardError => e
    puts "Error: #{e}\nWas not able to parse page."
    raise e if n == 0
    retry_parse(n - 1, &block)
  end
end
# Defaults for one AGF board (forum id 17).
class GuitarScraper < AgfScraper
  def default_url
    "http://www.acousticguitarforum.com/forums/forumdisplay."\
    "php?f=17&pp=200&sort=lastpost&order=desc&daysprune=200"
  end

  def default_forum
    17
  end

  def default_sticky_posts
    6
  end
end
# Model for scraped posts; the bulk create/clear helpers come from PostUtils.
class Post < ActiveRecord::Base
  extend PostUtils
  default_scope { order(:id) }
end
# Class-level helpers, mixed into a model with `extend`.
module PostUtils
  def create(posts)
    posts.each do |p|
      create!(title: p.title, link: p.link, author: p.author,
              last_post: p.last_post, replies: p.replies, views: p.views)
    end
  end

  def clear_db
    destroy_all
    reset_pk_sequence # reset primary key to 0
  end
end
# Rake tasks: one scrape task per forum.
namespace :scrape do
  desc "Scrape AGF For Sale forum"
  task agf: :environment do
    puts "scraping AGF pages..."
    scraper = GuitarScraper.new
    scraper.parse_pages.create_posts(Post)
    puts "scraping complete."
  end

  task agf_gear: :environment do
    puts "scraping AGF GEAR pages..."
    scraper = GearScraper.new
    scraper.parse_pages.create_posts(Gear)
    puts "scraping complete"
  end

  task larrivee: :environment do
    puts "scraping Larrivee pages..."
    scraper = LarriveeScraper.new
    scraper.parse_pages.create_posts(Larrivee)
    puts "scraping complete"
  end

  task martin: :environment do
    puts "scraping Martin pages..."
    scraper = MartinScraper.new
    scraper.parse_pages.create_posts(Martin)
    puts "scraping complete"
  end
end
Pulled out parsing logic into ForumScraper. Creation/deletion also extracted into PostUtils. Reuse ForumScraper and PostUtils in the multiple forum scrapes: guitars, gear, larrivee, martin.
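The reuse described here relies on a template-method split: the base class owns the page loop, and each forum scraper fills in three hooks. A minimal offline sketch of that shape follows. SketchScraper, InMemoryScraper, SketchPost and the PAGES data are hypothetical stand-ins, not part of the gist; they replace the HTTP and Nokogiri calls with an in-memory hash so the example runs without a network.

```ruby
# Hypothetical stand-in for a scraped post (the gist uses OpenStruct).
SketchPost = Struct.new(:title, :author)

# Hypothetical stand-in for ForumScraper: only the template-method skeleton,
# with no HTTP or HTML parsing.
class SketchScraper
  attr_reader :posts

  def initialize
    @posts = []
  end

  # Template method: the loop is fixed, the hooks below vary per forum.
  def parse_pages
    1.upto(page_count) do |page_num|
      parse_table(page_num).each { |row| parse_single_post(row) }
    end
    self
  end

  private

  # Hooks that every subclass must implement.
  def page_count
    raise NotImplementedError
  end

  def parse_table(_page_num)
    raise NotImplementedError
  end

  def parse_single_post(_row)
    raise NotImplementedError
  end
end

# Hypothetical subclass: rows come from a constant instead of a forum page.
class InMemoryScraper < SketchScraper
  PAGES = {
    1 => [['Martin D-18', 'alice'], ['Taylor 314ce', 'bob']],
    2 => [['Larrivee OM-03', 'carol']]
  }.freeze

  private

  def page_count
    PAGES.size
  end

  def parse_table(page_num)
    PAGES.fetch(page_num)
  end

  def parse_single_post(row)
    title, author = row
    posts << SketchPost.new(title, author)
  end
end

scraper = InMemoryScraper.new.parse_pages
puts scraper.posts.map(&:title).inspect
```

Under this split, adding another forum would presumably mean writing only the three hooks (plus defaults such as the URL), while the loop in parse_pages stays untouched.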