Last active
August 29, 2015 14:10
Rails model with lots of scrape logic
# Parsing logic shared by the AGF boards.
class AgfScraper < ForumScraper
  attr_reader :forum

  def post_initialize(args)
    @forum = args[:forum] || default_forum
  end

  # Page count, read from the last word of the pagination control.
  def page_count
    @page_count ||=
      content.css('td.vbmenu_control').text.split(' ').last.to_i
  end

  def parse_pages
    return "Scrape Failed" unless scrapable?
    1.upto(page_count) do |i|
      puts "\tparsing page #{i}..."
      retry_parse(3) { parse_single_page(i) }
    end
    self
  end

  private

  def parse_forum_rows(page_num)
    page_url = url + "&num=#{page_num}"
    # URI.open: Kernel#open no longer accepts URLs on Ruby 3.0+
    data = Nokogiri::HTML(URI.open(page_url))
    rows = data.css("#threadbits_forum_#{forum} tr")
    rows = rows.drop(sticky_posts) if page_num == 1 # skip sticky threads
    rows
  end

  def parse_post(row)
    title     = row.css('td')[2].css('div')[0].text.strip.gsub(/\s{2,}/, ' ')
    author    = row.css('td')[2].css('div')[1].text.strip.gsub(/\s{2,}/, ' ')
    last_post = row.css('td')[3].text.strip.gsub(/\s{2,}/, ' ')
    replies   = row.css('td')[4].text
    views     = row.css('td')[5].text
    link      = row.at_css('td a')['href']
    posts << OpenStruct.new(title: title, link: link, author: author,
                            last_post: last_post, replies: replies, views: views)
  end
end
require 'nokogiri'
require 'open-uri'
require 'ostruct'

# Base class: owns the scraping workflow; subclasses fill in forum specifics.
class ForumScraper
  attr_reader :url, :sticky_posts, :posts

  def initialize(args={})
    @url          = args[:url]    || default_url
    @sticky_posts = args[:sticky] || default_sticky_posts
    @posts        = []
    post_initialize(args)
  end

  def default_url
    "http://www.google.com"
  end

  def default_sticky_posts
    6
  end

  # subclasses may override
  def post_initialize(args)
    nil
  end

  def posts?
    !posts.empty?
  end

  # First page of the forum, fetched once and memoized.
  def content
    # URI.open: Kernel#open no longer accepts URLs on Ruby 3.0+
    @content ||= Nokogiri::HTML(URI.open(url))
  end

  def page_count
    raise NotImplementedError
  end

  # specific to each forum
  def parse_pages
    raise NotImplementedError
  end

  def create_posts(model)
    return "No posts to create" unless posts?
    model.clear_db
    model.create(posts)
    self
  end

  private

  def scrapable?
    content && page_count
  end

  def parse_single_page(page_num)
    rows = parse_forum_rows(page_num)
    parse_posts(rows)
  end

  def parse_forum_rows(page_num)
    raise NotImplementedError
  end

  def parse_post(row)
    raise NotImplementedError
  end

  def parse_posts(rows)
    rows.each do |r|
      parse_post(r)
    end
  end

  # Runs the block; on error, retries up to n more times before re-raising.
  def retry_parse(n, &block)
    begin
      yield
      sleep 2 # pause between page requests
    rescue StandardError => e
      puts "Error: #{e}\nWas not able to parse page."
      raise e if n == 0
      # forward the block; without it the retry has nothing to run
      retry_parse(n - 1, &block)
    end
  end
end
class Gear < ActiveRecord::Base
  extend PostUtils

  default_scope { order(:id) }
end
class GuitarScraper < AgfScraper
  def default_url
    "http://www.acousticguitarforum.com/forums/forumdisplay."\
    "php?f=17&pp=200&sort=lastpost&order=desc&daysprune=200"
  end

  def default_forum
    17
  end

  def default_sticky_posts
    6
  end
end
class Post < ActiveRecord::Base
  extend PostUtils

  default_scope { order(:id) }
end
module PostUtils
  # Creates one record per scraped post. On a model that extends this
  # module, this method takes precedence over ActiveRecord's own create.
  def create(posts)
    posts.each do |p|
      create!(title: p.title, link: p.link, author: p.author,
              last_post: p.last_post, replies: p.replies, views: p.views)
    end
  end

  def clear_db
    destroy_all
    reset_pk_sequence # reset primary key to 0
  end
end
Third revision highlights:
- Extracted the parse logic from the module into the AgfScraper class. All parsing now happens within this class, and all posts can be retrieved by calling AgfScraper.new.posts
- Extracted the creation and deletion of records into a module in PostsUtils.b
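The second point, a module that a model extends to gain record creation and deletion, can be sketched without Rails. This is only an illustration: SketchPostUtils, SketchPost, ScrapedRow and create_from are names made up for the sketch, and an in-memory array stands in for the database table.

```ruby
# Hypothetical stand-in for PostUtils, written against a plain in-memory
# class instead of ActiveRecord so the sketch runs on its own.
module SketchPostUtils
  # Store one record per scraped post.
  def create_from(posts)
    posts.each { |p| records << { title: p.title, link: p.link } }
  end

  # Remove every stored record.
  def clear_db
    records.clear
  end
end

# Hypothetical model; in the gist this role is played by Post and Gear.
class SketchPost
  extend SketchPostUtils # module methods become class methods

  def self.records
    @records ||= []
  end
end

# Stand-in for a scraped row (the gist uses OpenStruct).
ScrapedRow = Struct.new(:title, :link, keyword_init: true)

rows = [
  ScrapedRow.new(title: 'Martin D-18', link: '/thread/1'),
  ScrapedRow.new(title: 'Taylor 314', link: '/thread/2')
]

SketchPost.clear_db
SketchPost.create_from(rows)
puts SketchPost.records.length
```

Because the module is mixed in with extend rather than include, its methods are called on the class itself (SketchPost.clear_db), which is how the models in the gist use PostUtils.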
scraped_post.rb is a module that encapsulates the specific data for each forum that is parsed. It also has a convenience method to create records from scraped data.
Now, if I want to scrape a forum page and then create records, I can run this command:
ScrapedPosts.create_posts(ScrapedPosts::GUITARS, Post)
Fourth Revision:
Added templating:
ForumScraper -> AgfScraper -> GuitarScraper & GearScraper
ForumScraper -> LarriveeScraper
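The three-level hierarchy can be sketched in a self-contained form. This is a simplified illustration, not the gist's code: the Sketch* class names, the FAKE_PAGES constant and the forum.example URLs are invented here, and canned arrays replace the HTTP fetch and Nokogiri parsing.

```ruby
# Base class: fixes the order of the steps and leaves the details open.
class SketchForumScraper
  attr_reader :url, :posts

  def initialize(args = {})
    @url   = args[:url] || default_url
    @posts = []
    post_initialize(args) # hook: subclasses pick up extra arguments here
  end

  def post_initialize(_args); end

  def parse_pages
    1.upto(page_count) { |i| posts.concat(parse_forum_rows(i)) }
    self
  end

  def page_count
    raise NotImplementedError
  end

  def parse_forum_rows(_page_num)
    raise NotImplementedError
  end
end

# Middle layer: behaviour shared by every board on one forum site.
# FAKE_PAGES replaces the HTTP fetch and is invented for this sketch.
class SketchAgfScraper < SketchForumScraper
  FAKE_PAGES = {
    17 => [['Martin D-18', 'Taylor 314'], ['Gibson J-45']],
    54 => [['Boss tuner']]
  }.freeze

  attr_reader :forum

  def post_initialize(args)
    @forum = args[:forum] || default_forum
  end

  def page_count
    FAKE_PAGES.fetch(forum).length
  end

  def parse_forum_rows(page_num)
    FAKE_PAGES.fetch(forum)[page_num - 1]
  end
end

# Leaf classes: nothing but defaults.
class SketchGuitarScraper < SketchAgfScraper
  def default_url
    'http://forum.example/guitars'
  end

  def default_forum
    17
  end
end

class SketchGearScraper < SketchAgfScraper
  def default_url
    'http://forum.example/gear'
  end

  def default_forum
    54
  end
end

puts SketchGuitarScraper.new.parse_pages.posts.inspect
```

A scraper for a different forum site would subclass the base class directly and supply its own page_count and parse_forum_rows, which is the position LarriveeScraper occupies in the hierarchy above.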
Now, to scrape and create posts:
scraper = GearScraper.new
scraper.parse_pages.create_posts(Gear)
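One thing to keep in mind with this chain: parse_pages returns self on success but the string "Scrape Failed" otherwise, so a failed scrape would make the chained create_posts call raise NoMethodError. A possible guard, sketched with a made-up ChainScraper class and scrape_into helper (neither is part of the gist), and a plain array standing in for the model:

```ruby
# Hypothetical scraper that mirrors the return values of parse_pages and
# create_posts in the gist: self on success, a String on failure.
class ChainScraper
  attr_reader :posts

  def initialize(scrapable)
    @scrapable = scrapable
    @posts = []
  end

  def parse_pages
    return 'Scrape Failed' unless @scrapable
    posts << 'post 1' << 'post 2'
    self
  end

  def create_posts(store)
    return 'No posts to create' if posts.empty?
    store.concat(posts)
    self
  end
end

# Guarded chain: call create_posts only when parse_pages returned a scraper.
def scrape_into(scraper, store)
  result = scraper.parse_pages
  result.respond_to?(:create_posts) ? result.create_posts(store) : result
end

ok_store = []
scrape_into(ChainScraper.new(true), ok_store)
puts ok_store.inspect

failed = scrape_into(ChainScraper.new(false), [])
puts failed
```

An alternative design would be to raise an exception from parse_pages instead of returning a string, which keeps the chain safe without the respond_to? check.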
It looks like I've added a bit more code, but decoupling the scraping from the model should make it much easier to maintain and to add other forum scrapers to the website.
trying to abstract parsing logic