@Jberczel
Last active August 29, 2015 14:10
Rails model with lots of scrape logic
class AgfScraper < ForumScraper
  attr_reader :forum

  def post_initialize(args)
    @forum = args[:forum] || default_forum
  end

  # Page count is the last whitespace-separated token of the
  # td.vbmenu_control text, converted to an integer.
  def page_count
    @page_count ||=
      content.css('td.vbmenu_control').text.split(' ').last.to_i
  end

  def parse_pages
    return "Scrape Failed" unless scrapable?
    1.upto(page_count) do |i|
      puts "\tparsing page #{i}..."
      retry_parse(3) { parse_single_page(i) }
    end
    self
  end

  private

  def parse_forum_rows(page_num)
    page_url = url + "&num=#{page_num}"
    # URI.open (from open-uri) replaces bare Kernel#open, which no longer
    # opens URLs on Ruby 3.0+.
    data = Nokogiri::HTML(URI.open(page_url))
    rows = data.css("#threadbits_forum_#{forum} tr")
    # Sticky threads only appear at the top of the first page.
    rows = rows.drop(sticky_posts) if page_num == 1
    rows
  end

  def parse_post(row)
    cells     = row.css('td')
    title     = clean(cells[2].css('div')[0].text)
    author    = clean(cells[2].css('div')[1].text)
    last_post = clean(cells[3].text)
    replies   = cells[4].text
    views     = cells[5].text
    link      = row.at_css('td a')['href']
    posts << OpenStruct.new(title: title, link: link, author: author,
                            last_post: last_post, replies: replies, views: views)
  end

  # Trims the text and collapses runs of two or more whitespace characters.
  def clean(text)
    text.strip.gsub(/\s{2,}/, ' ')
  end
end

require 'nokogiri'
require 'open-uri'
require 'ostruct'

class ForumScraper
  attr_reader :url, :sticky_posts, :posts

  def initialize(args = {})
    @url          = args[:url] || default_url
    @sticky_posts = args[:sticky] || default_sticky_posts
    @posts        = []
    post_initialize(args)
  end

  def default_url
    "http://www.google.com"
  end

  def default_sticky_posts
    6
  end

  # subclasses may override
  def post_initialize(args)
    nil
  end

  def posts?
    !posts.empty?
  end

  # Fetches and memoizes the first page of the forum.
  def content
    @content ||= Nokogiri::HTML(URI.open(url))
  end

  def page_count
    raise NotImplementedError
  end

  # specific to each forum
  def parse_pages
    raise NotImplementedError
  end

  def create_posts(model)
    return "No posts to create" unless posts?
    model.clear_db
    model.create(posts)
    self
  end

  private

  def scrapable?
    content && page_count
  end

  def parse_single_page(page_num)
    rows = parse_forum_rows(page_num)
    parse_posts(rows)
  end

  def parse_forum_rows(page_num)
    raise NotImplementedError
  end

  def parse_post(row)
    raise NotImplementedError
  end

  def parse_posts(rows)
    rows.each do |r|
      parse_post(r)
    end
  end

  # Runs the block and pauses 2 seconds after a successful call. On error,
  # retries up to n more times before re-raising. The block is passed along
  # explicitly; without it the recursive call has nothing to yield to.
  def retry_parse(n, &block)
    yield
    sleep 2
  rescue StandardError => e
    puts "Error: #{e}\nWas not able to parse page."
    raise e if n == 0
    retry_parse(n - 1, &block)
  end
end

class Gear < ActiveRecord::Base
  extend PostUtils

  default_scope { order(:id) }
end

class GuitarScraper < AgfScraper
  def default_url
    "http://www.acousticguitarforum.com/forums/forumdisplay."\
    "php?f=17&pp=200&sort=lastpost&order=desc&daysprune=200"
  end

  def default_forum
    17
  end

  def default_sticky_posts
    6
  end
end

class Post < ActiveRecord::Base
  extend PostUtils

  default_scope { order(:id) }
end

module PostUtils
  def create(posts)
    posts.each do |p|
      create!(title: p.title, link: p.link, author: p.author,
              last_post: p.last_post, replies: p.replies, views: p.views)
    end
  end

  def clear_db
    destroy_all
    reset_pk_sequence # reset primary key to 0
  end
end

Jberczel commented Dec 5, 2014

trying to abstract parsing logic

Jberczel commented Dec 6, 2014

Third revision highlights:

  • Extracted the parse logic from the module into the AgfScraper class. All parsing happens within this class, and all posts can be retrieved by calling AgfScraper.new.posts
  • Extracted the creation and deletion of records into the PostUtils module.
  • scraped_post.rb defines a module that holds the forum-specific data for each parsed forum, plus a convenience method for creating records from the scraped data.

Now, to scrape a forum page and then create records, I can run:

    ScrapedPosts.create_posts(ScrapedPosts::GUITARS, Post)

Jberczel commented Dec 8, 2014

Fourth Revision:

Added templating (a template-method class hierarchy):
ForumScraper -> AgfScraper -> GuitarScraper & GearScraper
ForumScraper -> LarriveeScraper

Now, to scrape and create posts:

    scraper = GearScraper.new
    scraper.parse_pages.create_posts(Gear)
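
A rough sketch of how that hierarchy fits together, pared down so it runs without Nokogiri or network access. The base class fixes the page loop and subclasses fill in the three hooks. SketchScraper, CannedForumScraper, ScrapedPost, and the canned rows are all made up for illustration (the gist's own scrapers parse HTML rows and build OpenStruct posts), so treat this as an assumption-laden outline, not the actual LarriveeScraper or GearScraper.

```ruby
# Value object standing in for the OpenStruct posts used in the gist.
ScrapedPost = Struct.new(:title, :author)

# Pared-down stand-in for ForumScraper: only the template-method skeleton.
class SketchScraper
  attr_reader :posts

  def initialize
    @posts = []
  end

  # Template method: the loop is fixed here; subclasses supply the hooks.
  def parse_pages
    1.upto(page_count) do |page_num|
      parse_forum_rows(page_num).each { |row| parse_post(row) }
    end
    self
  end

  private

  def page_count
    raise NotImplementedError
  end

  def parse_forum_rows(_page_num)
    raise NotImplementedError
  end

  def parse_post(_row)
    raise NotImplementedError
  end
end

# Hypothetical forum: rows are canned arrays rather than HTML table rows.
class CannedForumScraper < SketchScraper
  PAGES = {
    1 => [['Martin D-18', 'alice'], ['Taylor 314ce', 'bob']],
    2 => [['Gibson J-45', 'carol']]
  }.freeze

  private

  def page_count
    PAGES.size
  end

  def parse_forum_rows(page_num)
    PAGES.fetch(page_num)
  end

  def parse_post(row)
    title, author = row
    posts << ScrapedPost.new(title, author)
  end
end

scraper = CannedForumScraper.new.parse_pages
puts scraper.posts.map(&:title).inspect
```

Adding a forum then comes down to writing one subclass with those three hooks; the loop in parse_pages stays untouched.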

Jberczel commented Dec 8, 2014

Looks like I've added a bit more code, but decoupling the scrapers from the model should make them much easier to maintain, and make it easier to add other forum scrapers to the website.
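
One way to see the decoupling: create_posts only ever sends clear_db and create to whatever it is handed, so any object answering those two messages can stand in for Post or Gear. The sketch below assumes exactly that. InMemoryModel, PostStore, and FakePost are hypothetical names, and PostStore just repeats the body of ForumScraper#create_posts so the example runs without Nokogiri or a database.

```ruby
# Value object standing in for the OpenStruct posts built by parse_post.
FakePost = Struct.new(:title, :link)

# In-memory stand-in for a model that extends PostUtils: it answers the same
# two messages (clear_db and create) but keeps its records in an array.
class InMemoryModel
  attr_reader :records

  def initialize
    @records = []
  end

  def clear_db
    @records.clear
  end

  def create(posts)
    posts.each { |post| @records << post.to_h }
  end
end

# Mirrors ForumScraper#create_posts from this gist in a tiny holder class.
class PostStore
  attr_reader :posts

  def initialize(posts)
    @posts = posts
  end

  def create_posts(model)
    return "No posts to create" if posts.empty?
    model.clear_db
    model.create(posts)
    self
  end
end

model = InMemoryModel.new
store = PostStore.new([FakePost.new('Martin D-18', '/showthread.php?t=1')])
store.create_posts(model)
store.create_posts(model) # a second run replaces the records, it does not append
puts model.records.inspect
```

A double like this could also be handy in specs, since the scraper side can be exercised without touching the database.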
