@chrisle
Last active December 10, 2015 02:58
module CapybaraWithPhantomJs
  include Capybara

  # Create a new PhantomJS session in Capybara
  def new_session
    # Register Poltergeist (the PhantomJS driver for Capybara) as the driver to use
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app)
    end

    # Use XPath as the default selector for the find method
    Capybara.default_selector = :xpath

    # Start up a new session
    @session = Capybara::Session.new(:poltergeist)

    # Report using a particular user agent
    @session.driver.headers = { 'User-Agent' =>
                                "Mozilla/5.0 (Macintosh; Intel Mac OS X)" }

    # Return the new session
    @session
  end

  # Returns the current session's page
  def html
    session.html
  end
end

# Add the mixin
require 'capybara_with_phantom_js'

# Google+ Scraper
#
# === Example
#
#   g_plus = GooglePlusScraper.new(111044299943603359137)
#   data = g_plus.to_h
#   # => { id: 111044299943603359137, in_circles: 1234, timestamp: 123456789 }
#
class GooglePlusScraper
  include CapybaraWithPhantomJs

  def initialize(profile_id)
    @profile_id = profile_id
  end

  # Return a hash
  def to_h
    data = {
      :id => @profile_id,
      :in_circles => in_circles,
      :timestamp => Date.today.to_datetime.to_i
    }
  end

  # Return the circle count as an integer
  def in_circles
    matches = tp_tx_hp
    return 0 if matches.nil?
    str = matches.find { |s| s.include?('have them in circles') }
    (str.nil?) ? 0 : Integer(str.gsub(/,/, '').match(/\d+/)[0])
  end

  # Return the text found in H3 tags
  def tp_tx_hp
    results = google_plus_page.search('//h3[@class="TP tx hp"]/span')
    results = results.collect(&:text)
    return nil if results.empty?
    results
  end

  # Get the Google Plus page and locally cache it in an instance variable
  def google_plus_page
    unless @google_plus_page
      new_session
      visit "https://plus.google.com/u/0/#{@profile_id}/posts"
      sleep 5 # give phantomjs 5 seconds and let the page fill itself in
      @google_plus_page = Nokogiri::HTML.parse(html)
    end
    @google_plus_page
  end
end

g_plus = GooglePlusScraper.new(111044299943603359137).to_h
# => { id: 111044299943603359137, in_circles: 1234, timestamp: 123456789 }

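The count extraction in `in_circles` can be tried without a browser. The sketch below is not part of the gist: `extract_count` and `circle_count` are hypothetical helper names, and the sample strings are made up for illustration. It also returns 0 when the matching text contains no digits, a case where `match(/\d+/)[0]` would raise on `nil`.

```ruby
# Hypothetical helper (not from the gist): pull the first integer out of a
# string such as "1,234 have them in circles". Returns 0 when there is none.
def extract_count(text)
  return 0 if text.nil?
  digits = text.delete(',')[/\d+/] # strip thousands separators, take first digit run
  digits ? digits.to_i : 0
end

# Hypothetical helper: find the first string containing the marker and
# return its count, or 0 when nothing matches.
def circle_count(texts, marker = 'have them in circles')
  line = Array(texts).find { |s| s.include?(marker) }
  extract_count(line)
end

puts circle_count(['Some heading', '1,234 have them in circles'])
```

The marker is a parameter because the wording on the page is outside the scraper's control and may change.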
@dsadaka

dsadaka commented Feb 19, 2015

I had to use the following updated code to make this work.
I marked each change with the comment "# Next line Updated by Dan".
I also had to use a different profile id, since the one in the sample no longer exists.

capybara_with_phantom_js.rb

require 'capybara/dsl'
require 'capybara/poltergeist'

module CapybaraWithPhantomJs
  # Next line Updated by Dan
  include Capybara::DSL

  # Create a new PhantomJS session in Capybara
  def new_session

    # Register Poltergeist (the PhantomJS driver for Capybara) as the driver to use
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app)
    end


    # Use XPath as the default selector for the find method
    Capybara.default_selector = :xpath

    Capybara.default_driver = :poltergeist

    # Start up a new session
    @session = Capybara::Session.new(:poltergeist)

    # Report using a particular user agent
    @session.driver.headers = { 'User-Agent' =>
                                    "Mozilla/5.0 (Macintosh; Intel Mac OS X)" }

    # Return the new session
    @session
  end

  # Returns the current session's page
  def html
    # Next line Updated by Dan
    @session.html
  end
end

google_plus_scraper.rb

require 'date'
require 'nokogiri'

# Add the mixin
require 'capybara_with_phantom_js'

# Google+ Scraper
#
# === Example
#
#   g_plus = GooglePlusScraper.new(111044299943603359137)
#   data = g_plus.to_h
#   # => { id: 111044299943603359137, in_circles: 1234, timestamp: 123456789 }
#
class GooglePlusScraper
  include CapybaraWithPhantomJs

  def initialize(profile_id)
    @profile_id = profile_id
  end

  # Return a hash
  def to_h
    {
        :id => @profile_id,
        :in_circles => in_circles,
        # Date#to_time works without ActiveSupport (plain DateTime has no #to_i)
        :timestamp => Date.today.to_time.to_i
    }
  end

  # Return the circle count as an integer
  def in_circles
    matches = tp_tx_hp
    return 0 if matches.nil?
    # Next line Updated by Dan to accommodate Google's changes
    str = matches.find { |s| s.include?('people') }
    (str.nil?) ? 0 : Integer(str.gsub(/,/, '').match(/\d+/)[0])
  end

  # Next line Updated by Dan to accommodate Google's changes
  # Return the text found in span tags whose class contains "d-s r5a"
  def tp_tx_hp
    # Next line Updated by Dan to accommodate Google's changes
    results = google_plus_page.search('//span[contains(@class,"d-s r5a")]')
    results = results.collect(&:text)
    return nil if results.empty?
    results
  end

  # Get the Google Plus page and locally cache it in an instance variable
  def google_plus_page
    unless @google_plus_page
      new_session
      # Next line Updated by Dan 
      @session.visit "https://plus.google.com/u/0/#{@profile_id}/posts"
      sleep 5 # give phantomjs 5 seconds and let the page fill itself in
      @google_plus_page = Nokogiri::HTML.parse(html)
    end
    @google_plus_page
  end

end
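
The fixed `sleep 5` in `google_plus_page` could in principle be replaced by polling until the content shows up. This is only a sketch under assumptions: `wait_until` is a hypothetical helper, not something Capybara or Poltergeist provides, and the usage shown in the trailing comment is untested against the real page.

```ruby
# Hypothetical helper: call the block repeatedly until it returns a truthy
# value or `timeout` seconds have passed. Returns the block's value, or nil
# on timeout.
def wait_until(timeout: 5, interval: 0.1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    return nil if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Possible use inside google_plus_page, in place of `sleep 5` (assumption):
#   wait_until(timeout: 5) { @session.html.include?('people') }
```

With this approach the scraper returns as soon as the text appears and still gives up after the same five seconds.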
