@Guri-ksolves
Created August 29, 2017 07:04
web_scarper
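require 'mechanize'     # form submission on the search page
require 'pry'           # only needed for the commented-out binding.pry calls
require 'nokogiri'      # HTML parsing of result pages
require 'active_record' # persistence of companies and jobs
require 'httparty'      # fetching follow-up result pages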
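# Open the Indeed India home page and submit its search form.
# Pass the search query and the location as the first and second
# command-line arguments.
agent = Mechanize.new
agent.get('https://www.indeed.co.in')
search_form = agent.page.forms[0]
search_form['q'] = ARGV[0] # search query
search_form['l'] = ARGV[1] # location
parse_page = search_form.submit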
ActiveRecord::Base.establish_connection(
  adapter: 'postgresql',
  host: 'localhost',
  encoding: 'unicode',
  database: 'web_scraper'
)
# Company model: owns the jobs found for one search
class Company < ActiveRecord::Base
  has_many :jobs
end

# Job model: one scraped job title, linked to its company
class Job < ActiveRecord::Base
  belongs_to :company
end
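puts '================================= first page'
# NOTE: `loaction` has to match the column name in the companies table;
# change it to `location` here only if the schema uses that spelling.
company = Company.create(name: ARGV[0], loaction: ARGV[1])
# Save and print the job title of every result row on the first page.
parse_page.css('.row').each do |row|
  job = row.css('.jobtitle').text.strip
  company.jobs.create!(title: job)
  puts job
end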
all_links = parse_page.css('div.pagination a').map { |link| link['href'] }
if all_links.any?
  page = 'https://www.indeed.co.in'
  puts all_links
  # The last pagination link is taken to be the next page of results.
  nxt = page + all_links.last
  # binding.pry
  puts nxt
  while nxt
    puts "================================= while page=#{nxt}"
    doc = HTTParty.get(nxt)
    parse_page = Nokogiri::HTML(doc.body)
    parse_page.css('.row').each do |row|
      job = row.css('.jobtitle').text.strip
      company.jobs.create!(title: job)
      puts job
    end
    pagination = parse_page.at('div.pagination')
    all_links = parse_page.css('div.pagination a').map { |link| link['href'] }
    # binding.pry
    # Continue only while the pagination bar still offers a "Next" link;
    # the nil and empty checks avoid a crash on a page without pagination.
    if pagination && pagination.text.include?('Next') && all_links.any?
      nxt = page + all_links.last
    else
      puts 'finished'
      nxt = nil
    end
  end
else
  puts 'finished'
end
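
The pagination step in the script above (find the "Next" link, then prefix its href with the site root) can be sketched as a small standalone helper that is easy to check without a network call. This is only an illustration under stated assumptions: the name `next_page_url`, the `BASE_URL` constant, and the `[text, href]` pair format are hypothetical and not part of the original script, and `URI.join` is swapped in for the plain string concatenation used there.

```ruby
require 'uri'

# Assumed site root; the original script builds URLs from the same host.
BASE_URL = 'https://www.indeed.co.in'

# Returns the absolute URL of the next results page, or nil if there is none.
#
# `links` is an array of [text, href] pairs. With Nokogiri they could be
# collected like this (hypothetical usage, not run here):
#   parse_page.css('div.pagination a').map { |a| [a.text, a['href']] }
def next_page_url(links, base_url = BASE_URL)
  # Pick the first link whose visible text mentions "Next".
  _label, href = links.find { |label, _| label.to_s.include?('Next') }
  return nil if href.nil? || href.empty?

  # URI.join resolves a relative href against the site root.
  URI.join(base_url, href).to_s
end

links = [['2', '/jobs?q=ruby&start=10'], ['Next', '/jobs?q=ruby&start=10']]
puts next_page_url(links)
```

Returning nil when no "Next" link is present lets a caller write `while (url = next_page_url(links))` instead of comparing against `false`.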