@enriquemanuel
Last active December 25, 2015 19:39
After several attempts at web scraping, I have reached a point where I'm proud of this project. It is by no means finished, and this is just one part of it. This code scrapes a page to find entities, matches them to a server, and returns that information as a hash. The hash will be used to fetch the images with Curb in a multi-threaded application.
require 'nokogiri'

class Scrape
  # Scrape a saved HTML page and return a hash of server name => image entity id
  # for every entity that belongs to the given client.
  def scrape_it(page_as_html, client_id)
    # open the file, parse it for web scraping, and close it when the block ends
    html = File.open(page_as_html) { |f| Nokogiri::HTML(f) }
    # get all the elements with class="Info"
    entities = html.css('.Info')
    # two hashes: entities_ids holds the intermediate and then the final result,
    # temp_hash is a temporary lookup that is only needed inside this method
    entities_ids = {}
    temp_hash = {}
    entities.each do |entity|
      # search for the client id in the element text
      if entity.text.include?(client_id.to_s)
        # long version to remove things:
        #   for_entity = entity.xpath('@for').to_s
        #   for_entity = for_entity.chop
        #   for_entity = for_entity.gsub("entity_[", "")
        # short version to remove things:
        for_entity = entity.xpath('@for').to_s.chop.gsub("entity_[", "")
        server = entity.text.split(':')[0]
        # puts server + " : " + for_entity
        # add to the hash; it looks like this now:
        #   entities_ids["apcprd_100367_57480_app14"] = 42 -> this value (42) is not the final one,
        # it is the identifier that we need to look up again from the checkbox
        entities_ids[server] = for_entity
      end
      # now get the checkboxes and create a temporary hash with ALL the checkboxes
      # this can be enhanced later
      name = entity.xpath('@name').to_s.lstrip
      id = entity.xpath('@id').to_s.lstrip.chop.gsub("entity_[", "")
      temp_hash[id] = name
    end
    # this creates our final hash with the correct image entity
    # it was like this:
    #   entities_ids["apcprd_100367_57480_app14"] = 42 -> this value is only the checkbox identifier
    # it will end like this:
    #   entities_ids["apcprd_100367_57480_app14"] = 132123123 -> this is the image entity
    entities_ids.each do |server, entity|
      image_id = temp_hash[entity].chop.gsub("entity[", "")
      entities_ids[server] = image_id
    end
    # return the hash for further use
    entities_ids
  end
end
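The two-pass lookup in `scrape_it` can be tried without an HTML file or Nokogiri. The following is a minimal sketch, not the gist's code: the `LABELS` and `CHECKBOXES` data, the attribute shapes (`entity_[42]`, `entity[2016696810]`), and the helper names `strip_wrapper` and `match_entities` are all assumptions made for illustration.

```ruby
# Hypothetical stand-ins for what the scraper reads from the page.
# The attribute shapes below are assumed, not taken from a real report.
LABELS = [
  { text: 'apcprd_100367_57480_app14: details', for: 'entity_[42]' },
  { text: 'apcprd_999999_11111_app01: details', for: 'entity_[43]' }
].freeze

CHECKBOXES = [
  { id: 'entity_[42]', name: 'entity[2016696810]' },
  { id: 'entity_[43]', name: 'entity[555]' }
].freeze

# Drop the trailing "]" and the leading prefix: "entity_[42]" -> "42".
# This mirrors the chop + gsub cleanup used in the gist.
def strip_wrapper(value, prefix)
  value.chop.gsub(prefix, '')
end

def match_entities(labels, checkboxes, client_id)
  # Pass 1: checkbox id => checkbox name, for every checkbox.
  names_by_id = checkboxes.each_with_object({}) do |box, acc|
    acc[strip_wrapper(box[:id], 'entity_[')] = box[:name]
  end

  # Pass 2: server => image id, only for labels that mention the client.
  labels.each_with_object({}) do |label, acc|
    next unless label[:text].include?(client_id.to_s)

    server = label[:text].split(':')[0]
    checkbox_id = strip_wrapper(label[:for], 'entity_[')
    # fetch raises KeyError if a label points at a missing checkbox
    acc[server] = strip_wrapper(names_by_id.fetch(checkbox_id), 'entity[')
  end
end

match_entities(LABELS, CHECKBOXES, 100367).each do |server, image_id|
  puts "#{server} #{image_id}"
end
```

Building the checkbox lookup first means each label needs only one hash access, and `fetch` fails loudly when a label has no matching checkbox instead of raising a `NoMethodError` on `nil`.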
@enriquemanuel
Author

To test this, run the following:

# assumes the class above is saved as scrape.rb in the same directory
require_relative 'scrape'

s = Scrape.new
hash = s.scrape_it('Reports.html', 100367)
hash.each do |name, id|
    puts "#{name} #{id}"
end

The output will be:

apcprd_100367_57480_app09 1430805166
apcprd_100367_57480_app10 1684625356
apcprd_100367_57480_app11 1684627183
apcprd_100367_57480_app12 1705898744
apcprd_100367_57480_app13 1706046783
apcprd_100367_57480_app14 2016696810
apcprd_100367_57480_app15 2016697514
apcstg_100367_57480_app02 807124402
