Last active
December 25, 2015 19:39
enriquemanuel/7029821
After several attempts at web scraping, I got this project to a point I'm proud of. It is by no means finished and is only one part of the project. It scrapes a page to get the entities, matches them to a server, and returns that information as a hash.
The hash will be used to fetch the images using Curb in a multi-threaded application.
require 'nokogiri'

class Scrape
  # Scrape a saved HTML page and return a hash that maps each server name
  # matching client_id to its image entity id.
  def scrape_it(page_as_html, client_id)
    # open the file and parse it for web scraping; the block form closes the file
    html = File.open(page_as_html) { |f| Nokogiri::HTML(f) }
    # get all the elements with class="Info"
    entities = html.css('.Info')
    # two hashes: entities_ids holds the intermediate and final result,
    # temp_hash is a temporary lookup table that is discarded when the method returns
    entities_ids = {}
    temp_hash = {}
    entities.each do |entity|
      # search for the client id (escaped so it is matched literally)
      if entity.text =~ /#{Regexp.escape(client_id.to_s)}/
        # long version of the clean-up:
        #   for_entity = entity.xpath('@for').to_s
        #   for_entity = for_entity.chop
        #   for_entity = for_entity.gsub("entity_[", "")
        # short version of the clean-up:
        for_entity = entity.xpath('@for').to_s.chop.gsub("entity_[", "")
        server = entity.text.split(':')[0]
        # puts server + " : " + for_entity
        # add to the hash; it looks like this now:
        #   entities_ids["apcprd_100367_57480_app14"] = 42 -> this value (42) is not final,
        # it is the identifier we need to look up again from the checkbox
        entities_ids[server] = for_entity
      end
      # now get the checkboxes and build a temporary hash with ALL the checkboxes
      # (this can be enhanced later)
      name = entity.xpath('@name').to_s.lstrip
      id = entity.xpath('@id').to_s.lstrip.chop.gsub("entity_[", "")
      temp_hash[id] = name
    end
    # build the final hash with the correct image entity
    # it was like this:
    #   entities_ids["apcprd_100367_57480_app14"] = 42 -> intermediate checkbox id
    # it will end like this:
    #   entities_ids["apcprd_100367_57480_app14"] = 132123123 -> the image entity
    # (assumes every matched label has a checkbox with the same id in temp_hash)
    entities_ids.each do |server, entity|
      image_id = temp_hash[entity].chop.gsub("entity[", "")
      entities_ids[server] = image_id
    end
    # return the hash for further use
    entities_ids
  end
end
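
The attribute clean-up and the two-pass lookup in scrape_it can be tried without Nokogiri. The following is a minimal sketch and not part of the original file: `strip_entity_id` and `resolve_image_ids` are hypothetical helper names, and the sample attribute values (`entity_[42]`, `entity[132123123]`) are assumptions based on the comments in the code above. It swaps the `chop` + `gsub` chain for a regex, which is a different technique from the one the class uses.

```ruby
# Hypothetical helper: pull out the value between the square brackets,
# e.g. "entity_[42]" -> "42" and "entity[132123123]" -> "132123123".
# A regex is used instead of chop + gsub, so a missing bracket gives nil
# rather than a silently mangled string.
def strip_entity_id(attribute)
  match = attribute.to_s.match(/\[([^\]]*)\]/)
  match && match[1]
end

# Hypothetical helper: the second pass of the lookup.
# labels:     server name => checkbox id taken from the label's "for" attribute
# checkboxes: checkbox id => raw "name" attribute of the checkbox
def resolve_image_ids(labels, checkboxes)
  labels.each_with_object({}) do |(server, checkbox_id), result|
    raw_name = checkboxes[checkbox_id]
    next if raw_name.nil? # no checkbox found for this label, skip it
    result[server] = strip_entity_id(raw_name)
  end
end

# Assumed sample data, shaped like the values described in the comments above.
labels = { "apcprd_100367_57480_app14" => strip_entity_id("entity_[42]") }
checkboxes = { strip_entity_id("entity_[42]") => "entity[132123123]" }

image_ids = resolve_image_ids(labels, checkboxes)
image_ids.each { |server, id| puts "#{server}: #{id}" }
# prints "apcprd_100367_57480_app14: 132123123"
```

Building a new hash in the second pass, instead of overwriting entities_ids in place, means a label without a matching checkbox is skipped rather than raising NoMethodError on nil.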
To test this you need to perform the following:
The output will be: