@andrepcg
Created July 27, 2015 15:17
Parser
----------
Product parser
-------------
The parser only needs a URL to work. The main class is called **Parser**; each site-specific parser extends it and overrides the functions needed to make it work on that site.
#### **General process**
1. Fetch the content from the URL (scraping, API, etc.)
    * Also fetch the Shop from the DB
2. Extract the product from the page
    * Create the product if it is not in the DB
3. Create the source from the URL
    * Combine Shop, Product and URL to create the Source
The Parser class constructor must be overridden for each specific store. The class exposes its main function, ***etl_process***.
```
def etl_process
  res = fetch_content
  @product = extract_product(res)
  @source = create_source(source_info, product, shop)
  return @source
end
```
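The flow above can be tried end to end with a minimal sketch. Only the names `etl_process`, `fetch_content`, `extract_product` and `create_source` come from this document; `DemoParser`, the canned response and the use of plain hashes in place of the real models are assumptions made for illustration.

```ruby
# Minimal sketch of the etl_process flow. Plain hashes stand in for the
# real Shop, Product and Source models (assumption, not the real schema).
class Parser
  attr_reader :url, :shop, :product, :source

  def etl_process
    res      = fetch_content                                   # 1. fetch
    @product = extract_product(res)                            # 2. extract
    @source  = create_source({ url: @url }, @product, @shop)   # 3. combine
  end

  private

  def fetch_content
    raise NotImplementedError, "#{self.class} must implement fetch_content"
  end

  def extract_product(_response)
    raise NotImplementedError, "#{self.class} must implement extract_product"
  end

  # Stand-in for the real helper, which would persist a Source record.
  def create_source(source_info, product, shop)
    source_info.merge(product: product, shop: shop)
  end
end

# Hypothetical store-specific parser.
class DemoParser < Parser
  def initialize(url)
    @url  = url
    @shop = { name: 'demo-shop' }
  end

  private

  # Returns a canned response instead of performing a real request.
  def fetch_content
    { 'title' => 'Demo product' }
  end

  def extract_product(response)
    { name: response['title'] }
  end
end

source = DemoParser.new('http://shop.example/p/1').etl_process
```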
Each subclass constructor **must** initialize two instance variables so that the logic functions can do their work: *@url* and *@shop*.
```
def initialize(full_product_url)
  @url = ...
  @shop = ... # find_shop(url) can be used
end

# Finds a specific shop in the DB by its URL.
# It also sets the @shop variable to the Shop object.
def find_shop(url)
end
```
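As a rough illustration of such a constructor, the sketch below derives the shop from the host part of the product URL. The `SHOPS` hash is a hypothetical in-memory stand-in for the database, `ExampleShopParser` is an invented name, and keying the lookup on the host name is an assumption.

```ruby
require 'uri'

# Hypothetical in-memory stand-in for the shops table.
SHOPS = { 'shop.example' => { name: 'Example Shop' } }.freeze

class Parser
  attr_reader :url, :shop

  # Stand-in for the real lookup: finds a shop by host and sets @shop.
  def find_shop(host)
    @shop = SHOPS[host]
  end
end

class ExampleShopParser < Parser
  def initialize(full_product_url)
    @url = full_product_url
    find_shop(URI.parse(full_product_url).host) # sets @shop
  end
end

parser = ExampleShopParser.new('http://shop.example/products/42')
```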
Each subclass should also implement its own ***extract_product*** and ***fetch_content*** functions.
> #### ***fetch_content***
> This function is responsible for retrieving the information by whatever means the site requires, such as an API request
> through a client library or a simple HTTP GET request.
> **Returns:** response
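
For a store without an API, a plain HTTP GET might look like the sketch below. It uses Ruby's standard `Net::HTTP`; the `HtmlShopParser` name and the injectable `http:` parameter are assumptions added so the request can be faked in tests, and the empty `Parser` class is only a stub.

```ruby
require 'net/http'
require 'uri'

# Stub base class so the sketch is self-contained.
class Parser; end

class HtmlShopParser < Parser
  # `http` is a hypothetical injection point; it defaults to Net::HTTP.
  def initialize(url, http: Net::HTTP)
    @url  = url
    @http = http
  end

  # Issues a GET request and returns the response object.
  # Redirects are not followed in this sketch.
  def fetch_content
    response = @http.get_response(URI.parse(@url))
    unless response.code == '200'
      raise "Request to #{@url} failed with status #{response.code}"
    end

    response
  end
end
```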
----------
> #### ***extract_product***
> This function is responsible for processing the response from the *fetch_content* function and creating the Product object. It can use the already defined *create_product* function.
> **Returns:** Product
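
A minimal sketch of an `extract_product` override, assuming the store returns a JSON document. The `sku`, `name` and `brand` keys, the `JsonShopParser` name and the stubbed `create_product` are hypothetical.

```ruby
require 'json'

class Parser
  # Stand-in for the real helper; here it simply returns the attributes.
  def create_product(product_info, _field)
    product_info
  end
end

class JsonShopParser < Parser
  # Turns the raw JSON response into a product, or nil when the
  # response carries no identifier to look the product up by.
  def extract_product(response)
    data = JSON.parse(response)
    return nil unless data['sku']

    create_product(
      { sku: data['sku'], name: data['name'], brand: data['brand'] },
      :sku
    )
  end
end

product = JsonShopParser.new.extract_product(
  '{"sku": "A1", "name": "Lamp", "brand": "Acme"}'
)
```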
### **Auxiliary functions**
These functions can also be overridden in each subclass.
> #### **create_product**
> **Input parameters**
- **product_info:** hash containing all the necessary parameters to create a product model
- **field:** the unique attribute used to look up an already existing product
> **Returns:** already existing or newly created product
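
The find-or-create behaviour described here can be sketched with an in-memory array in place of the database. The `PRODUCTS` constant and the hash-based products are assumptions; the document does not say how the real models are persisted.

```ruby
# Hypothetical in-memory stand-in for the products table.
PRODUCTS = []

# Returns the stored product whose `field` matches, or stores and
# returns `product_info` when no such product exists yet.
def create_product(product_info, field)
  existing = PRODUCTS.find { |p| p[field] == product_info[field] }
  return existing if existing

  PRODUCTS << product_info
  product_info
end

first  = create_product({ asin: 'B000TEST01', name: 'Lamp' }, :asin)
second = create_product({ asin: 'B000TEST01', name: 'Lamp (again)' }, :asin)
```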
----------
> #### **create_source**
> **Input parameters**
- **source_info:** hash containing all the necessary parameters to create a source model
- **product:** the product object
- **shop:** the shop object
> **Returns:** already existing or newly created source
----------
Example Amazon parser
---------------------
    class AmazonParser < Parser
      AMAZON_TLD_COUNTRY = {
        'co.uk' => 'GB',
        'com' => 'US'
      }

      def initialize(url, tld, asin = nil)
        @url = url
        @country = AMAZON_TLD_COUNTRY[tld]
        find_shop('amazon.' + tld)
        @asin = asin || getASIN
      end

      private

      # Extracts the ASIN from a /gp/product/<asin> or /dp/<asin> URL path.
      def getASIN
        m = @url.match(/\/((gp\/product\/(?<asin>\w+))|(dp\/(?<asin>\w+)))/)
        m && m[:asin]
      end

      def fetch_content # required
        request = Vacuum.new(@country)
        request.configure(
          aws_access_key_id: '',
          aws_secret_access_key: '',
          associate_tag: 'tag'
        )
        request.item_lookup(
          query: {
            'ItemId' => @asin,
            'ResponseGroup' => 'ItemAttributes,Offers,Images'
          }
        ).to_h
      end

      def extract_product(response) # required
        item = response['ItemLookupResponse']['Items']['Item']
        return nil unless item && @asin

        create_product({
          brand: item['ItemAttributes']['Brand'],
          model: item['ItemAttributes']['Model'],
          image_url: item['MediumImage'],
          asin: @asin,
          name: item['ItemAttributes']['Title']
        }, :asin)
      end

      # not finished yet
      def extract_price(response)
        item = response['ItemLookupResponse']['Items']['Item']
        item['Offers'] if item
      end
    end
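
The ASIN extraction can be checked in isolation. The sketch below uses a simplified variant of the pattern with a single named group rather than the exact regular expression above; `ASIN_PATTERN`, `extract_asin` and the sample URLs are made up for illustration.

```ruby
# Simplified variant of the URL pattern: one named group shared by the
# /gp/product/<asin> and /dp/<asin> path styles.
ASIN_PATTERN = %r{/(?:gp/product|dp)/(?<asin>\w+)}

# Returns the ASIN found in the URL path, or nil when there is none.
def extract_asin(url)
  match = url.match(ASIN_PATTERN)
  match && match[:asin]
end

extract_asin('http://www.amazon.co.uk/dp/B00TESTASN')             # => "B00TESTASN"
extract_asin('http://www.amazon.com/gp/product/B00TESTASN/ref=x') # => "B00TESTASN"
extract_asin('http://www.amazon.com/some/other/path')             # => nil
```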