Created
July 27, 2015 15:17
Parser
----------
Product parser
-------------
The parser only needs a URL to work. The main class is called **Parser**; each site-specific parser extends it and overrides the functions that need site-specific behavior.

#### **General process**
1. Fetch the content from the URL (scraping, API, etc.)
    * Also fetch the Shop from the DB
2. Extract the product from the page
    * Create the product if it is not in the DB
3. Create the source from the URL
    * Combine Shop, Product and URL to create the Source

The Parser class constructor must be overridden for each specific store. The class exposes its main function, ***etl_process***.
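```
def etl_process
  res = fetch_content              # raw response from the site or API
  @product = extract_product(res)  # existing or newly created Product
  # source_info: hash with the parameters needed to create the source (see create_source)
  @source = create_source(source_info, @product, @shop)
  return @source
end
```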
Each subclass constructor **must** initialize two instance variables, *@url* and *@shop*, so that the logic functions can do their work.
```
def initialize(full_product_url)
  @url = ...
  @shop = ... # find_shop(url) can be used
end

# Finds a specific shop in the DB by its URL.
# It also sets the @shop variable to the Shop object.
def find_shop(url)
end
```
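
The body of *find_shop* is left empty above. A minimal sketch of the behavior its comments describe could look like this, assuming an in-memory list in place of the database; the `Shop` struct, the `SHOPS` constant and the `shop` reader are hypothetical stand-ins, not part of the original code.

```ruby
# Hypothetical stand-ins: the real Shop is a database model.
Shop = Struct.new(:name, :url)
SHOPS = [
  Shop.new('Amazon UK', 'amazon.co.uk'),
  Shop.new('Amazon US', 'amazon.com')
].freeze

class Parser
  attr_reader :shop

  # Looks the shop up by URL and stores it in @shop (nil when unknown).
  def find_shop(url)
    @shop = SHOPS.find { |shop| shop.url == url }
  end
end

parser = Parser.new
parser.find_shop('amazon.co.uk')
puts parser.shop.name
```

Because the method both assigns and returns the shop, a constructor can simply call `find_shop(url)` without assigning the result itself.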
Each subclass should also implement its own ***extract_product*** and ***fetch_content*** functions.

> #### ***fetch_content***
> This function is responsible for getting the information by whatever means suits the site, such as an API request through any framework or a simple HTTP GET request.
>
> **Returns:** response

----------

> #### ***extract_product***
> This function is responsible for processing the response from the *fetch_content* function and creating the Product object. It can use the already defined *create_product* function.
>
> **Returns:** Product
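
To make the division of labor concrete, here is a rough sketch of a subclass that implements both functions. Everything in it is an assumption made for illustration: `ExampleShopParser`, the JSON payload and the `sku` field are hypothetical, the `Parser` class is a reduced stand-in that skips the source step, and a canned string replaces the real HTTP request.

```ruby
require 'json'

# Reduced stand-in for the real Parser base class (no database, no source step).
class Parser
  def etl_process
    res = fetch_content
    @product = extract_product(res)
  end

  # Hypothetical simplification: returns the hash it was given.
  def create_product(product_info, field)
    product_info
  end
end

# Hypothetical site-specific parser for a shop that serves product data as JSON.
class ExampleShopParser < Parser
  def initialize(full_product_url)
    @url = full_product_url
    @shop = 'example-shop' # the real code would call find_shop(url)
  end

  private

  # A canned body stands in for an HTTP GET on @url.
  def fetch_content
    '{"sku": "A-1", "name": "Kettle", "brand": "Acme"}'
  end

  # Turns the raw response into the hash that create_product expects.
  def extract_product(response)
    data = JSON.parse(response)
    create_product({ sku: data['sku'], name: data['name'], brand: data['brand'] }, :sku)
  end
end

product = ExampleShopParser.new('http://shop.example/p/A-1').etl_process
puts product[:name]
```

The base class drives the sequence, so the subclass only has to say how to fetch and how to extract.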

### **Auxiliary functions**
These functions can also be overridden in each subclass.

> #### **create_product**
> **Input parameters**
>
> - **product_info:** hash containing all the necessary parameters to create a product model
> - **field:** the unique attribute used to find an already existing product
>
> **Returns:** already existing or newly created product

----------
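
The find-or-create contract of *create_product* can be sketched as follows. This is only an illustration under stated assumptions: the `PRODUCTS` array is a hypothetical in-memory stand-in for the products table, and products are plain hashes rather than model objects. If the real Product is an ActiveRecord model, `find_or_create_by` would likely cover the same ground.

```ruby
# In-memory stand-in for the products table (assumption: the real code uses a DB model).
PRODUCTS = []

# Returns the stored product whose `field` value matches product_info[field],
# or stores product_info as a new product and returns it.
def create_product(product_info, field)
  existing = PRODUCTS.find { |p| p[field] == product_info[field] }
  return existing if existing

  PRODUCTS << product_info
  product_info
end

first  = create_product({ asin: 'B00EXAMPLE', name: 'Kettle' }, :asin)
second = create_product({ asin: 'B00EXAMPLE', name: 'Kettle (renamed)' }, :asin)

puts first.equal?(second) # the second call returns the record stored by the first
puts PRODUCTS.size
```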

> #### **create_source**
> **Input parameters**
>
> - **source_info:** hash containing all the necessary parameters to create a source model
> - **product:** the product object
> - **shop:** the shop object
>
> **Returns:** already existing or newly created source

----------
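
A comparable sketch for *create_source* shows how the shop, the product and the URL end up in one record. The `Source` struct, the `SOURCES` array, the `:url` key in `source_info` and the choice of URL plus shop as the uniqueness check are all assumptions made for this example.

```ruby
# Hypothetical stand-ins for the Source model and its table.
Source = Struct.new(:url, :product, :shop)
SOURCES = []

# Returns the stored source for this URL and shop, or creates one that
# links the product, the shop and the URL.
def create_source(source_info, product, shop)
  existing = SOURCES.find { |s| s.url == source_info[:url] && s.shop == shop }
  return existing if existing

  source = Source.new(source_info[:url], product, shop)
  SOURCES << source
  source
end

source = create_source({ url: 'https://www.amazon.co.uk/dp/B00EXAMPLE' }, 'Kettle', 'Amazon UK')
puts source.shop
puts SOURCES.size
```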

Example Amazon parser
---------------------

    class AmazonParser < Parser
      AMAZON_TLD_COUNTRY = {
        'co.uk' => 'GB',
        'com' => 'US'
      }

      def initialize(url, tld, asin = nil)
        @url = url
        @country = AMAZON_TLD_COUNTRY[tld]
        find_shop('amazon.' + tld)
        @asin = asin || getASIN
      end

      private

      # Extracts the ASIN from /gp/product/<asin> or /dp/<asin> URLs; nil if absent.
      def getASIN
        m = @url.match(/\/(?:gp\/product|dp)\/(?<asin>\w+)/)
        m && m[:asin]
      end

      def fetch_content # required
        request = Vacuum.new(@country)
        request.configure(
          aws_access_key_id: '',
          aws_secret_access_key: '',
          associate_tag: 'tag'
        )
        request.item_lookup(
          query: {
            'ItemId' => @asin,
            'ResponseGroup' => 'ItemAttributes,Offers,Images'
          }
        ).to_h
      end

      def extract_product(response) # required
        item = response['ItemLookupResponse']['Items']['Item']
        return nil unless item && @asin

        create_product({
          brand: item['ItemAttributes']['Brand'],
          model: item['ItemAttributes']['Model'],
          image_url: item['MediumImage'],
          asin: @asin,
          name: item['ItemAttributes']['Title']
        }, :asin)
      end

      # TODO: not finished yet
      def extract_price(response)
        item = response['ItemLookupResponse']['Items']['Item']
        item['Offers'] if item
      end
    end
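
The ASIN lookup in *getASIN* can be tried on its own without API credentials. The sketch below pulls the same idea into a standalone method with a single named group; `extract_asin`, `ASIN_PATTERN` and the sample URLs are made up for the example.

```ruby
# Matches /gp/product/<asin> or /dp/<asin> anywhere in the URL path.
ASIN_PATTERN = %r{/(?:gp/product|dp)/(?<asin>\w+)}

# Returns the ASIN as a string, or nil when the URL has no product segment.
def extract_asin(url)
  match = url.match(ASIN_PATTERN)
  match && match[:asin]
end

puts extract_asin('https://www.amazon.co.uk/dp/B00EXAMPLE1')
puts extract_asin('https://www.amazon.com/gp/product/B00EXAMPLE2?ref=x')
p extract_asin('https://www.amazon.com/help')
```

The first two calls print the ASIN from each URL form and the last prints `nil`, which is the case where the constructor would leave `@asin` unset.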