Last active
October 23, 2016 02:20
speculative sketch of a web scraper in Prolog with XPath, regular expressions and constraints
% The numerous inline comments are explanatory. Normal code wouldn't have them.
% I'm using double quotes to delineate raw text content. Real code would use
% a quasiquotation mechanism.
amazon_condition_overview(ASIN, Condition, BestPrice, Inventory, Section) :-
    % HTTP GET the URL and parse it into a DOM structure
    get(url("amazon.com/gp/offer-listing/$ASIN/?ie=UTF8"), DOM),
    % build an atom representing the tab's node ID
    TabId = qq("olpTab$Condition"),
    % constrain From to only those values which this DCG can parse
    phrase(from_lowest_price(BestPrice), From),
    % describe where the relevant DOM nodes are and their relation to each other
    DOM ~~ xpath("//ol[@id='olpTabs']/li[@id=$TabId]/a/span[@class='numberreturned'][.=$From]"),
    % the h2 tag content matches this regular expression.
    % interpolating Condition escapes any internal meta characters (including spaces)
    H2 ~~ rx("^ \s* ${Condition} \b", ix),
    % extract the number of offers using a named capture
    OfferCount ~~ rx("of \s+ (?P<InventoryAtom>\d+) \s+ offer", ix),
    atom_number(InventoryAtom, Inventory),
    % constrain IsOffer to be an HTML attribute containing the class "olpOffer"
    has_class(IsOffer, olpOffer),
    % find the relevant section of the DOM
    DOM ~~ xpath("//div[//h2[.=$H2]/span[.=$OfferCount]][//div[@class=$IsOffer]]", Section).
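
% integer//1 comes from library(dcg/basics) in SWI-Prolog
:- use_module(library(dcg/basics)).

% matches text such as "from $12.34"; Price is the amount in pennies
from_lowest_price(Price) -->
    "from ",
    dollars_and_cents(Price).

dollars_and_cents(Pennies) -->
    "$",
    integer(Dollars),
    ".",
    integer(Cents),
    % plain Prolog goals inside a DCG body must be wrapped in braces
    { Pennies is 100*Dollars + Cents }.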
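
The two grammar rules are the one part of this sketch that does not depend on the speculative `get/2`, `~~`, `xpath` and `rx` machinery, so they can be tried on their own. The following is a minimal standalone sketch, assuming SWI-Prolog 7 or later with `library(dcg/basics)`; `price_pennies/2` is a hypothetical helper added for illustration and is not part of the original gist.

```prolog
:- use_module(library(dcg/basics)).   % provides integer//1

% Matches text such as "from $12.34" and yields the price in pennies.
from_lowest_price(Pennies) -->
    "from ",
    dollars_and_cents(Pennies).

dollars_and_cents(Pennies) -->
    "$",
    integer(Dollars),
    ".",
    integer(Cents),
    { Pennies is 100*Dollars + Cents }.

% price_pennies(+Text, -Pennies): hypothetical helper (an assumption, not in
% the original). Converts an atom or string to a code list and runs the DCG.
price_pennies(Text, Pennies) :-
    atom_codes(Text, Codes),
    phrase(from_lowest_price(Pennies), Codes).
```

A query such as `?- price_pennies('from $12.34', P).` should bind `P` to the price in pennies. Text that lacks the `from ` prefix or the `$` sign makes the query fail instead of raising an error, which is the behaviour the main predicate relies on when it uses the grammar as a constraint on `From`.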