@mndrix
Last active October 23, 2016 02:20
speculative sketch of a web scraper in Prolog with XPath, regular expressions and constraints
% The numerous inline comments are explanatory. Normal code wouldn't have them.
% I'm using double quotes to delineate raw text content. Real code would use
% a quasiquotation mechanism.
amazon_condition_overview(ASIN, Condition, BestPrice, Inventory, Section) :-
    % HTTP GET the URL and parse it into a DOM structure
    get(url("amazon.com/gp/offer-listing/$ASIN/?ie=UTF8"), DOM),

    % build an atom representing the tab's node ID
    TabId = qq("olpTab$Condition"),

    % constrain From to only those values which this DCG can parse
    phrase(from_lowest_price(BestPrice), From),

    % describe where the relevant DOM nodes are and their relation to each other
    DOM ~~ xpath("//ol[@id='olpTabs']/li[@id=$TabId]/a/span[@class='numberreturned'][.=$From]"),

    % the h2 tag content matches this regular expression.
    % interpolating Condition escapes any internal meta characters (including spaces)
    H2 ~~ rx("^ \s* ${Condition} \b", ix),

    % extract the number of offers using a named capture
    OfferCount ~~ rx("of \s+ (?P<InventoryAtom>\d+) \s+ offer", ix),
    atom_number(InventoryAtom, Inventory),

    % constrain IsOffer to be an HTML attribute containing the class "olpOffer"
    has_class(IsOffer, olpOffer),

    % find the relevant section of the DOM
    DOM ~~ xpath("//div[//h2[.=$H2]/span[.=$OfferCount]][//div[@class=$IsOffer]]", Section).

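% parse text such as "from $12.34" into a price in pennies
from_lowest_price(Price) -->
    "from ",
    dollars_and_cents(Price).

dollars_and_cents(Pennies) -->
    "$",
    integer(Dollars),
    ".",
    integer(Cents),
    % ordinary Prolog goals inside a DCG body must be wrapped in braces
    { Pennies is 100*Dollars + Cents }.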
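
% A minimal, runnable sketch of the price grammar on its own, added for
% illustration only. It assumes SWI-Prolog 7 or later and assumes that
% integer//1 is the one from library(dcg/basics); the sketch above does not
% say where integer//1 comes from. The names lowest_price//1, usd_pennies//1
% and price_text_pennies/2 are hypothetical, chosen so they do not clash
% with the definitions above.
:- use_module(library(dcg/basics)).

lowest_price(Pennies) -->
    "from ",
    usd_pennies(Pennies).

usd_pennies(Pennies) -->
    "$",
    integer(Dollars),
    ".",
    integer(Cents),
    % the arithmetic is a plain Prolog goal, so it sits inside braces
    { Pennies is 100*Dollars + Cents }.

% convert the text to a code list, then parse it with the grammar
price_text_pennies(Text, Pennies) :-
    string_codes(Text, Codes),
    phrase(lowest_price(Pennies), Codes).

% Example query; it should bind Pennies to 1234:
% ?- price_text_pennies("from $12.34", Pennies).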