@bbakerman
Created April 27, 2020 03:58
Design a service which receives as input a list of URLs, scrapes those URLs for links to other pages and references to images, then returns a mapping of page URLs to a list of image URLs.
Your service does not need to download and store the images.
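As a rough illustration of the scraping step, the sketch below pulls link and image URLs out of one page's HTML using only the Python standard library. The names LinkAndImageParser and extract are hypothetical, and it assumes links live in <a href> and images in <img src>; a real page may reference images in other ways (CSS, srcset, <link rel="icon">) that this does not cover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collects absolute link and image URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: only <a href> and <img src> are considered.
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def extract(base_url, html):
    """Return (links, images) found in html, resolved against base_url."""
    parser = LinkAndImageParser(base_url)
    parser.feed(html)
    return parser.links, parser.images
```

Only the URLs are collected here; nothing is downloaded or stored, in line with the requirement above.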
Your service should follow links to other pages from the original submitted pages, and return the images on
those 2/3/nth level pages as if they were on the first level page.
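One possible way to follow links while crediting every image to the first level page is a breadth-first crawl that keeps a single image set per original URL. This is only a sketch: crawl, fetch_page and max_depth are hypothetical names, fetch_page(url) is assumed to return a (links, images) pair, and the depth limit is an assumption, since the exercise does not say how deep "nth level" goes.

```python
from collections import deque


def crawl(root_url, fetch_page, max_depth=2):
    """Breadth-first crawl; returns sorted image URLs reachable from root_url.

    fetch_page(url) -> (links, images) is a hypothetical callable supplied
    by the caller. max_depth is an assumed cut-off, not part of the spec.
    """
    seen = {root_url}          # pages already queued, guards against cycles
    images = set()             # all images are credited to root_url
    queue = deque([(root_url, 0)])
    while queue:
        url, depth = queue.popleft()
        links, page_images = fetch_page(url)
        images.update(page_images)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return sorted(images)
```

Because the images from every level land in the same set, the depth at which an image was found is deliberately not recorded.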
The API contract is defined as:
POSTing to /jobs with a body of a JSON array of URLs to start scraping from
(e.g. ["https://google.com", "https://www.statuspage.io"]) should return a job identifier of some kind
GETing /jobs/:job_id/status with the returned job identifier should return a JSON object of the
format of {"completed": x, "inprogress": y } where x is the number of original URLs which have been completely
crawled and y is the number of original URLs which are still being crawled.
GETing /jobs/:job_id/results with the returned job identifier should return a JSON object
mapping each original URL to all images reachable from that original URL, in the format of:
{
  "https://google.com": [
    "https://google.com/images/logo_sm_2.gif",
    "https://google.com/images/warning.gif"
  ],
  "https://www.statuspage.io": [
    "https://dka575ofm4ao0.cloudfront.net/assets/base/favicon-b756db379a57687bdfa58f6bac32bec2.png",
    "https://dka575ofm4ao0.cloudfront.net/assets/base/apple-touch-icon-144x144-precomposed-293c39b0635ae7523612fe7488be9244.png"
  ]
}
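The three endpoints above could be backed by something like the in-memory store sketched below. JobStore and its method names are hypothetical, the job identifier is assumed to be a UUID string (the contract only asks for "a job identifier of some kind"), and returning an empty list for URLs that are still being crawled is an assumption the exercise does not settle. HTTP routing is left out on purpose.

```python
import uuid


class JobStore:
    """In-memory job registry; a real service would likely persist this."""

    def __init__(self):
        self._jobs = {}  # job_id -> {original URL: image list, or None}

    def create_job(self, urls):
        # Backs POST /jobs: register the URLs, return an opaque identifier.
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {url: None for url in urls}  # None = in progress
        return job_id

    def record_result(self, job_id, url, images):
        # Called by a crawler once one original URL is completely crawled.
        self._jobs[job_id][url] = sorted(set(images))

    def status(self, job_id):
        # Backs GET /jobs/:job_id/status.
        job = self._jobs[job_id]
        completed = sum(1 for imgs in job.values() if imgs is not None)
        return {"completed": completed, "inprogress": len(job) - completed}

    def results(self, job_id):
        # Backs GET /jobs/:job_id/results.
        # Assumption: unfinished URLs are reported with an empty list.
        return {url: imgs or [] for url, imgs in self._jobs[job_id].items()}
```

Note that keying by URL means a URL submitted twice in the same job is counted once; whether duplicates should be counted separately is another open choice.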
NB: notice that the system does not have to track which images came from 2/3/nth level linked pages,
but it does need to track which of the originally submitted URLs led to the image URL.
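Since only the originally submitted URL has to be remembered, pages reached from several original URLs can be fetched once and their images credited to each of them. The sketch below shows that idea under stated assumptions: crawl_all and fetch_page are hypothetical names, fetch_page(url) is assumed to return a (links, images) pair, and there is no depth limit, which a real crawler would almost certainly need.

```python
def crawl_all(root_urls, fetch_page):
    """Map each original URL to the sorted images reachable from it."""
    cache = {}  # page URL -> (links, images), shared across all roots

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch_page(url)
        return cache[url]

    results = {}
    for root in root_urls:
        # Per-root bookkeeping: attribution is by root, never by depth.
        seen, stack, images = {root}, [root], set()
        while stack:
            links, page_images = cached_fetch(stack.pop())
            images.update(page_images)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    stack.append(link)
        results[root] = sorted(images)
    return results
```

The cache is shared across roots, while the seen set and image set are rebuilt for each root, so an image found on a shared page shows up under every original URL that leads to it.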