@bbakerman
Created April 9, 2020 05:49

Design a service that receives a list of URLs as input, scrapes those URLs for links to other pages and for references to images, then returns a mapping of each page URL to a list of image URLs.

Your service does not need to download and store the images.

Your service should follow links to other pages from the originally submitted pages, and return the images found on those second-, third-, and nth-level pages as if they were on the first-level page.
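The link-following behaviour described above can be sketched as a breadth-first crawl. This is only one possible approach, not a prescribed design. The names `LinkAndImageParser`, `crawl`, and `fetch_html` are hypothetical, `fetch_html` is assumed to be supplied by the caller (returning the page HTML, or None on failure), and `max_depth` is an assumed safety limit, since the problem statement does not say how deep to follow links.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndImageParser(HTMLParser):
    """Collects absolute <a href> and <img src> URLs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Drop the #fragment so the same page is not queued twice.
            url, _ = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.links.append(url)
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))


def crawl(start_url, fetch_html, max_depth=2):
    """Return the sorted image URLs reachable from start_url.

    fetch_html(url) is a hypothetical callable returning HTML or None.
    max_depth is an assumption; the spec does not define a limit.
    """
    seen = {start_url}
    images = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch_html(url)
        if html is None:
            continue  # unreachable page: skip it, keep crawling the rest
        parser = LinkAndImageParser(url)
        parser.feed(html)
        # Images from every level are credited to the starting URL.
        images.update(parser.images)
        if depth < max_depth:
            for link in parser.links:
                if link.startswith(("http://", "https://")) and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return sorted(images)
```

Passing the fetcher in as a parameter keeps the traversal testable without network access; for example, `crawl(url, pages.get)` works against a dict of canned HTML. The `seen` set is what prevents cycles between pages that link to each other.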

The API contract is defined as:


A POST to /jobs whose body is a JSON array of URLs to start scraping from (e.g. ["https://google.com", "https://www.statuspage.io"]) should return a job identifier of some kind.

A GET to /jobs/:job_id/status with the returned job identifier should return a JSON object of the form {"completed": x, "inprogress": y}, where x is the number of original URLs that have been completely crawled and y is the number of original URLs that are still being crawled.
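One way to back the job submission and status endpoints is a small in-memory registry keyed by job identifier. The sketch below is an illustration under assumptions: `JobStore` and its methods are hypothetical names, state is held in memory only (a real service would probably need persistence), and the identifier is a random UUID because the contract only asks for "a job identifier of some kind".

```python
import threading
import uuid


class JobStore:
    """In-memory job registry (an assumption, not part of the contract)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs = {}  # job_id -> {original_url: "inprogress" | "completed"}

    def submit(self, urls):
        """Register a job for the posted URLs and return its identifier."""
        if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
            raise ValueError("body must be a JSON array of URL strings")
        job_id = uuid.uuid4().hex
        with self._lock:
            # dict.fromkeys also de-duplicates URLs repeated in the request.
            self._jobs[job_id] = dict.fromkeys(urls, "inprogress")
        return job_id

    def mark_completed(self, job_id, url):
        """Called by a crawler worker when one original URL is fully crawled."""
        with self._lock:
            self._jobs[job_id][url] = "completed"

    def status(self, job_id):
        """Build the body for GET /jobs/:job_id/status."""
        with self._lock:
            states = list(self._jobs[job_id].values())
        completed = states.count("completed")
        return {"completed": completed, "inprogress": len(states) - completed}
```

Note that the counts are per original URL, not per page visited, which matches the wording of the status contract.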

A GET to /jobs/:job_id/results with the returned job identifier should return a JSON object that maps each original URL to all images reachable from it, in the format:

{
  "https://google.com": [
    "https://google.com/images/logo_sm_2.gif",
    "https://google.com/images/warning.gif"
  ],
  "https://www.statuspage.io": [
    "https://dka575ofm4ao0.cloudfront.net/assets/base/favicon-b756db379a57687bdfa58f6bac32bec2.png",
    "https://dka575ofm4ao0.cloudfront.net/assets/base/apple-touch-icon-144x144-precomposed-293c39b0635ae7523612fe7488be9244.png"
  ]
}

NB: the system does not have to track which images came from second-, third-, or nth-level linked pages, but it does need to track which of the originally submitted URLs led to each image URL.
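The note above implies that results only need to be keyed by original URL. A minimal sketch of that aggregation follows; `build_results` and `crawl_images` are hypothetical names, and `crawl_images(url)` is assumed to yield `(page_url, image_url)` pairs for every page reached from `url`.

```python
def build_results(original_urls, crawl_images):
    """Map each original URL to the de-duplicated images reachable from it.

    crawl_images(url) is a hypothetical callable yielding
    (page_url, image_url) pairs for every page reached from url.
    """
    results = {}
    for original in original_urls:
        seen = set()
        images = []
        for _page_url, image_url in crawl_images(original):
            # The page an image was found on (and its depth) is deliberately
            # discarded; only the original URL is kept as the key.
            if image_url not in seen:
                seen.add(image_url)
                images.append(image_url)
        results[original] = images
    return results
```

The returned dict can be serialised directly with `json.dumps` to produce the response body shown in the example.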
