Created
April 27, 2020 03:58
Design a service which receives as input a list of URLs, scrapes those pages for links to other pages and for image references, then returns a mapping of page URLs to a list of image URLs.
Your service does not need to download and store the images.
Your service should follow links to other pages from the originally submitted pages, and return the images on
those second-, third-, or nth-level pages as if they were on the first-level page.
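The scraping step described above can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed design: the names `LinkAndImageParser` and `extract` are hypothetical, and the sketch assumes that only `<a href>` counts as a link and only `<img src>` counts as an image (CSS backgrounds, `srcset`, and similar sources are ignored).

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkAndImageParser(HTMLParser):
    """Collects absolute <a href> and <img src> URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            # Resolve relative links against the page URL and drop any
            # #fragment so the same page is not treated as two URLs.
            absolute, _fragment = urldefrag(
                urljoin(self.base_url, attr_map["href"])
            )
            self.links.append(absolute)
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def extract(base_url, html):
    """Return (links, images) found in the given HTML text."""
    parser = LinkAndImageParser(base_url)
    parser.feed(html)
    return parser.links, parser.images
```

Keeping extraction separate from fetching means the parsing logic can be exercised on literal HTML strings without any network access.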
The API contract is defined as:
A POST to /jobs with a body of a JSON array of URLs to start scraping from
(e.g. ["https://google.com", "https://www.statuspage.io"]) should return a job identifier of some kind.
A GET to /jobs/:job_id/status with the returned job identifier should return a JSON object in the
format {"completed": x, "inprogress": y}, where x is the number of original URLs which have been completely
crawled and y is the number of original URLs which are still being crawled.
A GET to /jobs/:job_id/results with the returned job identifier should return a JSON object
mapping each original URL to all images reachable from that original URL, in the format:
{
  "https://google.com": [
    "https://google.com/images/logo_sm_2.gif",
    "https://google.com/images/warning.gif"
  ],
  "https://www.statuspage.io": [
    "https://dka575ofm4ao0.cloudfront.net/assets/base/favicon-b756db379a57687bdfa58f6bac32bec2.png",
    "https://dka575ofm4ao0.cloudfront.net/assets/base/apple-touch-icon-144x144-precomposed-293c39b0635ae7523612fe7488be9244.png"
  ]
}
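The bookkeeping behind the three endpoints above might look like the following in-memory sketch. The class `JobStore` and its method names are hypothetical, the job identifier is assumed to be a random UUID hex string, and the results view is assumed to contain only the original URLs whose crawl has finished; the contract itself does not say how partially crawled URLs should appear.

```python
import uuid


class JobStore:
    """In-memory job bookkeeping (illustrative only, not persistent)."""

    def __init__(self):
        # job_id -> {original_url: list of image URLs, or None while crawling}
        self._jobs = {}

    def create_job(self, urls):
        """Backs POST /jobs: register the URLs and return a job identifier."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = {url: None for url in urls}
        return job_id

    def record_result(self, job_id, url, images):
        """Called by a crawler when one original URL is fully crawled."""
        self._jobs[job_id][url] = list(images)

    def status(self, job_id):
        """Backs GET /jobs/:job_id/status."""
        per_url = self._jobs[job_id]
        completed = sum(1 for images in per_url.values() if images is not None)
        return {"completed": completed, "inprogress": len(per_url) - completed}

    def results(self, job_id):
        """Backs GET /jobs/:job_id/results (finished URLs only, by assumption)."""
        return {
            url: images
            for url, images in self._jobs[job_id].items()
            if images is not None
        }
```

A web framework of choice would wrap these methods in route handlers and serialize the returned dictionaries as JSON.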
NB: the system does not have to track which images came from second-, third-, or nth-level linked pages,
but it does need to track which of the originally submitted URLs led to each image URL.
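One way to satisfy this requirement is to run a separate breadth-first crawl per original URL and pool every image found into a single set for that origin. The sketch below rests on several assumptions: `crawl` and `fetch_page` are hypothetical names, `fetch_page(url)` is an injected callable returning `(links, images)` for a page, and `max_depth` is an added safety bound that the problem statement does not define.

```python
from collections import deque


def crawl(origin, fetch_page, max_depth=2):
    """Return the sorted image URLs reachable from one original URL.

    fetch_page(url) -> (links, images) is supplied by the caller, so the
    traversal can be tested without network access. max_depth is an
    assumed limit on how many link hops away from origin are followed.
    """
    seen = {origin}            # pages already queued, guards against cycles
    images = set()             # all images are attributed to origin
    queue = deque([(origin, 0)])
    while queue:
        url, depth = queue.popleft()
        links, page_images = fetch_page(url)
        images.update(page_images)
        if depth < max_depth:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return sorted(images)
```

Because the image set belongs to the origin rather than to the page where the image was found, the level information is discarded, as the note above permits.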