- Create a list of sites to visit: arabic, hindi, russian, mundo, etc.
- Define supported screen widths, for example: 176, 208, 240, 320, 352.
- Define the maximum size of an image. The current average size of a mobile page is 250 KB.
- Quality: 100% - File size: 822 KB: http://goo.gl/A4ldIF
- Quality: 50% - File size: 272 KB: http://goo.gl/AAbrY9
- Quality: 30% - File size: 193 KB: http://goo.gl/gOUkXK
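The quality/size figures above suggest driving JPEG quality from a size budget. A minimal sketch (function names are ours): build an ImageMagick `convert` command that lowers JPEG quality, and check the result against the ~250 KB average mobile-page size noted above. This assumes ImageMagick is installed; the command string would be run via `child_process`.

```javascript
// Build an ImageMagick command that re-encodes an image at a given JPEG quality.
// Assumes ImageMagick's `convert` is on the PATH; execute via child_process.exec.
function convertCmd(src, dest, quality) {
  return 'convert ' + src + ' -quality ' + quality + ' ' + dest;
}

// Check a file size against a budget in KB (e.g. the ~250 KB mobile-page average).
function fitsBudget(bytes, budgetKB) {
  return bytes <= budgetKB * 1024;
}
```

At quality 30 the sample page (193 KB) fits a 250 KB budget; at quality 100 (822 KB) it does not.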
- Crawl sites in a distributed, scalable and efficient way (scheduling policy, async requests, politeness, message queues, path-ascending crawling).
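One way to sketch the scheduling-policy and politeness pieces (class and method names are ours, not from any library): a FIFO queue that deduplicates URLs and only releases a URL once its host is outside a per-host politeness delay.

```javascript
// Minimal polite crawl scheduler sketch: dedupes URLs and enforces a
// per-host delay between requests. Time is passed in explicitly for testability.
function CrawlScheduler(delayMs) {
  this.delayMs = delayMs;
  this.seen = {};        // URLs already enqueued, to avoid revisiting
  this.queue = [];       // FIFO of pending URLs
  this.lastFetch = {};   // host -> timestamp of the last request
}

CrawlScheduler.prototype.enqueue = function (url) {
  if (this.seen[url]) return false;  // skip duplicates
  this.seen[url] = true;
  this.queue.push(url);
  return true;
};

// Return the next URL whose host is outside its politeness window, or null.
CrawlScheduler.prototype.next = function (now) {
  for (var i = 0; i < this.queue.length; i++) {
    var host = this.queue[i].split('/')[2];  // crude host extraction, fine for a sketch
    var last = this.lastFetch[host];
    if (last === undefined || now - last >= this.delayMs) {
      this.lastFetch[host] = now;
      return this.queue.splice(i, 1)[0];
    }
  }
  return null;
};
```

A real implementation would back the queue with a message broker (e.g. SQS, as discussed below) rather than an in-memory array.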
- Note: The app must run well on Linux and should also support test runs on OS X.
- Identify all the links in each site and add them to a list of URLs to visit.
- Open-source C/C++ libraries: HTTrack or GNU Wget.
- Recursively visit URLs according to a set of policies.
- Update images only if the content changes. For this, the application needs to cache the web pages.
- Convert web pages to images. See PhantomJS, CasperJS, webkit2png, python-webkit.
- domain/2013/12/09/page.html -> (create and save image) -> domain/2013/12/09/page.jpg
- domain/2013/12/09/ -> (create and save image) -> domain/2013/12/09.jpg
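The path mapping shown above can be captured in one small function (ours, for illustration) that decides the output filename before handing the URL to a renderer such as PhantomJS: a page URL maps to a sibling `.jpg`, and a directory URL maps to a `.jpg` named after the directory.

```javascript
// Map a page path to the image path it should be rendered to.
function imagePathFor(path) {
  if (path.charAt(path.length - 1) === '/') {
    // domain/2013/12/09/  ->  domain/2013/12/09.jpg
    return path.slice(0, -1) + '.jpg';
  }
  // domain/2013/12/09/page.html  ->  domain/2013/12/09/page.jpg
  return path.replace(/\.[^./]+$/, '') + '.jpg';
}
```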
- Create image-map using HTML map element.
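For the image map, the link rectangles found on the rendered page can be serialized into the HTML `map`/`area` elements. A sketch, assuming the crawler has already computed pixel coordinates for each link (`areas` is our own ad-hoc shape):

```javascript
// Build an HTML <map> element so link regions on the rendered image stay clickable.
// `areas` is an array of { coords: 'x1,y1,x2,y2', href: '...' } objects.
function buildImageMap(name, areas) {
  var tags = areas.map(function (a) {
    return '<area shape="rect" coords="' + a.coords + '" href="' + a.href + '">';
  });
  return '<map name="' + name + '">' + tags.join('') + '</map>';
}
```

The rendered image then references it via `<img src="page.jpg" usemap="#name">`.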
- Transfer images and web pages to an S3 bucket.
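With the AWS SDK for Node.js (linked in the comments below), the S3 transfer is a `putObject` call. A sketch, assuming `npm install aws-sdk` and configured credentials; the helper names and the hard-coded JPEG content type are ours:

```javascript
// Pure parameter builder, separated out so it can be tested without AWS.
function buildPutParams(bucket, key, body) {
  return { Bucket: bucket, Key: key, Body: body, ContentType: 'image/jpeg' };
}

// Upload a rendered image to S3. Requires the aws-sdk package and credentials
// (e.g. from environment variables or ~/.aws/credentials).
function upload(bucket, key, body, done) {
  var AWS = require('aws-sdk');
  var s3 = new AWS.S3();
  s3.putObject(buildPutParams(bucket, key, body), done);
}
```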
- Good logging is necessary for monitoring.
Last active December 28, 2015 14:09
@sthulb @JakeChampion I agree, we should identify the most-visited pages and start there (maybe spidering to a depth of 1)?
Probably handy: http://aws.amazon.com/sdkfornodejs/
@kenoir Very interesting. The SDK also supports SQS, which is good news.
http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-services.html
I don't think spidering entire sites is feasible... You'd end up putting a lot of effort into pages that will never be loaded.