- Create list of sites to visit: arabic, hindi, russian, mundo, etc.
- Define supported screen widths, for example: 176, 208, 240, 320, 352.
- Define the maximum size of an image. The current average size of a mobile page is 250 KB.
- Quality: 100% - File size: 822 KB: http://goo.gl/A4ldIF
- Quality: 50% - File size: 272 KB: http://goo.gl/AAbrY9
- Quality: 30% - File size: 193 KB: http://goo.gl/gOUkXK
- Crawl sites in a distributed, scalable and efficient way (scheduling policy, async requests, politeness, message queues, path-ascending crawling).
- Note: The app must run well on Linux and should also support test runs on OS X.
- Identify all the links in each site and add them to a list of URLs to visit.
- Open-source C/C++ libraries: HTTrack or GNU Wget.
- Recursively visit URLs according to a set of policies.
- Update images only if the content changes. For this, the application needs to cache the web pages.
- Convert web pages to images. See PhantomJS, CasperJS, webkit2png, python-webkit.
- domain/2013/12/09/page.html -> (create and save image) -> domain/2013/12/09/page.jpg
- domain/2013/12/09/ -> (create and save image) -> domain/2013/12/09.jpg
- Create image-map using HTML map element.
- Transfer images and web pages to an S3 bucket.
- Good logging is necessary for monitoring.
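The politeness requirement above can be sketched as a URL frontier that throttles fetches per host. This is a minimal illustration, not the crawler itself: the class name, the 1-second default delay, and the injectable clock are all assumptions made for the sketch.

```python
import time
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    """URL frontier enforcing a per-host politeness delay (sketch)."""

    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock            # injectable so tests can be deterministic
        self.queues = {}              # host -> deque of pending URLs
        self.next_allowed = {}        # host -> earliest permitted fetch time
        self.seen = set()             # dedupe: each URL is visited once

    def add(self, url):
        """Queue a URL unless it has already been seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        host = urlparse(url).netloc
        self.queues.setdefault(host, deque()).append(url)
        return True

    def pop(self):
        """Return the next URL whose host is ready, or None if all throttled."""
        now = self.clock()
        for host, queue in self.queues.items():
            if queue and self.next_allowed.get(host, 0) <= now:
                self.next_allowed[host] = now + self.delay
                return queue.popleft()
        return None
```

A real implementation would combine this with async requests and a message queue (e.g. SQS, as discussed below), but the scheduling idea is the same: one logical queue per host, released no faster than the politeness delay allows.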
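The page-to-image naming convention in the list (page.html becomes page.jpg, a directory index becomes a .jpg named after the directory) can be expressed as a pure function. The function name is hypothetical; only the mapping itself comes from the notes.

```python
import posixpath
from urllib.parse import urlparse

def image_path_for(url):
    """Map a crawled page URL to the storage path of its rendered image.

    domain/2013/12/09/page.html -> domain/2013/12/09/page.jpg
    domain/2013/12/09/          -> domain/2013/12/09.jpg
    """
    parts = urlparse(url)
    path = parts.path
    if path.endswith("/"):
        # Directory index: name the image after the directory itself.
        path = path.rstrip("/") + ".jpg"
    else:
        # Regular page: swap the extension for .jpg.
        root, _ext = posixpath.splitext(path)
        path = root + ".jpg"
    return parts.netloc + path
```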
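The "update images only if the content changes" step implies comparing each fetched page against its cached version. One cheap way, sketched here under the assumption that a content hash is good enough for change detection, is to keep a digest per URL and re-render only on mismatch (the helper name is made up for illustration):

```python
import hashlib

def needs_render(url, html, cache):
    """Return True if the page changed since the cached version.

    `cache` maps a URL to the SHA-256 digest of its last fetched HTML.
    Re-rendering the image is only needed when this returns True.
    """
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if cache.get(url) == digest:
        return False          # unchanged: keep the existing image
    cache[url] = digest       # record the new version
    return True
```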
Last active: December 28, 2015 14:09
@kenoir Very interesting. The SDK also supports SQS, which is good news.
http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-services.html
Hi @sthulb @kenoir
That's a good point. The application can always whitelist or blacklist URLs, extract links and create image-maps according to a predefined set of rules. The sites are relatively small, averaging 500 HTML pages each.
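The whitelist/blacklist idea could look something like the following sketch, assuming regex rules, that blacklist entries win over whitelist entries, and that an empty whitelist means "allow anything not blacklisted" (all of these are assumptions, not decisions from this thread):

```python
import re

def allowed(url, whitelist, blacklist):
    """Decide whether a URL should be crawled under simple regex rules."""
    # Blacklist takes precedence: any match rejects the URL outright.
    if any(re.search(pattern, url) for pattern in blacklist):
        return False
    # An empty whitelist is treated as "allow everything else".
    if not whitelist:
        return True
    return any(re.search(pattern, url) for pattern in whitelist)
```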