@fedecarg
Last active December 28, 2015 14:09

Spike: Image-based BBC News website

  • Create list of sites to visit: arabic, hindi, russian, mundo, etc.
  • Define supported screen widths, for example: 176, 208, 240, 320, 352.
  • Define the maximum size of an image. The current average size of a mobile page is 250 KB.
      • Quality: 100% - File size: 822 KB: http://goo.gl/A4ldIF
      • Quality: 50% - File size: 272 KB: http://goo.gl/AAbrY9
      • Quality: 30% - File size: 193 KB: http://goo.gl/gOUkXK
  • Crawl sites in a distributed, scalable and efficient way (scheduling policy, async requests, politeness, message queues, path-ascending crawling).
  • Note: The app must run reliably on Linux and should also support test runs on OS X.
  • Identify all the links in each site and add them to a list of URLs to visit.
  • Open-source C/C++ libraries: HTTrack or GNU Wget.
  • Recursively visit URLs according to a set of policies.
  • Update images only if the content changes. For this, the application needs to cache the web pages.
  • Convert web pages to images. See PhantomJS, CasperJS, webkit2png, or python-webkit.
      • domain/2013/12/09/page.html -> (create and save image) -> domain/2013/12/09/page.jpg
      • domain/2013/12/09/ -> (create and save image) -> domain/2013/12/09.jpg
  • Create an image map using the HTML map element.
  • Transfer images and web pages to an S3 bucket.
  • Good logging is necessary for monitoring.
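
Three of the steps above (identifying links, mapping a page URL to its image path, and updating images only when content changes) could be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the spike's implementation: the function names, the SHA-256 digest, and the in-memory dict used as a cache are all assumptions made for the example.

```python
import hashlib
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the list of URLs to visit found in one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def image_path_for(url):
    """Map a page URL to the path of its rendered image."""
    parts = urlsplit(url)
    if parts.path.endswith("/"):
        # Directory index: domain/2013/12/09/ -> domain/2013/12/09.jpg
        stem = parts.path.rstrip("/")
    else:
        # Page: domain/2013/12/09/page.html -> domain/2013/12/09/page.jpg
        stem, _ext = posixpath.splitext(parts.path)
    return parts.netloc + stem + ".jpg"


def needs_render(url, html, cache):
    """Return True (and update the cache) if the page content has changed.

    `cache` is assumed to be a dict mapping URL -> content digest; a real
    crawler would likely persist this somewhere durable.
    """
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if cache.get(url) == digest:
        return False
    cache[url] = digest
    return True
```

For example, `image_path_for("http://domain/2013/12/09/")` returns `domain/2013/12/09.jpg`, matching the mapping listed above. Hashing the raw HTML is the simplest possible change check; pages with timestamps or rotating ads would need the volatile parts stripped before hashing.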


@fedecarg

@kenoir Very interesting. The SDK also supports SQS, which is good news.

http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-services.html
