@fedecarg
Last active December 28, 2015 14:09

Spike: Image-based BBC News website

  • Create a list of sites to visit: Arabic, Hindi, Russian, Mundo, etc.
  • Define supported screen widths, for example: 176, 208, 240, 320 and 352 pixels.
  • Define the maximum size of an image. The current average size of a mobile page is 250 KB.
  • Quality: 100% - File size: 822 KB: http://goo.gl/A4ldIF
  • Quality: 50% - File size: 272 KB: http://goo.gl/AAbrY9
  • Quality: 30% - File size: 193 KB: http://goo.gl/gOUkXK
  • Crawl sites in a distributed, scalable and efficient way (scheduling policy, async requests, politeness, message queues, path-ascending crawling).
  • Note: The app must run well on Linux and should also support test runs on OS X.
  • Identify all the links in each site and add them to a list of URLs to visit (a minimal crawl sketch follows this list).
  • Open-source C/C++ libraries: HTTrack or GNU Wget.
  • Recursively visit URLs according to a set of policies.
  • Update images only if the content changes. For this, the application needs to cache the web pages.
  • Convert web pages to images. See PhantomJS, CasperJS, webkit2png, python-webkit (a PhantomJS render sketch follows this list).
  • domain/2013/12/09/page.html -> (create and save image) -> domain/2013/12/09/page.jpg
  • domain/2013/12/09/ -> (create and save image) -> domain/2013/12/09.jpg
  • Create an image map using the HTML map element (see the image-map sketch below).
  • Transfer images and web pages to an S3 bucket (see the S3 sketch below).
  • Good logging is necessary for monitoring.
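
A minimal sketch of the link-identification step, using only Node.js core modules (the seed URL is illustrative). In practice HTTrack or Wget could handle the recursion, and a real crawler would add the politeness delays, caching and message queues noted above:

    // crawl.js -- depth-1 link extraction sketch (Node.js core modules only)
    var http = require('http');
    var urlmod = require('url');

    var seed = 'http://www.bbc.co.uk/mundo/'; // illustrative seed URL

    http.get(seed, function (res) {
        var html = '';
        res.setEncoding('utf8');
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            // Naive href extraction; resolve relative links against the seed
            var re = /href="([^"#]+)"/g, m, links = [];
            while ((m = re.exec(html)) !== null) {
                links.push(urlmod.resolve(seed, m[1]));
            }
            console.log(links.join('\n')); // candidates for the list of URLs to visit
        });
    }).on('error', function (err) {
        console.error('Request failed: ' + err.message);
    });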
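
To make the width and quality targets concrete, here is a small PhantomJS sketch that loads a page at one of the supported widths and saves it as a JPEG. The URL, output file name and quality value are assumptions, not decisions:

    // render.js -- run with: phantomjs render.js
    var page = require('webpage').create();
    var url = 'http://www.bbc.co.uk/mundo/'; // one of the sites on the list

    // One of the supported widths; the height is just an initial viewport value
    page.viewportSize = { width: 320, height: 480 };

    page.open(url, function (status) {
        if (status === 'success') {
            // JPEG at ~50% quality, in line with the 272 KB sample above
            page.render('page.jpg', { format: 'jpeg', quality: '50' });
        } else {
            console.error('Failed to load ' + url);
        }
        phantom.exit();
    });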
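
For the image-map step, one possible approach (a sketch, not a decision) is to ask PhantomJS for the bounding box of every link on the rendered page and emit a matching area element for each; the output pairs with the page.jpg produced by the render sketch above:

    // imagemap.js -- build an HTML <map> from link positions (PhantomJS)
    var page = require('webpage').create();
    page.viewportSize = { width: 320, height: 480 };

    page.open('http://www.bbc.co.uk/mundo/', function (status) {
        var areas = page.evaluate(function () {
            // One rect-shaped <area> per link, using its rendered bounding box
            return Array.prototype.map.call(document.links, function (a) {
                var r = a.getBoundingClientRect();
                var coords = [r.left, r.top, r.right, r.bottom].map(Math.round);
                return '<area shape="rect" coords="' + coords.join(',') +
                       '" href="' + a.href + '">';
            });
        });
        console.log('<img src="page.jpg" usemap="#page" alt="">');
        console.log('<map name="page">' + areas.join('\n') + '</map>');
        phantom.exit();
    });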
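
Transferring the output to S3 is a single call with the AWS SDK for Node.js (the SDK discussed in the comments below); the bucket name here is hypothetical and credentials are assumed to come from the environment:

    // upload.js -- store a rendered image in S3 (npm install aws-sdk)
    var fs = require('fs');
    var AWS = require('aws-sdk');

    var s3 = new AWS.S3();
    s3.putObject({
        Bucket: 'news-image-spike',          // hypothetical bucket name
        Key: 'domain/2013/12/09/page.jpg',   // mirrors the path scheme above
        Body: fs.readFileSync('page.jpg'),
        ContentType: 'image/jpeg'
    }, function (err) {
        if (err) { throw err; }
        console.log('Uploaded page.jpg');
    });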

Links

Converter

Crawler

sthulb commented Nov 19, 2013

I don't think spidering entire sites is feasible... You'd end up putting a lot of effort into pages that'll never be loaded.

kenoir commented Nov 19, 2013

@sthulb @JakeChampion I agree, we should identify the pages most visited and start there (maybe spidering to a depth of 1)?

kenoir commented Nov 19, 2013

fedecarg (Author) commented

Hi @sthulb @kenoir

That's a good point. The application can always whitelist or blacklist URLs, extract links and create image maps according to a predefined set of rules. The sites are relatively small; they average around 500 HTML pages each.

fedecarg (Author) commented

@kenoir Very interesting. The SDK also supports SQS, which is good news.

http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-services.html
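
As a rough sketch of how SQS could fit in (queue URL, region and message shape are all made up for illustration), a scheduler could enqueue render jobs for the crawler workers:

    // sqs.js -- queue a URL for rendering via SQS (npm install aws-sdk)
    var AWS = require('aws-sdk');
    var sqs = new AWS.SQS({ region: 'eu-west-1' }); // region is an assumption

    sqs.sendMessage({
        // Hypothetical queue URL
        QueueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/render-queue',
        MessageBody: JSON.stringify({ url: 'http://www.bbc.co.uk/mundo/', width: 320 })
    }, function (err, data) {
        if (err) { throw err; }
        console.log('Queued message ' + data.MessageId);
    });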
