@fedecarg
Last active December 28, 2015 14:09

Spike: Image-based BBC News website

  • Create a list of sites to visit: Arabic, Hindi, Russian, Mundo, etc.
  • Define supported screen widths, for example: 176, 208, 240, 320 and 352 pixels.
  • Define the maximum size of an image. The current average size of a mobile page is 250 KB.
  • Quality: 100% - File size: 822 KB: http://goo.gl/A4ldIF
  • Quality: 50% - File size: 272 KB: http://goo.gl/AAbrY9
  • Quality: 30% - File size: 193 KB: http://goo.gl/gOUkXK
  • Crawl sites in a distributed, scalable and efficient way (scheduling policy, async requests, politeness, message queues, path-ascending crawling).
  • Note: The app must run well on Linux and should also support test runs on OS X.
  • Identify all the links in each site and add them to a list of URLs to visit (a minimal crawl sketch follows this list).
  • Open-source C/C++ libraries: HTTrack or GNU Wget.
  • Recursively visit URLs according to a set of policies.
  • Update images only if the content changes. For this, the application needs to cache the web pages.
  • Convert web pages to images. See PhantomJS, CasperJS, webkit2png, python-webkit (a PhantomJS render sketch follows this list).
  • domain/2013/12/09/page.html -> (create and save image) -> domain/2013/12/09/page.jpg
  • domain/2013/12/09/ -> (create and save image) -> domain/2013/12/09.jpg
  • Create an image map using the HTML map element (see the image-map sketch below).
  • Transfer images and web pages to an S3 bucket (see the S3 sketch below).
  • Good logging is necessary for monitoring.
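
A minimal sketch of the link-identification step, using only Node.js core modules (the seed URL is illustrative). In practice HTTrack or Wget could handle the recursion, and a real crawler would add the politeness delays, caching and message queues noted above:

    // crawl.js -- depth-1 link extraction sketch (Node.js core modules only)
    var http = require('http');
    var urlmod = require('url');

    var seed = 'http://www.bbc.co.uk/mundo/'; // illustrative seed URL

    http.get(seed, function (res) {
        var html = '';
        res.setEncoding('utf8');
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            // Naive href extraction; resolve relative links against the seed
            var re = /href="([^"#]+)"/g, m, links = [];
            while ((m = re.exec(html)) !== null) {
                links.push(urlmod.resolve(seed, m[1]));
            }
            console.log(links.join('\n')); // candidates for the list of URLs to visit
        });
    }).on('error', function (err) {
        console.error('Request failed: ' + err.message);
    });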
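
To make the width and quality targets concrete, here is a small PhantomJS sketch that loads a page at one of the supported widths and saves it as a JPEG. The URL, output file name and quality value are assumptions, not decisions:

    // render.js -- run with: phantomjs render.js
    var page = require('webpage').create();
    var url = 'http://www.bbc.co.uk/mundo/'; // one of the sites on the list

    // One of the supported widths; the height is just an initial viewport value
    page.viewportSize = { width: 320, height: 480 };

    page.open(url, function (status) {
        if (status === 'success') {
            // JPEG at ~50% quality, in line with the 272 KB sample above
            page.render('page.jpg', { format: 'jpeg', quality: '50' });
        } else {
            console.error('Failed to load ' + url);
        }
        phantom.exit();
    });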
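
For the image-map step, one possible approach (a sketch, not a decision) is to ask PhantomJS for the bounding box of every link on the rendered page and emit a matching area element for each; the output pairs with the page.jpg produced by the render sketch above:

    // imagemap.js -- build an HTML <map> from link positions (PhantomJS)
    var page = require('webpage').create();
    page.viewportSize = { width: 320, height: 480 };

    page.open('http://www.bbc.co.uk/mundo/', function (status) {
        var areas = page.evaluate(function () {
            // One rect-shaped <area> per link, using its rendered bounding box
            return Array.prototype.map.call(document.links, function (a) {
                var r = a.getBoundingClientRect();
                var coords = [r.left, r.top, r.right, r.bottom].map(Math.round);
                return '<area shape="rect" coords="' + coords.join(',') +
                       '" href="' + a.href + '">';
            });
        });
        console.log('<img src="page.jpg" usemap="#page" alt="">');
        console.log('<map name="page">' + areas.join('\n') + '</map>');
        phantom.exit();
    });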
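
Transferring the output to S3 is a single call with the AWS SDK for Node.js (the SDK discussed in the comments below); the bucket name here is hypothetical and credentials are assumed to come from the environment:

    // upload.js -- store a rendered image in S3 (npm install aws-sdk)
    var fs = require('fs');
    var AWS = require('aws-sdk');

    var s3 = new AWS.S3();
    s3.putObject({
        Bucket: 'news-image-spike',          // hypothetical bucket name
        Key: 'domain/2013/12/09/page.jpg',   // mirrors the path scheme above
        Body: fs.readFileSync('page.jpg'),
        ContentType: 'image/jpeg'
    }, function (err) {
        if (err) { throw err; }
        console.log('Uploaded page.jpg');
    });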

Links

Converter

Crawler

sthulb commented Nov 19, 2013

I don't think spidering entire sites is feasible... You'd end up putting a lot of effort into pages that'll never be loaded.

kenoir commented Nov 19, 2013

@sthulb @JakeChampion I agree, we should identify the pages most visited and start there (maybe spidering to a depth of 1)?

kenoir commented Nov 19, 2013

fedecarg (Author) commented

Hi @sthulb @kenoir

That's a good point. The application can always whitelist or blacklist URLs, extract links and create image maps according to a predefined set of rules. The sites are relatively small; they average around 500 HTML pages each.

fedecarg (Author) commented

@kenoir Very interesting. The SDK also supports SQS, which is good news.

http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/node-services.html
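
As a rough sketch of how SQS could fit in (queue URL, region and message shape are all made up for illustration), a scheduler could enqueue render jobs for the crawler workers:

    // sqs.js -- queue a URL for rendering via SQS (npm install aws-sdk)
    var AWS = require('aws-sdk');
    var sqs = new AWS.SQS({ region: 'eu-west-1' }); // region is an assumption

    sqs.sendMessage({
        // Hypothetical queue URL
        QueueUrl: 'https://sqs.eu-west-1.amazonaws.com/123456789012/render-queue',
        MessageBody: JSON.stringify({ url: 'http://www.bbc.co.uk/mundo/', width: 320 })
    }, function (err, data) {
        if (err) { throw err; }
        console.log('Queued message ' + data.MessageId);
    });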
