Scrapy is the web scraper's scraper: it handles typical issues like asynchronous crawling, retrying after downtime, throttling download speeds, pagination and image downloads, generates beautiful logs, and does much, much more
You need a few modules to run Scrapy on an Ubuntu/Debian machine (I used a cloud-based Ubuntu 14.04.4 LTS)
Following are the steps (and some recommendations)
The following was executed on a vanilla DigitalOcean Ubuntu droplet (5 USD per month, 512 MB RAM). I feel this is sufficient to run a Scrapy crawler at approximately 1 HTTP request per second (with AutoThrottle and download delays turned on)
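For reference, that throttling is controlled from the project's settings.py. A minimal sketch, using standard Scrapy setting names (the values are illustrative):

AUTOTHROTTLE_ENABLED = True       # adapt the crawl rate to server response times
AUTOTHROTTLE_START_DELAY = 1.0    # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0     # maximum delay under heavy server load
DOWNLOAD_DELAY = 1.0              # baseline delay between requests to the same site

First, the system packages: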
sudo apt-get update
sudo apt-get install -y build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev python-pip libjpeg-dev
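Most of these are build dependencies: libxml2-dev and libxslt1-dev for lxml, libffi-dev and libssl-dev for the TLS stack, python-dev for compiling C extensions, and libjpeg-dev for Pillow. The Qt/OpenGL packages are not needed by Scrapy itself, so you can probably trim the list.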
I would recommend using virtualenv to keep the system Python installation pure
pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install pillow
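Pillow is what Scrapy's ImagesPipeline uses to process downloaded images. If you want image downloads, here is a minimal sketch of the settings.py entries on Scrapy 1.x (the IMAGES_STORE path is a placeholder):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/user/images'  # placeholder: directory where downloaded images land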
Now install Scrapy using pip (the version in the Ubuntu repositories is outdated)
pip install scrapy
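You can sanity-check the installation, and see which Twisted/lxml versions were pulled in, with:

scrapy version -v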
(Scrapy.org also lists an Ubuntu-specific installation process, but I ran into quite a few problems with it)
Create or clone a project, go to the project root, and run:
scrapy crawl <spider_name> -o <filename>.json
(note that the name after crawl is the spider's name attribute, not the project name)
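If you don't have a project yet, here is a minimal spider sketch, modelled on the example site used by the official tutorial (the URL, CSS selectors and field names are specific to that site; save it under your project's spiders/ directory):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # the <spider_name> used in: scrapy crawl quotes -o quotes.json
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # yield one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # follow the pagination link, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))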
If you have written one or two web crawlers or scrapers before, the best way to learn Scrapy is to work through the official Scrapy tutorial
Note: on newer Ubuntu releases you may also need to pin Twisted after installing Scrapy: pip install Twisted==16.4.1