Scrapy is the web-scraper's scraper: it handles typical issues such as distributed, asynchronous crawling, retrying during downtime, throttling download speeds, pagination, and image downloads; it also generates useful logs and does much, much more
You need a few modules to run Scrapy on an Ubuntu/Debian machine (I used a cloud-based Ubuntu 14.04.4 LTS)
Following are the steps (and some recommendations)
The following was executed on a vanilla DigitalOcean Ubuntu droplet (5 USD per month, 512 MB RAM). I feel this is sufficient to run a Scrapy crawler at approx 1 HTTP request per second (with auto-throttle and delays turned on)
sudo apt-get update
sudo apt-get install -y build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev python-pip libjpeg-dev
I would recommend using virtualenv to keep the system Python installation pure
pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install pillow
Now install Scrapy using pip (the version in Ubuntu's repositories is outdated)
pip install scrapy
(Scrapy.org also lists an installation process for Ubuntu but I ran into quite a few problems with this)
Create or clone a project, go to the project root and run:
scrapy crawl <spider_name> -o <filename>.json
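If you are starting from scratch rather than cloning a project, the usual workflow looks roughly like this (the project and spider names below are placeholders, not from the original article):

```shell
# Scaffold a new project (the name "myproject" is just an example)
scrapy startproject myproject
cd myproject

# Generate a spider skeleton named "example" targeting example.com
scrapy genspider example example.com

# Run the spider and export the scraped items as JSON
scrapy crawl example -o items.json
```

The `-o` flag picks the export format from the file extension, so `items.csv` or `items.jl` work the same way.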
If you have written one or two web-crawlers or scrapers before, the best way to learn Scrapy is to work through the official Scrapy tutorial
Here is a more comprehensive guide to installing Scrapy on Linux
Basic Points
Even if you can successfully install Scrapy on Linux without reading the basic points here, it is still recommended to read this section carefully, because it will give you a better understanding of Python, Scrapy, and pip.
Python Version
The Python version we usually talk about is the version number of the Python interpreter. The easy way to check it is just to type python in your terminal. The default python interpreter of my Ubuntu (16.04) is 2.7.10, and this might vary across Linux distributions. There are two major versions, python 2 and python 3, to choose from. If you do not have a solid reason to use python 2, just embrace python 3, which is the present and future of Python. On my Ubuntu 16.04, for example, python3 is already installed; typing python3 in the terminal starts it.
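A quick way to check which interpreter you have, without reading the interactive banner, is to ask the interpreter itself (the exact version string printed will of course differ per system):

```shell
# Print the interpreter's full version string
python3 --version

# Ask the interpreter for its major version number programmatically
python3 -c 'import sys; print(sys.version_info.major)'
```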
pip
pip is the preferred installer program; for example, we can install Scrapy by typing pip install Scrapy. It will handle all the dependencies for us and install them first, which is very convenient.

Quick and dirty way to install Scrapy on Linux
If you want to get started quick and dirty, just use this approach.
Now that pip has been installed, we can use it to install Python packages for us. Since this pip is located in the python3 package directory, we can use pip3 instead of pip to make our commands clearer. If you installed pip into the python2 package directory, you can of course use pip2 instead of pip. After running pip3 install scrapy, Scrapy sits in the global package directory of python3, which means it is available across all of your Python projects.
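The quick-and-dirty route described above amounts to a few commands (the apt package name is an assumption for Ubuntu 16.04; adjust for your distribution):

```shell
# Install pip for python3 from Ubuntu's repositories
sudo apt-get update
sudo apt-get install -y python3-pip

# Install Scrapy into python3's global site-packages
pip3 install scrapy

# Confirm the install worked
scrapy version
```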
More decent way to install Scrapy on Linux
Scrapy installed via the commands above is global, so it is available across all of your projects. That can be convenient at times, but it can also cause problems. So how do we install Scrapy in an isolated environment? This is why virtualenv was created. On my Ubuntu machine, only a few Python packages such as pip and virtualenv are globally available; other packages such as Scrapy and Django are installed inside virtual environments. Once pip is available, we use it to install virtualenv, create an environment, and run source scrapy_env/bin/activate to activate it. From then on, any Python package you install goes into that isolated environment, and the name of the virtualenv can be seen in the shell prompt. Scrapy is now located in the virtualenv we just created.
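Put together, the isolated setup looks something like this (the environment name scrapy_env matches the prompt mentioned above; the -p flag pinning python3 is an assumption):

```shell
# Install virtualenv globally, then everything else goes into isolated envs
pip3 install virtualenv

# Create a virtualenv named scrapy_env backed by the python3 interpreter
virtualenv -p python3 scrapy_env

# Activate it; the shell prompt now shows (scrapy_env)
source scrapy_env/bin/activate

# Inside the activated env, plain pip points at the env's interpreter
pip install scrapy
```

Run deactivate when you want to leave the environment and return to the system Python.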
Here are my other Scrapy tutorial articles; check them out as you like.
How To Create Simple Scrapy Spider: How to create a Scrapy project and a simple Scrapy spider from scratch.
Scrapy Shell Overview & Tips: How to use the Scrapy shell to help extract data, plus some tips to make the Scrapy shell more powerful.
How to use XPath with Scrapy: How to use XPath in Scrapy to extract info, and tips to help you quickly write XPath expressions.
Scrapy Selector Guide: What the Scrapy Selector is, how to create one, and how to use it with iteration.
How To Use Scrapy Item: How to define a Scrapy Item, and how to create a custom Item Pipeline to save Item data into a database.
How To Build A Real Spider: How to write a real spider that can extract data and handle pagination.