@samiujan
Last active September 3, 2019 07:38
How to install Scrapy on Ubuntu

Scrapy is the web scraper's scraper: it handles typical issues like distributed, asynchronous crawling, retrying after downtime, throttling download speeds, pagination, and image downloads, generates useful logs, and does much, much more

You need a few modules to run Scrapy on an Ubuntu/Debian machine (I used a cloud-based Ubuntu 14.04.4 LTS)

Following are the steps (and some recommendations)

The following was executed on a vanilla DigitalOcean Ubuntu server (5 USD per month, 512 MB RAM). I feel this is sufficient for a Scrapy crawler running at approx 1 HTTP request per second (with auto-throttle and delays turned on)
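For reference, auto-throttle and download delays are configured in the project's settings.py. A minimal sketch of the relevant settings (the values here are illustrative, not from this gist):

# settings.py -- throttle Scrapy to roughly one request per second
AUTOTHROTTLE_ENABLED = True            # let Scrapy adapt delays to server response times
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for ~1 request in flight at a time
DOWNLOAD_DELAY = 1.0                   # baseline delay between requests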

sudo apt-get update
sudo apt-get install -y build-essential autoconf libtool pkg-config python-opengl python-imaging python-pyrex python-pyside.qtopengl idle-python2.7 qt4-dev-tools qt4-designer libqtgui4 libqtcore4 libqt4-xml libqt4-test libqt4-script libqt4-network libqt4-dbus python-qt4 python-qt4-gl libgle3 python-dev libffi-dev libssl-dev libxml2-dev libxslt1-dev python-pip libjpeg-dev

I would recommend using virtualenv to keep the system Python installation pure

pip install virtualenv

virtualenv venv

source venv/bin/activate

pip install pillow


Now install Scrapy using pip (the version packaged by Ubuntu is outdated)

pip install scrapy

(Scrapy.org also lists an installation process for Ubuntu but I ran into quite a few problems with this)
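To confirm the install worked, a quick sanity check from the interpreter (my addition; the version shown will depend on what pip installed):

>>> import scrapy
>>> print(scrapy.__version__)   # e.g. '1.3.3' -- output is illustrative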

Create or clone a project, go to the project root and run:

scrapy crawl <spider_name> -o <filename>.json
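Note that the name passed to scrapy crawl is the spider's name attribute, not the project name. A minimal sketch of such a spider (hypothetical names, pointed at Scrapy's demo site quotes.toscrape.com):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"   # run with: scrapy crawl quotes -o quotes.json
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # each yielded dict becomes one record in the -o output file
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }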

If you have written one or two web crawlers or scrapers before, the best way to learn Scrapy is through the official Scrapy tutorial

@kianxiongfoo

Since the new Ubuntu has been released, you missed out pip install Twisted==16.4.1 after pip install scrapy


ghost commented Apr 22, 2017

When I install Scrapy, I find I need to use
sudo pip install scrapy
even though I use virtualenv,
or it reports a permission denied error.

@michael-yin

Here is a more comprehensive guide to installing Scrapy on Linux.

Basic Points

Even if you can successfully install Scrapy on Linux without reading the basic points here, it is still recommended to read this section carefully, because you will come away with a better understanding of Python, Scrapy, and pip.

Python Version

The Python version of your env we usually talk about is the version number of the Python interpreter. The easiest way to check it is to type python in your terminal.
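You can also check from within Python itself; a small sketch (the output shown is illustrative):

# works on both Python 2 and Python 3
import sys
print(sys.version)       # e.g. '2.7.10 (default, ...)'
print(sys.version_info)  # structured form, e.g. sys.version_info(major=2, minor=7, ...)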

The default Python interpreter on my Ubuntu 16.04 is 2.7.10; the version may vary across Linux distributions. There are currently two main versions, Python 2 and Python 3, to choose from. The difference between them:

Short version: Python 2.x is legacy, Python 3.x is the present and future of the language

If you do not have a solid reason to use Python 2, just embrace Python 3, which is the present and future of the language.
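Two of the most visible differences between the versions, as a quick illustration (my example, not from the original guide):

# print is a function in Python 3 (a statement in Python 2)
print("hello world")

# / is true division in Python 3; Python 2 floors integer division
print(7 / 2)   # 3.5 on Python 3, 3 on Python 2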

On my Ubuntu 16.04, python3 is already installed; typing python3 in a terminal starts the Python 3 interpreter.

pip

pip is the preferred installer program; for example, we can install Scrapy by typing pip install Scrapy. It will resolve all the dependencies for us and install them first, which is very convenient.

Quick and dirty way to install Scrapy on Linux

If you want to get started quick and dirty, just use this method.

# Install python dependency

# For apt (ubuntu, debian...):
sudo apt-get install python3-dev

# If you still use python2 on ubuntu, debian, just use
sudo apt-get install python2-dev

# For yum (centos, redhat, fedora...):
sudo yum install python-devel

# install pip
cd ~
wget https://bootstrap.pypa.io/get-pip.py
sudo python3 get-pip.py

# As you can see, pip has now been installed into the python3 package directory
michaelyin@ubuntu:~$ pip -V
pip 9.0.1 from /usr/local/lib/python3.5/dist-packages (python 3.5)

Now that pip has been installed, we can use it to install Python packages! Since pip is located in the Python 3 package directory, we can use pip3 instead of pip to make this explicit. If you installed pip into the Python 2 package directory, you can of course use pip2 instead.

sudo pip3 install scrapy

michaelyin@ubuntu:~$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> scrapy
<module 'scrapy' from '/usr/local/lib/python3.5/dist-packages/scrapy/__init__.py'>
>>> 

As you can see, Scrapy has now been installed into the global package directory of Python 3, which means it is available across all of your Python projects.

A cleaner way to install Scrapy on Linux

Scrapy installed via the commands above is global, so it is available across all of your projects. That can be convenient at times, but it can also cause problems. So how do you install Scrapy in an isolated environment? This is what virtualenv was created for. On my Ubuntu machine, only a few Python packages such as pip and virtualenv are globally available; other packages such as Scrapy and Django are installed in virtual environments.

We can use the commands above to install pip; after that, we install virtualenv:

sudo pip3 install virtualenv
cd ~
mkdir Virtualenvs
cd Virtualenvs
virtualenv scrapy_env
# after the env was created

michaelyin@ubuntu:~/Virtualenvs$ source scrapy_env/bin/activate
(scrapy_env) michaelyin@ubuntu:~/Virtualenvs$ 

As you can see, we use source scrapy_env/bin/activate to activate the virtualenv. Now any Python package you install will go into this isolated env, and the name of the virtualenv can be seen in the shell prompt.
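One way to verify the isolation from within Python (a sketch; the paths will match wherever you created the env):

# run inside the activated virtualenv
import sys
print(sys.prefix)       # e.g. /home/<user>/Virtualenvs/scrapy_env
print(sys.executable)   # the virtualenv's own python binary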

pip3 install scrapy

(scrapy_env) michaelyin@ubuntu:~/Virtualenvs$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import scrapy
>>> scrapy
<module 'scrapy' from '/home/michaelyin/Virtualenvs/scrapy_env/lib/python3.5/site-packages/scrapy/__init__.py'>

From the output above, you can see that Scrapy is now located in the virtualenv we just created.

I have also written some Scrapy tutorial articles; check them out as you like.
