Welcome to GA's Data Science Immersive! Before you start class, you'll need to download and install a few tools. Follow this guide to get your computer all set up, and let us know if you have any questions.
While you can be a data scientist on any operating system, most practicing data scientists choose a Unix-type operating system, typically either macOS or a popular Linux distribution such as Ubuntu or Linux Mint.
- If you are already using Mac or Linux, great! Skip ahead to Step 2 and get started with your installs.
Skip this section if you're on Mac or Linux.
- Install Git Bash (32-bit or 64-bit, depending on your version of Windows): https://www.youtube.com/watch?v=rWboGsc6CqI
From here on out, you have two options. We recommend using Anaconda for Windows, which you can install with the Python 3.6 graphical installer here (use the 64-bit or 32-bit version depending on your flavor of Windows):
https://www.anaconda.com/download/#windows
After you install Anaconda and Git Bash, follow the original guide, which is designed for Mac. In a nutshell, the abridged version of the Anaconda setup for Windows (run from a Git Bash session) is:
Create your conda environment
conda create -n dsi python=3.6.5 anaconda
Activate your dsi environment
activate dsi
Install additional packages
conda install nb_conda=2.2.1 statsmodels=0.8.0 widgetsnbextension=3.0.3 spacy nltk gensim seaborn=0.7.1 scikit-image=0.13.1 scikit-learn=0.19.1 psycopg2 plotly bokeh ipywidgets flask django beautifulsoup4
Then, just install Chrome web browser.
Another option to consider is Docker. You can be up and running in a very short time, provided your Windows system doesn't have too many firewalls or anti-virus applications to reconfigure. This is nice to have, but if you can't get it to work within an hour, use the preferred Anaconda setup.
- Please see the guidelines for installing Docker.
After you've successfully installed Docker, create the container by running the following from Git Bash. This will create a container called dsi that you can start and stop in the future.
- Create a Docker container for the dsi environment, then start Jupyter notebook:
docker run -d -p 8888:8888 -v `pwd`:/home/jovyan --name dsi jupyter/scipy-notebook
- Get the URL of the running Jupyter server on your dsi container, which includes the token necessary to get access:
docker exec dsi jupyter notebook list
- To access Jupyter, paste the URL from your terminal into your browser. It looks like this:
http://localhost:8888/?token=b6965133171f7f5fccc788cf4f55f3a4917a07f0e816a48a
The token parameter shown here is only an example. Use the token from the output of step 2.
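If you would rather pull the token out of the listing than copy it by hand, here is a minimal Python sketch. It assumes each server line printed by the listing command has the form "URL :: notebook directory". The helper name extract_token is hypothetical, and the sample line is built from the example URL in this guide.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(listing_line):
    """Return the token query parameter from one line of server-list output, or None."""
    # Assumed line format: "<url> :: <notebook directory>".
    # The URL is the first whitespace-separated field.
    fields = listing_line.split()
    if not fields:
        return None
    query = parse_qs(urlparse(fields[0]).query)
    tokens = query.get("token")
    return tokens[0] if tokens else None


# Sample line built from the example URL in this guide.
sample = (
    "http://localhost:8888/"
    "?token=b6965133171f7f5fccc788cf4f55f3a4917a07f0e816a48a"
    " :: /home/jovyan"
)
print(extract_token(sample))
```

Lines without a URL, such as a header line, simply return None.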
- If the docker command doesn't respond, make sure you have installed Docker correctly and that it's in your path. Also make sure you run the docker command from Git Bash rather than the DOS command prompt.
- Can't connect to your notebook via web browser? Check that your firewalls are allowing access, or turn them off. This is a common problem, but not all configurations are the same, so it may take some research to get working. Please be patient: Windows machines aren't the easiest to support, because data science development environments have largely standardized on Unix-based systems.
Linux users do not do this. Your setup is complete in terms of development environment.
In our class, we'll be working closely with tools that use the Python programming language. Anaconda is a popular cross-platform tool that helps install and manage Python-related data science libraries. While you may have set this up prior to the class, perhaps as instructed by our prework platform, it's important that we're all set up with the same version for class.
Previously Installed Anaconda?
Please refer to your local instructor for the proper uninstallation instructions.
- Download Anaconda and follow the installation instructions for your operating system. For Mac, use the macOS graphical install guide with the Python 3.6 Anaconda package file.
- Agree to the terms and let Anaconda go through its default installation.
- Anaconda should install several packages by default, including:
- python: a programming language very popular with data scientists
- jupyter: an interface for creating interactive python notebooks, great for sharing analyses
- matplotlib: a plotting library for python
- nltk: a toolkit for natural language processing
- numpy: a linear algebra library
- pip & setuptools: software to manage and install python packages
- scikit-learn: a toolkit for machine learning algorithms
- scipy and statsmodels: statistical packages for python
- sqlite: a popular, easy to use database
- We will be using Conda virtual environments. "But why?" you might ask. Everyone has different versions of libraries, system tools, and underlying operating system resources. A Conda virtual environment smooths over the differences between everyone's systems by providing a consistent baseline development environment, and should reduce problems overall.
IMPORTANT FIRST STEP!
Verify that conda is set up and in your path. If you get a command not found error when you type conda in your terminal, double check that you installed Anaconda and that your paths are set up correctly (running source ~/.bash_profile or opening a new terminal window are common solutions). conda may not be in your path until you reload your shell configuration, which includes the updated PATH environment variable that points to where Anaconda is installed.
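The same path check can be sketched in Python, assuming some Python 3 interpreter is already available in your terminal. The helper name find_on_path is hypothetical. shutil.which searches the directories listed in your PATH, much as the shell does when you type a command.

```python
import shutil


def find_on_path(command):
    """Return the full path of `command` if it is found on PATH, else None."""
    return shutil.which(command)


conda_path = find_on_path("conda")
if conda_path:
    print("conda found at", conda_path)
else:
    print("conda not found on PATH; open a new terminal or re-check the Anaconda install")
```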
conda install nb_conda
The previous command should install the nb_conda package in your root system. This enables Jupyter notebook to use conda environments from the "kernel" menu. The conda environment for our class will appear there once we create it shortly.
Creating and activating the conda environment
This command will create an "Anaconda Environment" called dsi, which isolates a specific directory on your computer with a specific version of Python (3.6.5) and associated Python libraries for developing data science projects. This isolation allows us to install and use specific libraries and dependencies for the projects we will build in class without impacting your base system or other "Anaconda Environments" we may want to configure and use in the future. Using these types of environments is an industry best practice for managing Python projects.
conda create -n dsi python=3.6.5 anaconda
Before we do anything in class with Python, or Jupyter notebook, please don't forget to activate your environment. This puts our development environment in context to be used.
source activate dsi
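To confirm that the environment is active, one option is a short Python check like the sketch below. It relies on conda's activation exporting the CONDA_DEFAULT_ENV environment variable. The helper name active_conda_env is hypothetical.

```python
import os
import sys


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is set."""
    # conda's activation exports CONDA_DEFAULT_ENV with the environment name.
    return environ.get("CONDA_DEFAULT_ENV")


env_name = active_conda_env()
print("active conda environment:", env_name)
print("python version:", ".".join(str(part) for part in sys.version_info[:3]))
if env_name != "dsi":
    print("the dsi environment does not appear to be active")
```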
Update your packages to the latest
In any terminal, regardless of which directory you are in, you can install Python packages using the conda install [package name] command.
Install the following packages:
conda install nb_conda=2.2.1 statsmodels=0.8.0 widgetsnbextension=3.0.2 spacy nltk gensim seaborn=0.7.1 scikit-image=0.13.1 scikit-learn=0.19.1 psycopg2 plotly bokeh ipywidgets flask django beautifulsoup4
You should be prompted to install these packages, and you should say "y" for yes to install them. This should install successfully.
Additional Python / Conda Packages
As we need more packages, please try the conda system first, and fall back to pip only when conda cannot install a package. When you're not sure, ask an instructor.
Are you already familiar with pip? Check out these equivalent Conda commands.
brew.sh OSX Only. Linux students will use apt-get for package management. Windows / Linux users do not need to install brew.
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Instructions are straightforward and listed on the site. Brew is a package management system for OSX. Skip this if you are on Windows.
brew install git
- When you've made it this far, open up a terminal and enter the Python interpreter:
Don't forget to source activate dsi first!
$ python
Depending on your operating system, your terminal should return something like this:
Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 13:44:09)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
- Next, make sure that the necessary packages are installed. These versions may have changed slightly since our last install guide iteration; this should not be an issue as long as the versions are newer than what's listed. For example, to check that matplotlib is installed, type at the Python prompt:
>>> import matplotlib
>>> print(matplotlib.__version__)
1.5.1
You may see another version (which is OK). If you get an error like this:
>>> import matplotlib
ModuleNotFoundError: No module named 'matplotlib'
then you'll need to try to install the Python packages again.
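To check several packages in one pass instead of importing them one at a time, a small script along these lines may help. It is a sketch rather than an official checker: the PACKAGES list is an illustrative subset of what this guide installs, and check_packages is a hypothetical helper name. Note that some import names differ from the conda package names (scikit-learn is imported as sklearn, beautifulsoup4 as bs4).

```python
import importlib

# Import names for a few of the packages this guide installs (illustrative subset).
PACKAGES = ["matplotlib", "numpy", "scipy", "sklearn", "statsmodels", "seaborn", "nltk", "bs4"]


def check_packages(names):
    """Try to import each name; map it to its version string, or None if the import fails."""
    results = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None
        else:
            # Not every module exposes __version__, so fall back to a placeholder.
            results[name] = getattr(module, "__version__", "unknown")
    return results


for name, version in check_packages(PACKAGES).items():
    print(name, version if version is not None else "MISSING")
```

Run it with the dsi environment activated. Any package reported as MISSING needs to be installed again.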
- We'll be using Slack, a popular messaging platform, for our class communications.
- Click on the installation instructions for your platform to install the Slack desktop app. You can also sign into Slack using a web interface or via their mobile app!
Note: Add additional market & cohort-specific channel instructions here, as needed.
- Chrome is Google's popular web browser, and it comes with a complete set of developer tools built-in. We'll use Chrome to examine code, debug scripts, and view back-end processes. If you don't already have Chrome, make sure to download and install it now.
- (Optional) Tmate.io is a terminal sharing application.
- Go to the site and follow the directions.
A data scientist frequently writes scripts to process data, perform analysis, and create visualizations, webpages, and other end products, so you'll need a good text editor. If you don't already have a preference, try Atom or Sublime. Both editors are available for most platforms. If you have your own preferences, these are only suggestions and are optional pieces of software.
Instructors should modify these options based on their preferences.
If you are on a Mac, you can install Atom with
$ brew cask install atom
Or Sublime Text with
$ brew cask install sublime-text
- Download the editor of your choice from their website.
- Install the package by double clicking the file icon or from the command line
- Run your editor from the applications menu, or from the command line, like so:
$ subl
$ atom
This example would open up Sublime or Atom, respectively. Whichever editor you choose, be sure to practice using it!
To make it easy for us to help you find files on your machine, it's essential that we can use the locate command. This command searches an index of the files on your machine so they are easier to find.
To keep your locate database fresh on OSX, schedule the daily job that runs your updatedb script. This command only needs to be run once:
sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist
To have your Linux machine update your locate database every day, see these directions.
Finally, you'll want to tell git which editor it should use for your commits.
- If you choose to use Sublime, you would type:
$ git config --global core.editor "subl --wait --new-window"
- If you choose to use Atom, you would type:
$ git config --global core.editor "atom --wait"
Check that you have an SSH key set up first. The following command should output the contents of your public SSH key to your terminal:
cat ~/.ssh/id_rsa.pub
If you get a file not found error, you probably don't have an SSH key set up yet. Use the following command to create one:
$ ssh-keygen -t rsa
Use all defaults, with no passphrase, for all prompts. This step is necessary for tmate.io sessions, AWS connectivity, password-less GitHub Enterprise access, and any future connections over secure shell.
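As a quick way to confirm the key exists, the sketch below looks for the public key at the default location used in this guide (~/.ssh/id_rsa.pub). The helper name public_key_path is hypothetical.

```python
from pathlib import Path


def public_key_path(home=None):
    """Return the path where this guide expects the public RSA key to live."""
    home = Path(home) if home is not None else Path.home()
    return home / ".ssh" / "id_rsa.pub"


key = public_key_path()
if key.is_file():
    print("found public key:", key)
else:
    print("no key at", key, "- generate one with ssh-keygen")
```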
That's it! Now you're ready to begin GA's Data Science Immersive. See you on the first day of class!