@dyerrington
Last active December 31, 2019 00:12

Data Science Immersive "Installfest"

DSI Computer Setup

Welcome to GA's Data Science Immersive! Before you start class, you'll need to download and install a few tools. Follow this guide to get your computer all set up, and let us know if you have any questions.

Operating System Concerns

While you can be a data scientist on any operating system, most practicing data scientists choose a Unix-type operating system, typically either Mac or a popular Linux distribution such as Ubuntu or Linux Mint.

  • If you are already using Mac or Linux, great! Skip ahead to Step 2 and get started with your installs.

Windows Install Instructions

Windows / PC

Skip this section if you're on Mac or Linux.

From here on out, you have two options. We recommend Anaconda for Windows, which you can install with the Python 3.6 graphical installer here (choose 64-bit or 32-bit to match your version of Windows):

https://www.anaconda.com/download/#windows

After you install Anaconda and Git Bash, follow the original guide, which is designed for Mac. In a nutshell, the abridged Anaconda setup for Windows (run from a Git Bash session) is:

Create your conda environment

conda create -n dsi python=3.6.5 anaconda

Activate your dsi environment

activate dsi

Install additional packages

conda install nb_conda=2.2.1 statsmodels=0.8.0 widgetsnbextension=3.0.3 spacy nltk gensim seaborn=0.7.1 scikit-image=0.13.1 scikit-learn=0.19.1 psycopg2 plotly bokeh ipywidgets flask django beautifulsoup4

Then install the Chrome web browser.

(Optional) Install Docker for Windows

Another option is Docker. You can be up and running quickly if your Windows system doesn't have too many firewalls or antivirus applications to reconfigure. Docker is nice to have, but if you can't get it working within an hour, use the preferred Anaconda setup.

  • Please see the guidelines for installing Docker.

After you've successfully installed Docker, create the container by running the following from Git Bash. This creates a container named dsi that you can start and stop in the future.

  1. Create the Docker container for the dsi environment, which also starts Jupyter Notebook.

    docker run -d -p 8888:8888 -v `pwd`:/home/jovyan --name dsi jupyter/scipy-notebook
    
  2. Get the URL of the Jupyter server running in your dsi container. The URL includes the token you need for access:

    docker exec dsi jupyter notebook list
    
  3. To access Jupyter, paste the URL from your terminal into your browser. It looks like http://localhost:8888/?token=b6965133171f7f5fccc788cf4f55f3a4917a07f0e816a48a, with the token replaced by the one you got in step 2.
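
If you want to pull the token out of that output programmatically, the following is a minimal sketch. The function name `extract_token` is hypothetical, and the code assumes each line printed by `jupyter notebook list` has the form `URL :: notebook directory`; check your own output if it looks different.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(line):
    """Return the token from a Jupyter server URL, or None if there is none."""
    # Assumption: the line looks like "URL :: notebook directory".
    url = line.split(" :: ")[0].strip()
    values = parse_qs(urlparse(url).query).get("token")
    return values[0] if values else None


if __name__ == "__main__":
    # Example line built from the sample URL in this guide.
    example = (
        "http://localhost:8888/"
        "?token=b6965133171f7f5fccc788cf4f55f3a4917a07f0e816a48a"
        " :: /home/jovyan"
    )
    print(extract_token(example))
```

A plain URL with no ` :: ` suffix works too, since only the query string is inspected.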

Troubleshooting Windows Docker Problems

  • If the docker command doesn't respond, make sure Docker is installed correctly and is in your path. You may also need to run the docker command from Git Bash rather than from the DOS prompt.

  • Can't connect to your notebook via web browser? Check that your firewalls allow access, or turn them off. This is a common problem, but configurations differ, so it may take some research to get working. Please be patient: Windows machines aren't the easiest to support, because the data science community has largely adopted Unix-based systems for its development environments.

Step 1: Anaconda and Python

Linux users do not do this. Your setup is complete in terms of development environment.

In our class, we'll be working closely with tools that use the Python programming language. Anaconda is a popular cross-platform tool that helps install and manage Python-related data science libraries. You may have set this up before class, perhaps as instructed by our prework platform, but it's important that we're all set up with the same version.

Previously Installed Anaconda?

Please refer to your local instructor for the proper uninstallation instructions.

  1. Download Anaconda and follow the installation instructions for your operating system. For Mac, use the macOS graphical install guide with the Python 3.6 Anaconda package file.

  2. Agree to the terms and let Anaconda go through its default installation.

  3. Anaconda should install several packages by default, including:
  • python: a programming language very popular with data scientists
  • jupyter: an interface for creating interactive Python notebooks, great for sharing analyses
  • matplotlib: a plotting library for Python
  • nltk: a toolkit for natural language processing
  • numpy: a linear algebra library
  • pip & setuptools: software to manage and install Python packages
  • scikit-learn: a toolkit for machine learning algorithms
  • scipy and statsmodels: statistical packages for Python
  • sqlite: a popular, easy-to-use database
  4. We will be using Conda virtual environments. "But why?" you might ask. Everyone has different versions of libraries, system tools, and underlying operating system resources. A Conda virtual environment smooths over those differences by giving everyone a consistent baseline development environment, which should reduce problems overall.

IMPORTANT FIRST STEP!
Verify that conda is set up and in your path. If you get a "command not found" error when you type conda in your terminal, double-check that you installed Anaconda and that your paths are set up correctly. Running source ~/.bash_profile or opening a new terminal window are common solutions, because conda may not be in your path until your shell reloads its configuration, which includes the updated path environment variable pointing to where Anaconda is installed.
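
If you prefer to check from Python, here is a minimal sketch that looks for an executable on your path using the standard library. The helper names `find_tool` and `describe` are hypothetical and only for illustration.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def describe(name):
    """Build a one-line status message for the given executable."""
    path = find_tool(name)
    if path is None:
        return f"{name}: not found on PATH (try opening a new terminal)"
    return f"{name}: found at {path}"


if __name__ == "__main__":
    print(describe("conda"))
```

Run it from the same terminal in which you plan to use conda, since each terminal session carries its own path.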

conda install nb_conda

The previous command should install the nb_conda package in your root environment. This lets Jupyter Notebook use conda environments from the "kernel" menu. The conda environment for our class will appear there once we create it shortly.

Creating and activating the conda environment
This command will create an "Anaconda Environment" called dsi: an isolated directory on your computer holding a specific version of Python (3.6.5) and the associated Python libraries for developing data science projects. This isolation lets us install and use the specific libraries and dependencies we need for class projects without affecting your base system or any other "Anaconda Environments" you may configure in the future. Using environments like this is a widely supported industry best practice for managing Python projects.

conda create -n dsi python=3.6.5 anaconda

Before we do anything in class with Python, or Jupyter notebook, please don't forget to activate your environment. This puts our development environment in context to be used.

source activate dsi
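
To confirm which environment is active from inside Python, the sketch below reads the `CONDA_DEFAULT_ENV` environment variable, which conda sets when an environment is activated. The helper names are hypothetical, and the `environ` parameter exists only so the logic can be tried without changing your real shell.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def in_expected_env(expected="dsi", environ=None):
    """True when the active conda environment matches the expected name."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if in_expected_env():
        print("dsi environment is active")
    else:
        print("dsi is not active; current environment:", active_conda_env())
```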

Update your packages to the latest

In any terminal, regardless of which directory you are in, you can install the python packages using the conda install [package name] command.

Install the following packages:

conda install nb_conda=2.2.1 statsmodels=0.8.0 widgetsnbextension=3.0.2 spacy nltk gensim seaborn=0.7.1 scikit-image=0.13.1 scikit-learn=0.19.1 psycopg2 plotly bokeh ipywidgets flask django beautifulsoup4

You should be prompted to install these packages, and you should say "y" for yes to install them. This should install successfully.

Additional Python / Conda Packages
As we need more packages, always try conda first and fall back to pip only when a package isn't available through conda. When you're not sure, ask an instructor.
Are you already familiar with pip? Check out these equivalent Conda commands.

Install Brew + Git

brew.sh is for OSX only. Linux students will use apt-get for package management, so Windows and Linux users do not need to install brew.

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Instructions are straightforward and listed on the site. Brew is a package management system for OSX. Skip this if you are on Windows.

OSX

brew install git

Step 2: Confirm Your Python Installation

  1. When you've made it this far, open up a terminal and enter the Python interpreter:

Don't forget to source activate dsi first!

$ python

Depending on your operating system, your terminal should return something like this:

Python 3.6.5 | packaged by conda-forge | (default, Apr  6 2018, 13:44:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 
  2. Next, make sure that the necessary packages are installed. For example, to check that matplotlib is installed, type at the Python prompt:

These versions may have changed slightly since our last install guide iteration. This should not be an issue as long as the versions are newer than what's listed.

>>> import matplotlib
>>> print(matplotlib.__version__)
1.5.1

You may see another version (which is OK). If you get an error like this:

>>> import matplotlib
ModuleNotFoundError: No module named 'matplotlib'

then you'll need to try to install the Python packages again.
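
To check several packages at once, the following is a minimal sketch rather than an official check script. The helper `package_version` is hypothetical. Note that some packages are imported under a different name than the one you install: scikit-learn is imported as `sklearn`, scikit-image as `skimage`, and beautifulsoup4 as `bs4`.

```python
import importlib


def package_version(module_name):
    """Return a module's version string, "unknown", or None if not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Not every module exposes __version__, so fall back to "unknown".
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # Import names, not conda package names (see the note above).
    for name in ["matplotlib", "numpy", "sklearn"]:
        version = package_version(name)
        status = "MISSING" if version is None else version
        print(f"{name}: {status}")
```

Any package reported as MISSING needs to be installed again with conda install while the dsi environment is active.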

Additional Software

  1. We'll be using Slack, a popular messaging platform, for our class communications.

Note: Add additional market & cohort-specific channel instructions here, as needed.

  2. Chrome is Google's popular web browser, and it comes with a complete set of developer tools built in. We'll use Chrome to examine code, debug scripts, and view back-end processes. If you don't already have Chrome, make sure to download and install it now.

  3. (Optional) Tmate.io is a terminal sharing application.

  • Go to the site and follow the directions.

Additional Text Editors

A data scientist frequently writes scripts to process data, perform analysis, and create visualizations, webpages, and other end products, so you'll need a good text editor. If you don't already have a preference, try Atom or Sublime. Both editors are available for most platforms. If you have your own preferences, these are only suggestions and are optional pieces of software.

Instructors should modify these options based on their preferences.

If you are on a Mac, you can install Atom with

$ brew cask install atom

Or Sublime Text with

$ brew cask install sublime-text
  1. Download the editor of your choice from its website.
  2. Install the package by double-clicking the file icon or from the command line.
  3. Run your editor from the applications menu, or from the command line, like so:
$ subl
$ atom

This example would open up Sublime or Atom, respectively. Whichever editor you choose, be sure to practice using it!

(Optional) Index Your Filesystem

To make it easy for us to help you find files on your machine, it helps if we can use the locate command, which searches an index of the files on your machine.

OSX updatedb index

On OSX, the following command schedules a daily job that runs the updatedb script to keep your locate database fresh. You only need to run it once:

sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist

Linux updatedb index

To have your Linux machine update your locate database every day, see these directions.

+ Configure Git with your Text Editor

Finally, you'll want to tell git which editor it should use for your commits.

  • If you choose to use Sublime, you would type:
$ git config --global core.editor "subl --wait --new-window"
  • If you choose to use Atom, you would type:
$ git config --global core.editor "atom --wait"

+ SSH Setup

First, check whether you already have an SSH key set up. The following command should output the contents of your public SSH key to your terminal:

cat ~/.ssh/id_rsa.pub

If you get a "file not found" error, you probably don't have an SSH key set up yet. Use the following command to create one:

$ ssh-keygen -t rsa

Accept the default at every prompt and leave the password empty. This step is required for tmate.io sessions, AWS connectivity, password-less GitHub Enterprise access, and any future secure shell connections.
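
To confirm the key file now exists, here is a minimal sketch equivalent to checking for ~/.ssh/id_rsa.pub by hand. The helper names are hypothetical, and the `home` parameter is only there so the check can be pointed at another directory.

```python
from pathlib import Path


def public_key_path(home=None, name="id_rsa.pub"):
    """Return the expected location of the public key file."""
    home = Path.home() if home is None else Path(home)
    return home / ".ssh" / name


def has_public_key(home=None):
    """True when the public key file exists at the default location."""
    return public_key_path(home).is_file()


if __name__ == "__main__":
    if has_public_key():
        print("Found public key at", public_key_path())
    else:
        print("No public key found; run ssh-keygen -t rsa first")
```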

That's it! Now you're ready to begin GA's Data Science Immersive. See you on the first day of class!
