hkilter/Mac for data science.md

Setting up OS X for Data Science

I had to reinstall my laptop, and at the same time a new member was joining the team. So I started writing this as a tutorial, or checklist, on how to set up OS X on a new MacBook Pro for typical data science development. It is geared towards Scala-based development and Spark, as that's what we do at the moment. However, I'll start slightly more generally and will add some other things too. Let's start from the basics...
OS X

OS X is great for data science. However, out of the box it's missing some configuration and apps that you'll need. Let's get started.
We need a good package manager, a text editor, Git source control, code editors and so on. But first we'll look at the command line, Terminal.
Terminal

Open up Terminal. If you don't know where to find it, open Spotlight search and type Terminal into it. Now, right-click on its icon in the Dock and select Options > Keep in Dock. This way, it's always there when you need it. And you'll need it.
Now, if you haven't used it before, it opens with the default white profile, called Basic. Open Preferences, go to the Profiles tab and check out the different profiles. Find the one you like and make it the default. I always recommend Pro.
However, Bash still needs some tuning. Yes, you can tune it for the rest of your life if you wish, but there are a few things I recommend. You need to create a file called .bash_profile in your home directory. This can be done, for instance, using Emacs. In the open terminal, type emacs ~/.bash_profile and then type in, for example, the following:
# Paths
# At some point, you will create shell scripts and want to 
# access those from all directories. Or you'll install software
# from command line. It's customary to put these in "bin"
# directory under the home dir. This adds it to the path.
# ($HOME is used instead of ~ because ~ is not expanded inside quotes.)
export PATH="$HOME/bin:$PATH"

# General aliases
# The following shows the directory listing with directories
# and the listing is in long format and colorized. I.e. you'll
# see different types (such as symbolic links) easier.
alias ll='ls -la -G'
# Same listing, but with file sizes in a human-readable form
alias lh='ls -lah -G'
Open a new Terminal window (or run source ~/.bash_profile) for the changes to take effect. Now the terminal should look nicer, and you can list the contents of a directory by just typing ll, which is much nicer than the default ls :-)
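If the profile gets sourced more than once (for example from a nested shell), a plain export keeps adding the same directory to the path. The following is a minimal sketch of a guard against that. The helper name path_prepend is made up for this example and is not part of Bash.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is not there yet.
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, nothing to do
    *) PATH="$1:$PATH" ;;     # otherwise put it in front
  esac
}

# Use it in .bash_profile instead of the plain export line
path_prepend "$HOME/bin"
export PATH
```

Surrounding both the path and the pattern with colons means a directory is matched only as a whole entry, so /usr/bin does not count as a match for /usr/bin/extra.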
Homebrew

Homebrew, the package manager Apple forgot. Before we rush over to get it, we need to install Apple's own Xcode, as Homebrew depends on it. So, open the App Store and search for Xcode. Yeah, it's big. Get a cup of coffee. Once it's installed, open it once to accept the license.
Now head over to the Homebrew website and follow the instructions for installing Homebrew. Yes, the part where you paste that funny command into your terminal window. Do that.
Proper text editor

Yes, you've already got Emacs. What else do you need? Nothing really.
But I'll still recommend Sublime Text. Note that you are free to evaluate Sublime Text, but it is paid software. Still, I recommend that you download it at least for testing. It's a great editor and can handle the larger text files that you are going to come across.
Scala development

As said in the beginning, we do this slightly from the Scala point of view. We'll be getting to Spark soon, but let's start by installing Scala. For that we'll first need Java. The easiest way to install it on a fresh OS X is to type java in the Terminal. The Mac will then open a prompt telling you that you need to go and download it. Click on the button that takes you to Oracle. Select the latest JDK installer for Mac OS X and install that. After the installation, you can check that it's there by running java -version in the Terminal.
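The version check can also be scripted. Here is a rough sketch: java -version writes to stderr, so the output has to be redirected before it can be filtered. The helper name java_version_string is invented for this example.

```shell
# Hypothetical helper: pull the quoted version string out of `java -version` output.
java_version_string() {
  sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

if command -v java >/dev/null 2>&1; then
  # `java -version` prints to stderr, hence the 2>&1
  java -version 2>&1 | java_version_string || true
else
  echo "java not found on PATH"
fi
```

On OS X, /usr/libexec/java_home prints the home directory of the default JDK, which is the kind of path IntelliJ IDEA asks for as the Project SDK.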
Scala uses sbt for builds. So, let's get that. We'll use Homebrew to do that. In Terminal, type
brew install sbt
Now, I believe that sbt will always pull the appropriate Scala version into your projects. However, you will want to use the REPL when developing, so let's install Scala as well:
brew install scala --with-docs
Now you'll get into the Scala REPL by typing scala in the terminal.
Now we want to install IntelliJ IDEA Community Edition. Head over to the JetBrains IntelliJ IDEA download page and select the OS X installer for Community Edition. The latest version already asks during setup whether you want to install the Scala plugin. You do.
When you start your first Scala project, you will need to specify the Project SDK and the Scala SDK. For the Project SDK, select the Java that you installed, something like /Library/Java/JavaVirtualMachines/jdk1.8.0_92.jdk/Contents/Home. For the Scala SDK, if you create a new basic Scala project, you can either choose from the list of Scala versions offered by IDEA (you can download any version) or use the one that you installed with Homebrew (/usr/local/opt/scala/idea). For most projects you'll use sbt, and if you create a new sbt-based project, you'll notice that the Scala version is actually defined in the build.sbt file.
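To see where the Scala version lives in an sbt project, here is a small sketch that creates a bare project skeleton from the terminal. The helper name new_sbt_project, the version number 0.1.0 and the default Scala version 2.11.8 are all placeholders chosen for this example, not values prescribed by this guide.

```shell
# Hypothetical helper: create a bare sbt project skeleton in the given directory.
# Usage: new_sbt_project <directory> [scala-version]
new_sbt_project() {
  dir=$1
  scala_version=${2:-2.11.8}   # placeholder default, pick what your cluster uses
  mkdir -p "$dir/src/main/scala" "$dir/project"
  # build.sbt is where sbt reads the project's Scala version from
  cat > "$dir/build.sbt" <<EOF
name := "$(basename "$dir")"

version := "0.1.0"

scalaVersion := "$scala_version"
EOF
}
```

Running new_sbt_project ~/code/demo and then sbt console inside that directory should give a REPL for the Scala version named in build.sbt, regardless of which Scala Homebrew installed.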
Version control

IntelliJ IDEA has support for version control systems, and you'll probably use it, as it's nice to see in the code which lines have changed. However, not all of your coding may happen in IntelliJ, so I recommend installing SourceTree for Git, Mercurial and GitHub access. Head over to Atlassian SourceTree to download the app. Unfortunately, you will need to create an account and download a license. However, it's a good app, so this time the burden is acceptable.
Apache Spark

Installing Apache Zeppelin

Now, there is a thing called Zeppelin. It's a notebook-style environment that lets us write Spark code, evaluate it and draw simple graphs directly. We'll focus a little on that.
The main page for Apache Zeppelin contains releases that you can download and use. But as it's still incubating, we feel more adventurous and head over to GitHub to get the latest development version. You'll notice that there is a button above the file list on the right-hand side that looks like a download-to-monitor icon. Click on that. It should open the previously installed SourceTree, which should ask you where to clone this project. Select a suitable place and go ahead and clone it. Then, in the Terminal, change to that directory.
Now you should follow the instructions on the Zeppelin GitHub page. First, it seems that we need Maven, Node.js and its package manager npm. Again, we'll use Homebrew for this:
brew install maven
brew install node
You can check the installations with mvn -version and node --version in the terminal.
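With several tools installed by now, a quick loop can show which of them the shell actually finds. This is only a sketch; the helper name check_tools is made up, and the list of tools is simply everything installed so far in this guide.

```shell
# Hypothetical helper: report which of the given commands are available on PATH.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
    fi
  done
}

check_tools brew java sbt scala mvn node npm
```

Anything reported as missing either failed to install or lives in a directory that is not on the PATH yet.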
Node.js is a very cool JavaScript runtime for lightweight and efficient service development. If you are planning on creating REST APIs, for example, you might want to take a look at it. However, we'll continue with Zeppelin.
As we went with the latest GitHub version, we need to build it. There are many options. We'll do a quite basic one: take the latest Spark version (at the moment, it's 1.6) and include PySpark too.
mvn clean package -Pspark-1.6 -Ppyspark -DskipTests
If you have an external cluster that you want to connect to, then you most likely need Hadoop or YARN support in there too. But for now we'll go with this simple build. Go ahead; it will take some time to download and compile.
Once the laptop fans power down and it says BUILD SUCCESS, we can try it out. First, start it up:
bin/zeppelin-daemon.sh start
Then open up a browser at http://localhost:8080/
Yes, you are now ready to develop Zeppelin notebooks and run them on local Spark!
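The daemon may need a few seconds before the web UI answers, so the browser can show an error at first. A small polling loop, sketched below, can wait for it. The helper name wait_for_url and the default of 30 attempts are assumptions made for this example; only standard curl options are used.

```shell
# Hypothetical helper: poll a URL once per second until it answers or we give up.
# Usage: wait_for_url <url> [attempts]
wait_for_url() {
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -s: quiet, --max-time: cap a single try
    if curl -fs --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

For example, wait_for_url http://localhost:8080/ && open http://localhost:8080/ opens the notebook UI in the default browser as soon as it responds.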
BTW, you stop the Zeppelin daemon the same way, just with stop as the argument. You could also make a symbolic link to the Zeppelin daemon script in your ~/bin directory using ln -s, and then start and stop it without going to the directory itself.
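The symbolic link could be created along these lines. This is a sketch, and the helper name link_into_bin is invented for the example. The link target should be an absolute path, since a relative one would be resolved from ~/bin and point nowhere.

```shell
# Hypothetical helper: symlink an executable into a bin directory (default: ~/bin).
# Usage: link_into_bin <absolute-path-to-script> [bin-directory]
link_into_bin() {
  target=$1
  bindir=${2:-$HOME/bin}
  mkdir -p "$bindir"
  # -s makes a symbolic link, -f replaces an existing link of the same name
  ln -sf "$target" "$bindir/$(basename "$target")"
}
```

From the Zeppelin checkout, link_into_bin "$PWD/bin/zeppelin-daemon.sh" would then allow zeppelin-daemon.sh start from any directory, assuming ~/bin is on the PATH and the script can locate its own directory when run through a link.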
Configuring Apache Zeppelin

At the time of writing, the dev version seems to work :-)
Spark

We'll get to this...