- 10 Years of Git: https://www.atlassian.com/git/articles/10-years-of-git
- Atlassian Git Tutorial: https://www.atlassian.com/git/tutorials/setting-up-a-repository
- Knupp's Dev Styles: http://www.jeffknupp.com/blog/2013/11/15/supercharge-your-python-developers/
- Vim cast: http://vimcasts.org/episodes/
- Knupp's Vim: http://www.jeffknupp.com/blog/2013/12/04/my-development-environment-for-python/
Immediately Useful

• HIVE
  o Files in Linux tend to be stored as plain text without schemas
  o HIVE is a tool for defining schemas over text files, allowing SQL-like queries to manipulate the data
  o Basic queries are easy, but if you need to do things like data steps, you'll need to create user defined functions (see below)
  o Here's a free book: http://www.semantikoz.com/blog/the-free-apache-hive-book/

• R
  o Rhonda has set up R on one of the nodes in the Hadoop cluster, so R is a good choice for modeling/analysis
  o In our setup, R runs on a single server, so the modeling itself doesn't actually take advantage of Hadoop, which is used mostly for storage and data prep
  o You can learn R without a Hadoop environment
  o This four-week class is offered about once a month: https://www.coursera.org/course/rprog
  o There are tons of free resources on the web; here is a list of some of them: http://www.ats.ucla.edu/stat/r/

• Linux Operating System
  o Hadoop runs on Linux clusters
  o You'll want to know how to navigate the Linux operating system and run scripts from the command line
  o Both R and HIVE have web interfaces available (RStudio and HUE), but you'll want to run things from the command line to bridge the two together and automate workflows
  o Some basic bash scripting would be helpful (the consultant wrote some bash scripts, but we plan to translate these to Python eventually... see below)
  o Learn one standard text editor for the shell. Vim is my choice, and it is nice for programming; Emacs is another option, as is nano.
  o Free online tutorial for Vim here: http://www.openvim.com/
  o You can use virtual machines to run Linux on a Windows laptop, and you can even practice (limited commands) in your browser here: http://bellard.org/jslinux/
  o IT doesn't allow us to run virtual machines at work, but you can do this at home using free VM software like VirtualBox
  o Some free tutorials:
    http://ryanstutorials.net/linuxtutorial/
    http://cli.learncodethehardway.org/book/

• Try Hadoop on a laptop
  o Hortonworks provides a virtual machine sandbox with tutorials, and you can play around with it: http://hortonworks.com/products/hortonworks-sandbox/
  o We are currently on a Cloudera installation, but apparently we are moving to Hortonworks soon
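As a taste of the bash-to-Python translation planned above, here is a minimal sketch of driving a command-line workflow (a Hive data-prep step, then an R modeling step) from Python using the standard subprocess module. The script names prep_data.hql and fit_model.R are hypothetical placeholders, and the sketch assumes hive and Rscript are on the PATH of a cluster node.

```python
import subprocess

def build_pipeline(hql_script, r_script):
    """Return the command-line steps of a two-stage workflow:
    a batch Hive run for data prep, then an R script for modeling."""
    return [
        ["hive", "-f", hql_script],  # hive -f runs a saved HiveQL file in batch mode
        ["Rscript", r_script],       # Rscript runs an R program non-interactively
    ]

def run_pipeline(steps):
    """Run each step in order, stopping the workflow if any command fails."""
    for cmd in steps:
        subprocess.run(cmd, check=True)  # check=True raises on a non-zero exit

# On a cluster node you would kick it off with:
# run_pipeline(build_pipeline("prep_data.hql", "fit_model.R"))
```

Unlike a bash script, a workflow written this way can reuse Python's error handling, logging, and scheduling libraries as the automation grows.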
Likely Useful

• Spark and the BDAS
  o The Berkeley Data Analytics Stack (BDAS) is likely to become the most widely used set of tools in Hadoop
  o Spark is already the most heavily developed tool for Hadoop, and allows for very fast, in-memory processing
  o Spark SQL provides a SQL interface across many different Hadoop data storage types, including HIVE and eventually HBASE and all the other data stores we are considering
  o Spark MLlib is a machine learning library for large-scale, parallelized modeling
  o Spark can also be used to run lots of R models in parallel across nodes
  o Here's one upcoming Spark class: https://www.edx.org/course/introduction-big-data-apache-spark-uc-berkeleyx-cs100-1x

• PIG
  o An abstraction that makes parallelized analysis of very large data sets easy
  o Not really for statistics; more for counts, joins, filtering, etc.
  o Spark tools may make PIG obsolete, but for now it's a common tool
  o Tutorials:
    http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
    http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
    https://pig.apache.org/docs/r0.7.0/tutorial.html

• Python
  o Python can be used to control Hadoop and R workflows (it's a nice way to control/automate any tasks you would otherwise do at the command line)
  o Spark has a PySpark interface that allows you to issue Spark commands using Python
  o User defined functions extend the functionality of HIVE and PIG, and can be written in Python (or a few other languages)
  o I have a separate document full of Python resources on the H drive

• HBASE
  o Another database application with some advantages over HIVE
  o So far, we have pointed both HIVE and HBASE at our data
  o HBASE is less mature, so there are fewer tools to interface with it; however, Rhonda and Mark seem to be leaning toward making HBASE the primary way to access structured data in Hadoop
  o Tutorials: http://gethue.com/category/tutorial/
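To give a taste of the PySpark style mentioned above, here is a minimal sketch written as plain Python functions so it runs without a cluster; the commented lines show how the same parse function would be handed to a (hypothetical) SparkContext named sc. The file path and field layout are made up for illustration.

```python
def parse_row(line):
    """Split a comma-separated record into a (key, value) pair with a float value."""
    key, value = line.split(",")
    return key, float(value)

def total_by_key(pairs):
    """Plain-Python equivalent of Spark's reduceByKey with addition:
    sum the values for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
    return totals

# With a live SparkContext, the same parse function runs in parallel
# across the cluster's nodes:
# rdd = sc.textFile("hdfs:///data/sales.csv").map(parse_row)
# totals = rdd.reduceByKey(lambda a, b: a + b).collect()

rows = ["a,1.5", "b,2.0", "a,0.5"]
print(total_by_key(parse_row(r) for r in rows))  # → {'a': 2.0, 'b': 2.0}
```

This is the appeal of PySpark: the logic is ordinary Python, and Spark handles distributing it over the data.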
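The Python user defined functions mentioned above can be as simple as a script that reads rows from stdin and writes rows to stdout; Hive's TRANSFORM clause streams table rows through such a script as tab-separated text. A minimal sketch, where the column names (customer_id, amount) and the two-decimal rounding rule are hypothetical:

```python
# Sketch of a streaming script for Hive's TRANSFORM clause. Hive pipes each
# row to stdin as tab-separated fields and reads tab-separated rows back
# from stdout.
import sys

def process_line(line):
    """Parse one tab-separated row (customer_id, amount) and return the id
    with the amount formatted to two decimals (hypothetical columns)."""
    customer_id, amount = line.rstrip("\n").split("\t")
    return customer_id, "{:.2f}".format(float(amount))

if __name__ == "__main__" and not sys.stdin.isatty():
    for line in sys.stdin:  # Hive streams rows in; an empty stdin just ends
        print("\t".join(process_line(line)))
```

In HiveQL this would be invoked with something like ADD FILE followed by SELECT TRANSFORM(...) USING 'python script.py' AS (...) — the exact table and script names here are placeholders, not our setup.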