- Date: 2013-02-24
- Author: Gauden Galea
- URL: https://gist.github.com/gauden/5023638
- License: Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
This will be, to a great extent, a hands-on course. Participants should download and install the applications listed below in order to ensure a common operating environment.
I have selected a core set of applications, all of which are free and work on Linux, Windows, or Mac. It is best if you install all these in advance to ensure that we are not held up with download and installation pauses in class.
It is assumed that you will bring your own laptop to which you have administrator access and can install your own software. This exercise will equip you with a set of tools that will be useful long after the course.
We shall be using web-based interfaces a lot, and it is easier for me to assume that we are all using a common browser that can handle modern standards. I will be using Google Chrome in class. You may prefer Firefox, Safari, or Opera, and these should work much the same.
- Download and install Google Chrome
In data analysis, you will often need to use a text editor to work on data files or source code. Many standard word processors store their files in proprietary encoded formats and add a lot of formatting information to the text. In order to handle data and text files cleanly, a good text editor becomes an important part of your toolset. If you have a preferred text editor, stick with it. If you are new to the field, I suggest:
- Mac: TextWrangler
- Windows: Notepad++
- Linux: Kate
- On all three platforms, Sublime Text 2 is an excellent alternative. It is free to try but has a cost of 70 USD if you adopt it.
Many real-world datasets have a lot of imperfections, and cleaning them up by hand is tedious. We will try out OpenRefine, a free resource that until recently was called Google Refine, to help reduce the pain of data cleaning. The program describes itself thus:
OpenRefine is a power tool that allows you to load data, understand it, clean it up, reconcile it internally, and augment it with data coming from Freebase or other web sources. All with the comfort and privacy of your own computer.
The OpenRefine wiki page has instructions for installation, links to introductory screencasts, and detailed documentation.
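To give a feel for the kind of imperfection meant here, the following is a minimal sketch in Python of grouping labels that differ only in case, spacing, punctuation, or word order. It is loosely modelled on the key-collision idea behind OpenRefine's clustering feature, but it is not OpenRefine's implementation and is not needed for the course. The function names `fingerprint` and `cluster` and the sample values are assumptions made up for illustration.

```python
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a label to a normalized key (illustrative helper, not an OpenRefine API)."""
    # Lowercase the text and turn anything that is not a letter, digit,
    # or whitespace into a space.
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in value.lower()
    )
    # Split on whitespace, drop duplicate words, and sort them so that
    # word order no longer matters.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group raw values that share the same fingerprint."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return dict(groups)


# Hypothetical messy column of country names.
raw = ["United Kingdom", " united kingdom", "Kingdom, United",
       "Malta", "MALTA ", "France"]

clusters = cluster(raw)
for key, members in clusters.items():
    print(key, "->", members)
```

Each printed line shows one group of spellings that probably refer to the same thing. In OpenRefine you get a comparable result through the interface, without writing any code.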
R has become the workhorse of statistical computing. It has a steep learning curve for newcomers but we will learn just enough R to be able to make useful graphics and on the way provide an introduction to the environment. According to the R Project website:
R is a free software environment for statistical computing and graphics.
- Select a nearby download site, download the right version for your computer, and install R.
- Roger Peng has an excellent video introduction to installation of R on Windows and on Mac.
- If you are totally new to R, then you may find this online course extremely useful: TryR is painless and quickly completed -- ideal if you have time on the weekend before the course.
According to the RStudio website:
RStudio IDE is a powerful and productive user interface for R.
It is in fact a comprehensive collection of tools allowing newcomers a painless entry point to R and a one-stop computing environment for advanced work as well.
- Download RStudio and install it
- Roger Peng also has a video on installing RStudio on Mac
Some of the graphics you will produce will need further editing afterwards, and there are many high-end proprietary applications that you can use for this. For the course, we will use the standard free and open source application, Inkscape, which is available on all platforms.
We will use Dropbox to share folders and resources and avoid playing thumbdrive tag: Dropbox sign-up and install.
Once you start coding, you will find yourself needing to ask questions of a technical nature or to share experiences. It is useful to set yourself up with these websites in order to be able to ask, to answer, and to share experiences in statistical computing in general, and to share snippets of code:
- Stack Overflow: for asking and answering questions related to programming. Official tagline: "A language-independent collaboratively edited question and answer site for programmers."
- GitHub Gists: for sharing snippets of code and public texts (such as this page).
- Cross Validated: "a question and answer site for statisticians, data analysts, data miners and data visualization experts." The same account you use for StackOverflow.com can be linked to this community as well.
It is assumed that you already have a spreadsheet installed. If not, install LibreOffice -- it has an excellent spreadsheet component called Calc.
- A PDF version of this page is also available.
- In total, these programs represent a hefty download and this will take some time, so do not try to do it at the last minute or during the course as this will cause significant delays to the group.
- All the programs listed are free, but not all are open source.
- The choice of programs does not imply they are the "best" in any category -- they are simply the standard proposed for this course.
- All trademarks are the property of their respective owners.