Resources for someone who wants to learn to use UNIX-like systems from the command-line, with some focus on data analysis

Basic Tools for Data Analysis in a UNIX Environment

Author

Guilherme Freitas <guilherme@gpfreitas.net>


Overview and Definitions

This is not a tutorial, but there are links to tutorials and howtos here.

This document lists some classic and useful UNIX tools and references where you can learn more about them. If you are learning your way around the command line, this is a good place to start. By UNIX I mean a system like Linux, Mac OS X or FreeBSD. These systems are very similar, but not identical; watch out for discrepancies, and always consult the manual page (henceforth manpage) of the command you are using by typing man command at the command line. If you want to avoid discrepancies in your own code, try to restrict yourself to features specified in the POSIX standard (those will be the same on all UNIX-like systems with very, very high probability). [1]

I also decided to include references to some tools that are not classic UNIX tools, but that work well with the UNIX philosophy and are very useful for data analysis.

What is a shell?

The shell of your UNIX OS is the command line interpreter. It's the place where you type commands and see output. You can also run more complex programs in it, like a text editor. For more about this, see:

What is the exact difference between a 'terminal', a 'shell', a 'tty' and a 'console'?

In this document we will focus on the bash shell, though I will try to avoid features that are specific to bash. Ideally, everything here should also work in a plain sh shell.

Commands, utilities and programming languages

ls, grep and awk are all programs that can be called from the shell, listed here in increasing order of complexity: ls lists the files in a directory, grep matches patterns in files or streams of data, and awk is a programming language in its own right, oriented towards line- and record-wise data processing. The beauty of these tools is that you can weave them all together in a shell pipeline or a shell script.
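For a taste of how the pieces fit together, here is a small, made-up pipeline that combines all three; it assumes the usual ls -l layout in which the fifth column is the file size:

# Total size of the .csv files in the current directory:
# list the directory, keep only lines ending in .csv, sum the size column.
ls -l | grep '\.csv$' | awk '{ total += $5 } END { print total " bytes in .csv files" }'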

List of Useful UNIX Tools

These are essential. Check their man pages or do a web search for their functionality.

  • man: format and display the on-line manual pages
  • bash: GNU Bourne-Again SHell
  • cd: change working directory
  • ls: list directory contents
  • less: opposite of more. It also lets you page through files without loading them entirely into memory
  • cp, rm, mv: copy, remove and move files (respectively)

The following commands are useful for slicing, filtering, sorting and displaying data:

  • sort: sort lines of text files (even if your file does not fit in RAM!)
  • cut: cut out selected portions of each line of a file
  • grep: file pattern searcher
  • wc: word, line, character, and byte count
  • uniq: report or filter out repeated lines in a file
  • column: columnate lists
  • head: display first lines of a file
  • tail: display last lines of a file
  • tr: translate characters (great if you need to convert to/from tab characters!)

Note: with the tools above, you can already build a simple histogram. Suppose file.txt has two tab-separated columns and you want a quick-and-dirty histogram of the values in the second column. You can do this with:

cut -f2 file.txt | sort -n | uniq -c

Very important: the | symbol is the pipe operator. It connects the output of one command to the input of the next.

The output of the command above may look a bit ugly due to the lack of alignment; fix it by running instead:

cut -f2 file.txt | sort -n | uniq -c | column -t

If instead of file.txt you have a large gzipped file file.txt.gz, you can decompress it and pipe the decompressed data, as it becomes available, straight into the commands above:

gzip -dc file.txt.gz | cut -f2 | sort -n | uniq -c | column -t

At this point you might want to learn what standard input (stdin) and standard output (stdout) are. You should also learn what standard error (stderr) means. (Do a web search for bash input/output redirection.)
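Here is a quick, hypothetical taste of redirection (all the file names are placeholders):

# Send normal output (stdout) to results.txt and error messages (stderr) to errors.txt
sort -n data.txt > results.txt 2> errors.txt

# Append to a file instead of overwriting it
sort -n more_data.txt >> results.txt

# Read from a file via stdin (equivalent to: sort -n data.txt)
sort -n < data.txt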

Other useful commands:

  • find: walk a file hierarchy
  • cat: concatenate and print files
  • paste: merge corresponding or subsequent lines of files
  • join: relational database operator
  • comm: select or reject lines common to two files

Non-standard (meaning you might have to install it yourself), but useful:

  • pv: monitor the progress of data through a pipe

Sed and AWK

There are two classic utilities that are more complex than the ones listed above, but are also standard and worth knowing: sed and awk. The first is a stream editor, very useful for line-oriented substitutions, for example. The second is a line- and record-oriented programming language, very useful for more general (but still simple) data processing.
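A minimal, made-up example of each (the file name, delimiter and column numbers are just for illustration):

# sed: replace every occurrence of 'bread' with 'flour' on each line of input.txt
sed 's/bread/flour/g' input.txt

# awk: for a tab-separated file, print the second column of every line
# and the sum of the third column at the end
awk -F '\t' '{ print $2; total += $3 } END { print "total:", total }' input.txt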

You might want to check out the links below for examples of data analysis with some of these tools.

If you want to have a record of how long your shell commands are taking, preface them with the time command. For example, to sort a long file and time it:

time sort large_file.txt

If you want a record of how much data is being piped through your commands and an estimate of when they will finish, feed the data file through pv. For example, to replace 'bread' with 'flour' in a very large file, do:

pv very_large_file.txt | sed 's/bread/flour/g'

Useful UNIX references

On text editors

One usually spends a lot of time editing text in UNIX, be it source code or regular text. It pays off to learn one editor well (especially one of the two powerful classic ones, Vim and Emacs). It is also a good idea to be at least familiar with the other one. To get things done immediately, try nano, a simple and user-friendly text editor that is usually installed.

Building stuff

Building a final product requires building its components in an order that respects dependency relations (walls before windows, etc.). A program that manages this kind of build process is called a build tool. There are many out there, but by far the most well-known is the make utility. GNU make is the most widespread implementation. Again, you can restrict yourself to POSIX features to increase the chance that your code works across platforms.

You can use make to build software or documents, or to represent workflows like "create this folder, compile this code, then erase all the intermediate files". For example, a lot of software can be installed by just running make; make install. Or you could build a PDF from a TeX file with:

make mydocument.pdf

and then erase all the intermediate TeX files with:

make clean

make keeps track of changes in a target's dependencies: if you type make mydocument.pdf again, it will tell you that mydocument.pdf is up to date; if you edit the source mydocument.tex, though, running make mydocument.pdf will rebuild the PDF because make notices that a dependency changed. This is not magic: the dependencies and build instructions have to be encoded in a Makefile. You can learn how to do that from make's documentation or from a host of tutorials on the web.
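As a rough sketch (real TeX builds often need several passes and more cleanup patterns, so treat this as illustrative only), the Makefile behind the example above could look something like this; note that the indented command lines must start with a tab character:

.PHONY: clean

mydocument.pdf: mydocument.tex
	pdflatex mydocument.tex

clean:
	rm -f *.aux *.log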

Another well-known build tool is CMake. There are many others, like Maven in the Java world, and sbt for Scala projects. Some notable alternatives written in Python are doit, waf and SCons.

Unless you have a good reason to do otherwise (you will know if you do), it's probably best to just start with (and possibly stick with) make.

Fetching things from the network

The two main tools are curl and wget. curl is primarily a tool for making HTTP requests (though not only that), while wget is a tool for downloading files and mirroring websites.

If you want to download an entire website and all the files it points to, use wget; if you want to call a web API from the command-line, use curl. If all you need is to download a single file, either will do.
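A couple of hypothetical invocations (the URLs are placeholders):

# Mirror a website for offline browsing
wget --mirror --convert-links --page-requisites https://www.example.com/

# Call a web API and save the response; -L follows redirects, -o names the output file
curl -L -o response.json https://api.example.com/v1/items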

You can also use curl and wget for FTP in a limited way.

Networking is a world in itself and there are many useful tools available at the command line. For synchronizing local or remote files and directories, use rsync (my usual flags are -avz). Netcat (nc) is a very versatile tool for sending and receiving data over the network, and ifconfig is what you use to inspect your local network interfaces' configuration.
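For example (host names and paths are placeholders, and the exact netcat flags vary between implementations):

# Copy/update a local directory to a remote machine:
# -a preserves permissions and timestamps, -v is verbose, -z compresses in transit
rsync -avz data/ user@remote.example.com:/home/user/data/

# Quick one-off transfer with netcat: the receiver listens on a port ...
nc -l 9999 > received.txt
# ... and the sender pushes a file to it
nc receiver.example.com 9999 < file.txt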

Remote login, execution and file transfers

The following commands are useful when you need to work with machines other than the one in front of you:

  • ssh: OpenSSH SSH client (remote login program).
  • scp: secure copy (remote file copy program)
  • sftp: secure file transfer program

SSH is very handy when you work with machines other than your personal/work machine. Make sure you know how to use ssh-keygen and ~/.ssh/config to set up secure, passwordless ssh logins for your remote machines. When doing that, do use a passphrase for your private key! As of 2015-11-26, I use ssh-keygen -t rsa -b 4096 -C "guilherme@gpfreitas.net" to generate my keys. If you set things up right, you will only have to enter your passphrase once per OS session (this may require ssh-agent or ssh-add depending on the operating system; it's worth doing a web search here).
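A rough sketch of the setup (host names are placeholders; ssh-copy-id may not be installed everywhere, in which case you can append the public key to the remote ~/.ssh/authorized_keys by hand):

# Generate a key pair; do give it a passphrase when prompted
ssh-keygen -t rsa -b 4096 -C "you@example.com"

# Install the public key on the remote machine
ssh-copy-id user@remote.example.com

# A minimal ~/.ssh/config entry, so that plain "ssh myserver" works afterwards:
# Host myserver
#     HostName remote.example.com
#     User user
#     IdentityFile ~/.ssh/id_rsa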

Job control

Job control is not intrinsically related to ssh, but the two are often used together, because people tend to start tasks on a remote machine, log out, and come back later to see the results.

If you want to keep your commands running on a remote machine after you log out, you can use a terminal multiplexer: screen is the classic choice, but these days there aren't many excuses for not using tmux. It is newer, more actively maintained, easier to configure and has seen wide adoption. I hear it also has more features. Do a web search for "screen ssh" and "tmux ssh" to see how to use them to leave a job running on a remote machine and come back to it later. Terminal multiplexers have many other uses too.

Alternatively, you can wrap a command as nohup command &. Make a note of the process ID and check later whether it is still running (with jobs if you are still in the same shell session, or with ps otherwise). There is a lot more to job control in the shell, and you may want to search your manual pages or the web for documentation and tutorials.
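A sketch of both approaches (host, session and script names are made up):

# With tmux: start or re-attach to a named session on the remote machine,
# run the job there, detach with Ctrl-b d, and log out; the job keeps running.
ssh user@remote.example.com
tmux new -s myjob          # later: tmux attach -t myjob
./long_running_job.sh

# With nohup: note the process ID so you can check on it later
nohup ./long_running_job.sh > job.log 2>&1 &
echo $!                    # prints the process ID; write it down
ps -p 12345                # later: replace 12345 with the process ID printed above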

Plotting and Visualization

You probably already have software of choice for plotting, but let me try to make you consider GNU plotutils for certain jobs, especially if you want to quickly plot things coming out of a shell pipeline (see the quick sketch after the list below). It is very fast, and it will update the plot in real time as data is piped into it. Visualization is a huge area and it's worth doing your own search based on the environment you will use for plotting. For inspiration, check out the following projects:

  • ggplot2, an R package that implements a "grammar of graphics" (you may want to search the web for that term, books, etc.)
  • D3.js, a very flexible Javascript library for static or highly interactive data-driven documents.
  • Vega, a JSON-based declarative format for creating, saving, and sharing visualization designs.
  • Metapost, a drawing language that is very nice for generating vector graphics of technical diagrams and pictures. It integrates very nicely with TeX, and as such it is very good for generating beautifully typeset math and text as well as pictures.
  • Graphviz for visualizing graphs.
  • Processing for generative/algorithmic art.
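The promised GNU plotutils sketch, reusing the two-column file.txt from earlier (the available output formats depend on how plotutils was built; -T X would display the plot in a window instead):

# Plot the second column of file.txt against line number:
# awk pairs each value with its line number, graph draws the x y pairs.
cut -f2 file.txt | awk '{ print NR, $1 }' | graph -T png > plot.png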

Some notable omissions: vector drawing programs (like Inkscape or Illustrator), which are often useful, as well as any mapping tools (QGIS, ArcGIS, PostGIS, Leaflet.js, etc.).

A lot of visualization these days is done in the browser, using Javascript. D3 is one such tool, but there are many more. A newer trend is to leverage the power of graphics cards to drive these visualizations (see, for example, VisPy).

Storing data

Comma-separated values (CSV) files and JSON are the main ways of storing data in plain text files these days. You may also encounter variants like tab-separated files.

In addition, you may encounter YAML files, or XML files. I strictly prefer CSV, JSON and YAML to XML though.

The usual shell tools (awk, sed, grep, sort, uniq, etc.) are usually great for working with CSV files. However, some CSV files are not particularly well formed, and in those cases you will need more powerful tools, which usually come as libraries for different programming languages. From the shell, though, you may want to look at csvkit for CSV files and jq for JSON files.
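Two made-up examples (file and column names are placeholders):

# csvkit: pull out the 'price' column of a CSV file and summarize it
csvcut -c price sales.csv | csvstat

# jq: extract one field from every object in a JSON array
jq '.[].price' sales.json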

If you know or want to learn SQL and want a simple SQL database in a single file, use SQLite. There is no need to set up a database server, and for small datasets (less than, say, 2 gigabytes) you should be fine with SQLite.
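A small sketch of loading a CSV file and querying it from the sqlite3 command-line shell (names are made up, and the header handling of .import varies a bit between SQLite versions):

sqlite3 mydata.db <<'EOF'
.mode csv
.import sales.csv sales
SELECT region, COUNT(*), AVG(price) FROM sales GROUP BY region;
EOF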

For a lot more on command-line data processing tools, check out

http://datascienceatthecommandline.com/

Why Not Just Use Python or Perl?

I love Python, and you will probably like it too. You can do everything I am describing here in Python. It's not hard, and you can do anything from simple shell scripts to web services, web portals, scientific computing, data analysis, machine learning, etc.

The problem is that, for simple tasks, in my experience your code will be much longer and it won't be any faster than using standard UNIX tools. It is also not as easy to do simple things like switching directories and firing up a text editor from the Python prompt (unless you use something like IPython).

The upside of using a real programming language like Python is that you can do many more tasks, in a single "way" (not mixing various shell tools), preserving readability and ease of debugging. If you find yourself writing scripts that are "too long", or too hacky, or even too slow, maybe it's time to drop the classic UNIX tools and use a real programming language.

A lot of what I wrote here about Python applies to other languages. I know nothing about Perl, but don't let that discourage you; it's often used for the types of tasks that are just above the UNIX tools' comfort zone. Ruby is also very popular and natural, but its ecosystem is geared more towards web applications than data analysis. Lisp is a classic and extremely flexible family of languages where data and logic are intertwined; don't let the parentheses-heavy syntax discourage you. For a good example of a Lisp with amazing tooling, see Racket. Haskell sounds really cool, and I love the syntax and the functional way of thinking. Julia is a new programming language that is particularly suited for data and numerical work.

That said, my personal recommendation as of Nov 2015 is to start with Python. It's very beginner-friendly, but also has powerful features, and it's very versatile. It comes with an amazing standard library and a whole suite of libraries for all sorts of tasks (just search for scipy and pydata for the scientific computing and data analysis tools). It has great tooling (debuggers, documentation generators, testing frameworks, interpreters, notebook-like interfaces, etc.). Last but not least, the community is very diverse and very friendly, probably more so than any other sizable programming-language community out there.

Some personal picks from the Python world: the os module to handle system-level procedures in a portable way, plus the itertools module and maybe the functools module if you are dealing with large files. You will probably also want to read up on the csv and json modules. If you need to manipulate matrices or efficient arrays, use NumPy; for more scientific tools, use SciPy; for convex optimization, use CVXPY or CVXOPT; for data analysis, check out pandas and other references in http://pydata.org/. For graph theory, check out NetworkX. Or just take an everything-and-the-kitchen-sink approach and install Anaconda Python, Enthought Canopy or Sage.

Version Control

If your work involves writing a lot of plain text (like code, or any markup language like TeX or HTML), then you should learn to use a version control system (VCS). I would suggest git because it is the most common tool for open source projects (usually hosted on GitHub, but also see BitBucket and Gitlab). Here are some good resources:

  • The book Pro Git is available online and is an excellent resource. Like everything, there is a learning curve, but this one is well worth it!
  • If you are going to use Git at first to manage your individual projects, have a look at Everyday GIT With 20 Commands Or So. It will show you which commands are needed, and which ones are not.
  • Git has lots of documentation available at your fingertips. For example, you can see the same content as in the "Everyday Git..." link with man giteveryday. Try also man git and man gittutorial, and look at the "see also" sections in those manual pages.
  • Git has a very inconsistent interface but is conceptually not very hard to understand. For that reason, I strongly suggest the presentation Git Core Concepts by Ted Naleid. In particular, you will understand that git branches are just automatically moving references to certain commits. Once you understand the basic concepts (commits and references being the most fundamental), you can focus on those, create your aliases and not worry about a lot of the craziness of the interface. The presentation also collects some useful hints and aliases to put in your ~/.gitconfig. You can obtain similar information with man gitcore-tutorial.
  • It is a lot easier to manage an individual project with Git than group projects. That said, it may be worth adopting "group-style" workflows as early as possible, as you never know when other people will contribute to your project, or when you will have to contribute to a new one. A good starting point is the feature branch workflow (also known as the GitHub Flow workflow; a minimal command-line sketch of it follows this list). It is a very popular workflow for small teams that is very easy to use in individual projects but requires everyone to have write access to the master remote repository. In open source projects that may not be desirable, and something like the forking workflow may be a better fit. A very popular workflow in larger teams and enterprise environments is git flow. It is essentially the forking workflow with some pre-defined roles for maintenance branches, release branches, hotfix branches, etc. Here are the original explanation, a very succinct introduction, a great cheat sheet and the GitHub repo of the git flow tool.
  • If you want a remote home for your project, consider GitHub first, but also look at alternatives like Gitlab and BitBucket, especially if you want freely-hosted private repositories.
  • Make meaningful commits (use git stash for random stop/pause points in your development) and write good, descriptive commit messages.
  • Follow Tim Pope's notes on git commit messages unless you know what you are doing and have some reason to ignore his advice.
  • After you have used Git for a while, go and check these 19 tips for everyday git use.
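The promised sketch of a feature-branch cycle (branch and remote names are only examples):

git checkout -b my-feature      # create and switch to a feature branch
# ... edit and test ...
git add -A
git commit -m "Add my feature"
git push -u origin my-feature   # publish the branch, then open a pull request on GitHub
# after the pull request is reviewed and merged:
git checkout master
git pull origin master
git branch -d my-feature        # delete the merged local branch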

Other noteworthy but less-used distributed version control systems (DVCS) that are similar to Git are Mercurial, Bazaar, and, one of my favorites, Fossil.

Writing Text or Documenting Your Work

We all should document what we do. I guess a wiki is the right place to put general notes, but if you write code, the documentation should go with the code.

For the documentation, you will have a lot of choices of lightweight markup languages. I tend to use reStructuredText (reST) because it is heavily used by Python coders and because it's nice and powerful (if somewhat more complex and messy than the alternatives). I have used txt2tags before, and it was also very nice. Markdown is probably the most popular one and very simple (but also limited). I have heard very nice things about asciidoc and textile. GitHub automatically and beautifully renders documents written in Markdown, reST and asciidoc, so if you want a suggestion, I would stick with one of those three. Asciidoc can be used to publish books (some of the O'Reilly technical books were written in Asciidoc) and can export to DocBook, among other things. To document software, reST + Sphinx is a very nice combination in the Python world, but it can also be used elsewhere (C and C++). Different languages have different documentation practices, tools and traditions.

For writing beautifully rendered PDF reports, especially if they contain any mathematics or technical graphs, TeX is hard to beat (but consider asciidoc and reST). LaTeX is by far the most popular way of using TeX's power, but I have a special fondness for ConTeXt. You can find links to everything TeX at the TeX Users Group page. I have written technical documents with reST (with MathJax if the report will be read in a browser), and I hear the same can be done with asciidoc, but the results were not nearly as aesthetically pleasing as the ones I obtained with TeX-based tools.

Pandoc is a good tool for converting between various markup languages.
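For example (file names are placeholders; the PDF conversion assumes a working LaTeX installation):

# Convert a Markdown file to reStructuredText ...
pandoc -f markdown -t rst notes.md -o notes.rst
# ... and a reST file to PDF, via LaTeX
pandoc notes.rst -o notes.pdf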

For technical reports or interactive documents where the code actually matters, consider using IPython notebooks. Check out this gallery of IPython notebooks to see what I am talking about. If you search around the web, you will see that you can use other languages with IPython, like R, Julia, Scala and Ruby.

Hint: I forgot where I read this, but I find it very useful. If you are writing in a markup language, consider writing one sentence or clause per line. It will be a lot easier to find text snippets in your file (no risk of missing "on me" because "on" is on one line and "me" on another), and the plethora of line-oriented text tools that exist or can be built in UNIX will be that much easier and more pleasant to use. As a bonus, you will be able to spot long clauses immediately. If you additionally indent every line that is not the beginning of a sentence, you will also be able to spot long sentences! Finally, rewriting your sentences and finding what changed between versions (with the diff tool) will become much more natural.

Package Managers

Most tools mentioned here are present on most UNIX-like systems. However, some will have to be installed (for example, wget on a Mac, or most of the tools mentioned in the documentation section). As every program mentioned here is open source and freely available, you could download the source code and compile it yourself. In most cases there is a more practical solution, though: use a package manager.

A package manager allows you to install a program and all its dependencies with a single command.

If you are using Linux or FreeBSD you are probably already familiar with your package manager, but if you are on a Mac, you may not be. The most widely used solution among developers as of Nov 2015 is Homebrew. If Homebrew doesn't work well for you, you may want to check out MacPorts. If all you want is a few extra shell tools (like wget) for your Mac, you may consider using Rudix.
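Typical Homebrew usage looks like this (the package names shown are common formulas, but check brew search if in doubt):

brew install wget pv tmux     # install a few of the tools mentioned above
brew update && brew upgrade   # refresh the package list and upgrade installed packages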

On Windows, make sure you check out Cygwin. It provides a POSIX compatibility layer, so you can use a UNIX-like shell and almost all (if not all) of the tools mentioned here. In addition, it provides a bare-bones package manager.

Further References

The web has tons more references. If you want books, check out the catalogs of the usual technical publishers; they cover the vast majority of the books I find useful, though of course there are other good sources too.

O'Reilly, in particular, has a permanent deal where you buy 2 books and get the third one for free. You have to enter the promo code though (currently, OPC10).


  1. You can search for POSIX man pages in http://www.duckduckgo.com by, for example, searching !posix make for the man page of the make utility.
