Basic Tools for Data Analysis in a UNIX Environment
Author: Guilherme Freitas <firstname.lastname@example.org>
- Overview and Definitions
- List of Useful UNIX Tools
- Why Not Just Use Python or Perl?
- Version Control
- Writing Text or Documenting Your Work
- Package Managers
- Further References
This is not a tutorial, but there are links to tutorials and howtos here.
This document lists some classic and useful UNIX tools and references where
you can learn more about them. If you are learning your way around the command
line, this is a good place to start. By UNIX here I mean a system like Linux,
Mac OS X or FreeBSD. These systems are very similar, but not identical; watch
out for discrepancies and always consult the manual page (henceforth manpage)
of the command you are using by typing man command in the command line. If
you want to avoid discrepancies in your own code, try to constrain yourself to
features specified in the POSIX standard (those will be the same on all
UNIX-like systems with very, very high probability).
I also decided to include some references that are not really classic UNIX tools, but that work well with the UNIX philosophy and are very useful for data analysis.
The shell of your UNIX OS is the command line interpreter. It's the place where you type commands and see output. You can also run more complex programs in it, like a text editor. For more about this, see the tutorial links listed below.
In this document we will focus on the bash shell, and hopefully I'll avoid
features that are specific to bash. Ideally, I'd like everything that is
here to be compatible with the POSIX shell specification.
ls, grep and awk are all programs that can be called from the shell, listed in increasing complexity: ls lists the files in a directory, grep matches patterns in files or streams of data and awk is a programming language in its own right, oriented towards line/record-wise data processing. The beauty of those tools is that you can weave them all together in a shell pipeline or a shell script.
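As a tiny, self-contained sketch of that weaving (the demo directory and file contents are made up for illustration):

```shell
# make a small demo directory with two text files
mkdir -p demo && cd demo
printf 'hello\n' > a.txt
printf 'hi\n' > b.txt

# list the files, keep only the .txt entries, and sum their sizes with awk
ls -l | grep '\.txt$' | awk '{ total += $5 } END { print total, "bytes" }'
# prints: 9 bytes
```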
These are essential. Check their man pages or do a web search for their functionality.
man: format and display the on-line manual pages
bash: GNU Bourne-Again SHell
cd: change working directory
ls: list directory contents
less: opposite of more; lets you page through files without loading them entirely into memory
cp, rm, mv: copy, remove and move files (respectively)
The following commands are useful for data slicing, filtering, sorting and display
sort: sort lines of text files (even if your file does not fit in RAM!)
cut: cut out selected portions of each line of a file
grep: file pattern searcher
wc: word, line, character, and byte count
uniq: report or filter out repeated lines in a file
column: columnate lists
head: display first lines of a file
tail: display last lines of a file
tr: translate characters (great if you need to convert to/from tab characters!)
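For instance, tr can turn tab-separated data into comma-separated data (the two-line table below is made up for illustration):

```shell
# replace every tab character with a comma
printf 'name\tage\nada\t36\n' | tr '\t' ','
# prints: name,age
#         ada,36
```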
Note: with the tools above, you can already implement a simple histogram. Suppose file.txt has two columns and you want a quick-and-dirty histogram of the values in the second column. You can do this with:
cut -f2 file.txt | sort -n | uniq -c
Very important: the
| symbol is the pipe operator. It allows you to
chain the output of the previous command to the input of the following command.
The output of the command above may look a bit ugly due to the lack of alignment; fix it by running instead:
cut -f2 file.txt | sort -n | uniq -c | column -t
If instead of file.txt you had a large gzipped file file.txt.gz, you
could decompress the data and pipe the decompressed data, as it becomes available,
straight into the commands above:
gzip -dc file.txt.gz | cut -f2 | sort -n | uniq -c | column -t
At this point you might want to learn what standard input (stdin) and
standard output (stdout) stand for. You might as well learn what
redirection (with operators like > and >>)
means. (Do a web search for bash input output redirection.)
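A minimal sketch of the redirection operators, using a throwaway out.txt file: > writes a command's stdout to a file (overwriting it), >> appends, and < feeds a file to a command's stdin.

```shell
echo "banana" > out.txt    # create/overwrite out.txt
echo "apple" >> out.txt    # append a second line
sort < out.txt             # read out.txt on stdin and sort it
# prints: apple
#         banana
```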
Other useful commands:
find: walk a file hierarchy
cat: concatenate and print files
paste: merge corresponding or subsequent lines of files
join: relational database operator
comm: select or reject lines common to two files
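As a small illustration of join (the two files below are made up, and must already be sorted on the join field), it merges lines that share a key, much like a relational database join:

```shell
# two sorted files keyed on the first column
printf 'a 1\nb 2\n' > left.txt
printf 'a x\nc y\n' > right.txt

# keep only keys present in both files, merging their columns
join left.txt right.txt
# prints: a 1 x
```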
Non-standard (meaning you might have to install it yourself), but useful:
pv: monitor the progress of data through a pipe
There are two classic utilities that are more complex than the ones listed
above, but are also standard and worth knowing: sed and awk. The first
is a stream editor, very useful for line-oriented substitutions, for example.
The second one is a line- and record-oriented programming language, very useful
for more general, but still simple, data processing.
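A quick taste of both, on a made-up two-line input: sed substitutes text on every line, while awk computes over fields.

```shell
# sed: replace a word wherever it appears
printf 'apple 3\nbanana 5\n' | sed 's/apple/cherry/'
# prints: cherry 3
#         banana 5

# awk: sum the second column across all lines
printf 'apple 3\nbanana 5\n' | awk '{ sum += $2 } END { print "total:", sum }'
# prints: total: 8
```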
You might want to check out the links below for examples of data analysis with some of these tools.
If you want to have a record of how long your shell commands are taking,
preface them with the
time command. For example, to sort a long file and time it:
time sort large_file.txt
If you want to have a record of how much data is being piped through your
commands and an estimate of when they will finish, pipe the data file with
pv. For example, to replace 'bread' with 'flour' in a very large file, do:
pv very_large_file.txt | sed 's/bread/flour/g'
- http://matt.might.net/articles/basic-unix/ (start here if you are completely new to the command line)
- http://www.grymoire.com/Unix/index.html (many tutorials on Unix-related things)
- http://www.ibm.com/developerworks/linux/library/l-bash/index.html (learn the bash shell through examples)
- http://mywiki.wooledge.org/BashFAQ (great reference for the Bash shell)
- http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics (multiple examples of multiple commands)
- http://www.stanford.edu/class/cs124/kwc-unix-for-poets.pdf (programmatic text processing with Unix tools)
- http://www.pement.org/awk/awk1line.txt (examples of awk one-liners)
- http://sed.sourceforge.net/sed1line.txt (examples of sed one-liners)
- http://unix.stackexchange.com/ (great question and answer website)
One usually spends a lot of time editing text in UNIX, be it source code or
regular text. It pays off to learn one editor well (especially one of the two
powerful classic ones, Vim and Emacs). It is also a good idea to be at least
familiar with the other one. To get things done immediately, try nano, a
simple and user-friendly text editor that is usually installed.
Building a final product requires building its components in an order that respects dependency relations (walls before windows, etc.). A program that helps you build things like that is a build tool. There are many out there, but by far the most well-known one is the make utility. GNU make is the most widespread implementation. Again, you can restrict yourself to POSIX features to increase the chance your code works across platforms.
You can use
make to build software, documents or to represent workflows
like "create this folder, compile this code, then erase all the intermediate
files". For example, a lot of software can be installed by just running
make install. Or you could build a PDF from a TeX file with:
make mydocument.pdf
and then erase all the intermediate TeX files with a target conventionally named clean:
make clean
make will keep track of changes in a target's dependencies, so, if you type
make mydocument.pdf again, it will tell you that mydocument.pdf is
up-to-date; if you edit the source
mydocument.tex, though, then running
make mydocument.pdf will rebuild the PDF because
make will notice that
a dependency changed. This is not magic: these dependencies and build
instructions have to be encoded in a
Makefile. You can learn how to do that
in make's documentation or a host of tutorials on the web.
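As a sketch of what such a Makefile might contain (mydocument.tex and the pdflatex command are assumptions; adapt them to your TeX setup), note that the indented command lines must begin with a tab character:

```make
# rebuild the PDF whenever the TeX source is newer
mydocument.pdf: mydocument.tex
	pdflatex mydocument.tex

# 'make clean' removes intermediate TeX files; 'clean' is a convention, not a built-in
clean:
	rm -f *.aux *.log *.toc

.PHONY: clean
```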
Unless you have a good reason to do otherwise (you will know if you do), it's
probably best to just start with (and possibly stick with) GNU make.
If you want to download an entire website and all the files it points to, use wget; if you want to call a web API from the command-line, use curl. If all you need is to download a single file, either will do.
You can also use curl and wget for FTP in a limited way.
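For illustration (example.com stands in for a real host, and the flags shown are common choices rather than requirements):

```shell
# download an entire site, following links but staying under /docs/
wget --mirror --no-parent https://example.com/docs/

# call a web API, printing the response body to stdout (-s silences the progress bar)
curl -s https://example.com/api/status
```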
Networking is a world in itself and there are _many_ useful tools available at
the command-line. For synchronizing directories or files with local or remote
machines, use rsync (my usual flags are
-avz). Netcat (nc)
is a very versatile tool for sending/receiving data over the network, and
ifconfig is what you use for figuring out your local network cards' configuration.
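For illustration (user@remote.example.com and port 9000 are made up; nc flags vary between implementations, so check your manpage):

```shell
# synchronize a local directory to a remote one
# (-a archive mode, -v verbose, -z compress in transit)
rsync -avz project/ user@remote.example.com:project/

# netcat: on the receiving machine, listen on port 9000 and save what arrives...
nc -l 9000 > received.txt
# ...and on the sending machine, push a file to that port
nc remote.example.com 9000 < file.txt
```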
The following commands are useful when you need to work with machines other than the one in front of you:
ssh: OpenSSH SSH client (remote login program).
scp: secure copy (remote file copy program)
sftp: secure file transfer program
SSH is very handy when you work with machines other than your personal/work
machine. Make sure you know how to use public-key authentication to
set up secure, passwordless ssh logins for your remote machines. When doing
that, __do use a passphrase for your private key!__ As of 2015-11-26, I use
ssh-keygen -t rsa -b 4096 -C "email@example.com" to generate my keys.
If you set things up right, you will only have to enter your passphrase once
per OS _session_ (this may require using ssh-agent; the details depend
on the operating system, so it's worth doing a web search here).
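The whole setup can be sketched in three commands (the remote hostname is made up; ssh-copy-id appends your public key to the remote authorized_keys file):

```shell
# generate a key pair; do give it a passphrase when prompted
ssh-keygen -t rsa -b 4096 -C "email@example.com"

# install the public key on the remote machine
ssh-copy-id user@remote.example.com

# later logins should ask only for the key's passphrase, if anything
ssh user@remote.example.com
```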
Job control is not intrinsically related to ssh, but it is often used in conjunction, because people tend to execute tasks in a remote machine, logout, and come back to see results.
If you want to keep your commands running in a remote machine after you log out, you can use a terminal multiplexer: screen is the classic choice, but these days there aren't many excuses for not using tmux. It is newer, more maintainable, easier to configure and has seen wide adoption. I hear it also has more features. Do a web search with "screen ssh" and "tmux ssh" to see how to use them to leave a job running in a remote machine and come back to it later. Terminal multiplexers have many other uses too.
Alternatively, you can wrap your command with nohup and send it to the
background: nohup command &. Make note of
the process ID that your process will have and check later if it's done with
jobs. There is a lot more to job control on the shell, and you may want to
search your manual pages or the web for documentation and tutorials.
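A minimal sketch (sleep 2 stands in for a real long-running command):

```shell
# start the job immune to hangups, log its output, and record its process ID
nohup sleep 2 > job.log 2>&1 &
echo "background PID: $!"
```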
You probably already have software of choice for plotting, but let me try to make you consider GNU plotutils for certain jobs, especially if you want to quickly plot things coming out of a shell pipeline. It is very fast and it will plot and update the plot in real time (so you can pipe data into it and it will plot the data as it comes). Visualization is a huge area and it's worth doing your own search based on what environment you will use for plotting. For inspiration, check out the following projects:
- ggplot2, an R package that implements a "grammar of graphics" (you may want to search the web for that term, books, etc.)
- Vega, a JSON-based declarative format for creating, saving, and sharing visualization designs.
- Metapost, a drawing language that is very nice for generating vector graphics of technical diagrams and pictures. It integrates very nicely with TeX, and as such it is very good for generating beautifully typeset math and text as well as pictures.
- Graphviz for visualizing graphs.
- Processing for generative/algorithmic art.
Some notable omissions: vector drawing programs (like Inkscape or Illustrator), which are often useful, as well as any mapping tools (QGIS, ArcGIS, PostGIS, Leaflet.js, etc.)
Comma-Separated-Values (CSV) files and JSON are the main ways of storing data in text files these days. You may also have variants like tab-separated files.
In addition, you may encounter YAML files, or XML files. I strictly prefer CSV, JSON and YAML to XML though.
The usual shell tools (awk, sed, grep, sort, uniq, etc.) are usually great for working with CSV files. However, some CSV files are not super well formed, and in those cases you will need more powerful tools that usually come as libraries for different programming languages. From the shell, though, you may want to look at csvkit for CSV files, and jq for JSON files.
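As a small illustration of jq (assuming it is installed), pulling one field out of a JSON object:

```shell
# -r prints the raw string instead of a JSON-quoted value
echo '{"name": "ada", "age": 36}' | jq -r '.name'
# prints: ada
```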
If you know or want to learn SQL and want to have a simple SQL database in a file, use SQLite. No need to setup a database server. For small datasets (less than, say, 2 gigabytes), you should be fine with SQLite.
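A minimal sketch, assuming the sqlite3 command-line tool is installed (demo.db and the table are made up):

```shell
# a whole SQL database in a single file, no server required
sqlite3 demo.db "CREATE TABLE t(x INTEGER);"
sqlite3 demo.db "INSERT INTO t VALUES (1),(2),(3);"
sqlite3 demo.db "SELECT sum(x) FROM t;"
# prints: 6
```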
There is a lot more out there on command-line data processing tools; it's worth doing your own search.
I love Python, and you will probably like it too. You can do everything I'm telling you about here in Python. It's not hard, and you can build everything from simple shell scripts to web services, web portals, scientific computing, data analysis, machine learning, etc.
The problem is that for simple tasks, in my experience, your code will be much longer and it won't be any faster than using standard Unix tools. It is also not as easy to do simple things like switching directories and firing up a text editor from the Python prompt (unless you use something like IPython).
The upside of using a real programming language like Python is that you can do many more tasks, in a single "way" (not mixing various shell tools), preserving readability and ease of debugging. If you find yourself writing scripts that are "too long", or too hacky, or even too slow, maybe it's time to drop the classic UNIX tools and use a real programming language.
A lot of what I wrote here about Python applies to other languages: I know nothing about Perl, but don't let that discourage you. It's often used for the types of tasks that are just above the UNIX tools' comfort zone. Ruby is also very popular and natural, but its ecosystem was built more towards web applications than data analysis. Lisp is a classic and extremely flexible family of languages where data and logic are intertwined. Don't let the parentheses-heavy syntax discourage you. For a good example of a Lisp with amazing tooling, see Racket. Haskell sounds really cool, and I love the syntax and the functional way of thinking. Julia is a new programming language that is particularly suited for data and numerical work.
That said, my personal recommendation as of Nov 2015 is to start with Python. It's very beginner-friendly, but also has powerful features. It's _very_ versatile. It comes with an amazing standard library and a whole suite of libraries for all sorts of tasks (just search scipy and pydata for the scientific computing and data analysis tools). It has great tooling (debuggers, documentation generators, testing frameworks, interpreters, notebook-like interfaces, etc.). Last but not least, the community is very diverse and very friendly, probably more so than any other sizable programming-language-related community out there.
Some personal picks from the Python world: the os module to handle
system-level procedures in a portable way, plus the
itertools module (and maybe the
functools module) if you are dealing with large files. You will
probably also want to read up on the
csv and json modules. If you need to
manipulate matrices or efficient arrays, use NumPy; for more scientific tools,
use SciPy; for convex optimization, use CVXPY or CVXOPT; for data analysis,
check out pandas and other references in http://pydata.org/. For graph
theory, check out NetworkX. Or just use an everything-and-the-kitchen-sink
approach: install Anaconda Python or Enthought Canopy or Sage.
If your work involves writing a lot of plain text (like code, or any markup language like TeX or HTML), then you should learn to use a version control system (VCS). I would suggest git because it is the most common tool for open source projects (hosted usually on GitHub, but also see BitBucket and Gitlab). Here are some good resources:
- The book Pro Git is available online and is an excellent resource. Like everything, there is a learning curve, but this one is well worth it!
- If you are going to use Git at first to manage your individual projects, have a look at Everyday GIT With 20 Commands Or So. It will show you which commands are needed, and which ones are not.
- Git has lots of documentation available at your fingertips. For example, you can see the same content as in the "Everyday Git..." link with man giteveryday. Try also man git and man gittutorial, and look at the "see also" sections in those manual pages.
- Git has a very inconsistent interface but is conceptually not very hard to
understand. For that reason, I strongly suggest the presentation Git Core
Concepts by Ted Naleid. In particular, you will understand that git
branches are just automatically moving references to certain commits. Once
you understand the basic concepts (commits and references being the most
fundamental), you can focus on those, create your aliases and not worry about
a lot of the craziness of the interface. The presentation also collects some
useful hints and aliases to put in your
~/.gitconfig. You can obtain similar information with man gitcore-tutorial.
- It is a lot easier to manage an individual project with Git than group projects. That said, it may be worth adopting "group-style" workflows as early as possible, as you never know when other people will contribute to your project, or when you are going to have to contribute to a new one. A good starting point is the feature branch workflow (also known as the GitHub Flow workflow). It is a very popular workflow for small teams that is very easy to use in individual projects but requires everyone to have write access to the master remote repository. In open source projects that may not be desirable, and something like the forking workflow may be a better fit. A very popular workflow in larger teams and enterprise environments is git flow. It is essentially the forking workflow with some pre-defined roles for maintenance branches, release branches, hotfix branches, etc. Here are the original explanation, a very succinct introduction, a great cheat sheet and the GitHub repo of the git flow tool.
- If you want a remote home for your project, consider GitHub first, but also look at alternatives like Gitlab and BitBucket, especially if you want freely-hosted private repositories.
- Make meaningful commits (use git stash for random stop/pause points in your development) and write good, descriptive commit messages.
- Follow Tim Pope's notes on git commit messages unless you know what you are doing and have some reason to ignore his advice.
- After you have used Git for a while, go and check these 19 tips for everyday git use.
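The everyday single-user cycle can be sketched like this (the repository name and identity are made up; normally you would set your identity once with git config rather than using -c flags per command):

```shell
# create a repository, stage a file, commit it, and inspect the history
git init -q demo && cd demo
echo "hello" > README.md
git add README.md
git -c user.name="Ada" -c user.email="ada@example.com" commit -q -m "Add README"
git log --oneline
```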
We all should document what we do. I guess the Wiki is the right place to put things, but if you write code, the documentation should go with the code.
For the documentation, you will have a lot of choices of light markup languages. I tend to use reStructuredText (reST) because it is heavily used by Python coders and because it's nice and powerful (if somewhat more complex and messy than other alternatives). I have used txt2tags before, and it was also very nice. Markdown is probably the most popular one and very simple (but also limited). I have heard very nice things about asciidoc and textile. I know GitHub automatically and beautifully renders documents written in Markdown, reST, and asciidoc, so if you want a suggestion, I would stick with one of those three. Asciidoc can be used to publish books (some of the O'Reilly technical books were written in asciidoc), and can export to Docbook, among other things. To document software, reST + Sphinx is a very nice combination in the Python world, but it can also be used elsewhere (C and C++). Different languages have different documentation practices, tools and traditions.
For writing beautifully-rendered PDF reports, especially if they contain any mathematics or technical graphs, TeX is hard to beat (but consider asciidoc and reST). LaTeX is by far the most popular way of using TeX's power, but I have a special fondness for ConTeXt. You can find links to everything TeX in the TeX User's Group page. I have written technical documents with reST (with MathJax if the report will be read in a browser), and I heard the same can be done with asciidoc, but the results were not nearly as aesthetically pleasant as the ones I obtained with TeX-based tools.
Pandoc is a good tool for converting between various markup languages.
For technical reports or interactive documents where the code actually matters, consider using the IPython notebook. Check out this gallery of IPython notebooks to see what I am talking about. If you search around the web, you will see that you can use other languages with IPython, like R, Julia, Scala and Ruby.
Hint: I forgot where I read this, but I find it very useful. If you are
writing in a markup language, consider writing one sentence/clause per line. It
will be a lot easier to find text snippets in your file (no risk of not finding
"on me" because "on" is in one line, and "me" is in another), and the plethora
of line-oriented text tools that exist or can be built in UNIX will
be that much easier and pleasant to use. As a bonus, you will be able to
immediately spot long clauses. If you additionally indent every line that is
not the beginning of a sentence, you will also be able to spot long sentences!
Finally, rewriting your sentences and finding what changed between versions
(with the diff tool) will become much more natural.
Most tools mentioned here are present on most Unix-like systems. However, some
tools will have to be installed (for example,
wget in a Mac or most of the
tools mentioned in the documentation section). As every program mentioned here
is open source and freely available, you could download the source code and
compile it yourself. In most cases there is a more practical solution though:
use a package manager.
A package manager allows you to install a program and all its dependencies with a single command.
If you are using Linux or FreeBSD you are probably already familiar with your
package manager, but if you are on a Mac, maybe you aren't. The most used
solution among developers as of Nov 2015 is Homebrew. If Homebrew doesn't
work well for you, you may want to check out MacPorts, even if all you want is
a few extra shell tools (like wget) for your Mac.
On Windows, make sure you check out Cygwin. It provides a POSIX compatibility layer, so you can use a UNIX-like shell and almost all if not all the tools mentioned here. In addition, it provides a bare-bones package manager.
The web has tons more references. If you want books, check out the classic technical publishers (O'Reilly, for example).
Of course there are other good publishers, but these cover the vast majority of the books I find useful.
O'Reilly has a permanent deal where you buy 2 books and get the third one for
free; you have to enter the promo code at checkout, though.
You can search for POSIX man pages in http://www.duckduckgo.com.