@crazyhottommy
Last active August 29, 2015 14:20
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### I am going to demonstrate how to use the IPython notebook bash_kernel to do reproducible research.\n",
"I can run command-line tools in the notebook and take notes along the way.\n",
"Let's go to the directory first."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 256\r\n",
"drwxr-xr-x+ 82 Tammy staff 2788 May 4 22:06 ..\r\n",
"drwxr-xr-x 7 Tammy staff 238 May 4 21:42 .\r\n",
"-rw-r--r--@ 1 Tammy staff 6148 May 1 22:21 .DS_Store\r\n",
"-rw-r--r-- 1 Tammy staff 4608 May 1 22:00 iris.csv\r\n",
"drwxr-xr-x 3 Tammy staff 102 May 1 09:40 play\r\n",
"-rw-r-----@ 1 Tammy staff 114348 Mar 29 22:25 pybamview_example_data.tar.gz\r\n",
"drwxr-xr-x@ 7 Tammy staff 238 Jul 11 2014 examples\r\n"
]
}
],
"source": [
"cd playground\n",
"ls -alt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to work with the famous iris dataset (iris.csv), which is also available in R.\n",
"First, look at the first several lines of the data."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sepal_length,sepal_width,petal_length,petal_width,species\r\n",
"5.1,3.5,1.4,0.2,Iris-setosa\r\n",
"4.9,3.0,1.4,0.2,Iris-setosa\r\n",
"4.7,3.2,1.3,0.2,Iris-setosa\r\n",
"4.6,3.1,1.5,0.2,Iris-setosa\r\n"
]
}
],
"source": [
"head -5 iris.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a better view of the data, use the csvlook command from [csvkit](https://csvkit.readthedocs.org/en/0.9.1/). csvkit uses a comma as the default delimiter; if you have a tab-delimited file, use the -t flag. There are many other useful commands; check the link above."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"|---------------+-------------+--------------+-------------+--------------|\r\n",
"| sepal_length | sepal_width | petal_length | petal_width | species |\r\n",
"|---------------+-------------+--------------+-------------+--------------|\r\n",
"| 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |\r\n",
"| 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |\r\n",
"| 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |\r\n",
"| 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |\r\n",
"| 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |\r\n",
"| 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |\r\n",
"| 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |\r\n",
"| 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |\r\n",
"| 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |\r\n",
"|---------------+-------------+--------------+-------------+--------------|\r\n"
]
}
],
"source": [
"cat iris.csv | head | csvlook"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is a comma-separated values file. We are going to look at some statistics using [datamash](https://www.gnu.org/software/datamash/examples/).\n",
"It is a very interesting GNU project, and I like it very much. It is very powerful and enables me to do some\n",
"very useful things together with awk and sed. The link has examples that work with gene annotation files.\n",
"\n",
"Let's look at the average sepal_length for each species. We can do this easily in R with dplyr, but I am going to\n",
"use the command line."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GroupBy(species),mean(sepal_length)\r\n",
"Iris-setosa,5.006\r\n",
"Iris-versicolor,5.936\r\n",
"Iris-virginica,6.588\r\n"
]
}
],
"source": [
"cat iris.csv | datamash -t \",\" -H -s -g 5 mean 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The -t \",\" option sets the field delimiter to a comma, the -H flag means iris.csv has a header line, the -s flag sorts the input before grouping, and -g 5 groups the data by column 5 (species); mean 1 then calculates the mean of the first column (sepal_length) for each group."
]
},
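{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough cross-check, the same per-species means can be sketched with plain awk. This is only a minimal sketch, assuming awk is installed and that iris.csv has a header row with the species name in column 5; it is not part of the datamash workflow above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sum column 1 per species (column 5), skipping the header line,\n",
"# then print species,mean for each group in sorted order.\n",
"# The sub() call strips a trailing carriage return in case the file uses CRLF line endings.\n",
"awk -F, 'NR > 1 && NF == 5 { sub(/\\r$/, \"\"); sum[$5] += $1; n[$5]++ }\n",
"         END { for (s in sum) printf \"%s,%.3f\\n\", s, sum[s] / n[s] }' iris.csv | sort"
]
},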
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another very useful tool that I came across is [q](https://github.com/harelba/q), which can execute SQL commands on plain text files. q assumes the file is space delimited. Use `-d \",\"` for comma-delimited and `-t` for tab-delimited files, respectively. In the command below, the -H flag tells q that the first line is a header, and `-` in the FROM clause reads from standard input."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.006,Iris-setosa\r\n",
"5.936,Iris-versicolor\r\n",
"6.588,Iris-virginica\r\n"
]
}
],
"source": [
"cat iris.csv | q -H -d \",\" \"SELECT AVG(sepal_length), species from - Group BY species\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We got the same means as with datamash."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The IPython bash_kernel can also display figures inline.\n",
"I am going to use [Rio](https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/Rio) to interact with R from the command line and print the figure using the display command, following the\n",
"post here: [IBash Notebook](http://jeroenjanssens.com/2015/02/19/ibash-notebook.html)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": []
},
{
"data": {
"image/png": ""
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"cat iris.csv | Rio -ge \"g+geom_point(aes(x=sepal_length,y=sepal_width,colour=species))\"| display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get this figure inline, which I think is awesome!\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### There are many limitations so far for the IBash kernel.\n",
"1. If a command does not execute correctly, the error persists and you cannot proceed. I have to restart the kernel to continue working in the same notebook.\n",
"2. It cannot display real-time output, so the `less` command will not work.\n",
"\n",
"Others can be found in the [post here](http://jeroenjanssens.com/2015/02/19/ibash-notebook.html).\n",
"\n",
"Nevertheless, the IBash Notebook gives a way to document your Linux commands as you run them and makes your research reproducible to some extent!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Bash",
"language": "bash",
"name": "bash"
},
"language_info": {
"codemirror_mode": "shell",
"file_extension": ".sh",
"mimetype": "text/x-sh",
"name": "bash"
}
},
"nbformat": 4,
"nbformat_minor": 0
}