alexstorer/pipeline.ipynb

## pipeline.ipynb
{
 "metadata": {
  "name": "",
  "signature": "sha256:49da2f337895f3f82db401d063148d8101a5a7a881b7a256ec368b47f70d8b47"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The Python \"Pipeline\"\n",
      "-----------------\n",
      "\n",
      "The 'pipeline' is the way we go from raw input to processed output.  In many cases, the raw input is in a series of files on your hard drive, and the output will be a csv file.\n",
      "\n",
      "The two most common ways to query for files from within python are `glob.glob` and `os.walk`.  `glob` tries to emulate the Unix `ls` command, while `os.walk` craws a directory for all subfiles.\n",
      "\n",
      "My favorite way to write a csv from Python is using the `csv.DictWriter` tool.  It takes a python dictionary and treats it as a row of a csv file.  Pretend that a dictionary is called `row` and the columns are `c0`, `c1`, etc.  In python, we will store data as `row[c0] = exampledata`.  To make a `DictWriter`, we will need to proide the location of the file, as well as the columns of the csv file.\n",
      "\n",
      "---------\n",
      "\n",
      "Now, let's look at a full example that reads things in, finds a simple regular expression, and writes out the data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import glob\n",
      "import csv\n",
      "import re\n",
      "\n",
      "fout = open('output.csv','w')\n",
      "fieldnames = ['file','200','404']\n",
      "dw = csv.DictWriter(fout,fieldnames)\n",
      "dw.writeheader()\n",
      "\n",
      "weblogs = glob.glob('/Users/astorer/Teaching/2015_programming/Data/weblogs/*.log*')\n",
      "\n",
      "\n",
      "for w in weblogs:\n",
      "    f = open(w,'r')\n",
      "    # load the entire contents of the file into memory\n",
      "    logdata = f.read()\n",
      "    results = re.findall('HTTP/\\d+\\.\\d+\\\" (\\d+)',logdata)\n",
      "    f.close()\n",
      "    \n",
      "    countdict = dict()\n",
      "    for r in results:\n",
      "        if r in countdict:\n",
      "            countdict[r]+=1\n",
      "        else:\n",
      "            countdict[r] = 1\n",
      "\n",
      "    row = dict()\n",
      "    row['file'] = w\n",
      "    row['200'] = countdict['200']\n",
      "    row['404'] = countdict['404']\n",
      "    dw.writerow(row)\n",
      "fout.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "countdict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 41,
       "text": [
        "{'103': 2,\n",
        " '200': 98655,\n",
        " '206': 2124,\n",
        " '301': 4923,\n",
        " '302': 1828,\n",
        " '304': 1563,\n",
        " '403': 990,\n",
        " '404': 13708,\n",
        " '500': 46}"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fout = open('output.csv','r')\n",
      "for line in fout:\n",
      "    print line\n",
      "fout.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "file,200,404\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140915,88775,10082\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140916,99968,15044\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140917,100206,14359\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140918,96831,13989\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140919,96998,13430\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140920,96628,12062\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140921,86208,5210\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140922,76706,9033\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140923,89607,11692\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140924,98655,13708\r\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 33
    }
   ],
   "metadata": {}
  }
 ]
}