Created
October 3, 2014 16:39
The basic processing pipeline for Python.
{ | |
"metadata": { | |
"name": "", | |
"signature": "sha256:49da2f337895f3f82db401d063148d8101a5a7a881b7a256ec368b47f70d8b47" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The Python \"Pipeline\"\n", | |
"-----------------\n", | |
"\n", | |
"The 'pipeline' is the way we go from raw input to processed output. In many cases, the raw input is in a series of files on your hard drive, and the output will be a csv file.\n", | |
"\n", | |
"The two most common ways to query for files from within python are `glob.glob` and `os.walk`. `glob` tries to emulate the Unix `ls` command, while `os.walk` craws a directory for all subfiles.\n", | |
"\n", | |
"My favorite way to write a csv from Python is using the `csv.DictWriter` tool. It takes a python dictionary and treats it as a row of a csv file. Pretend that a dictionary is called `row` and the columns are `c0`, `c1`, etc. In python, we will store data as `row[c0] = exampledata`. To make a `DictWriter`, we will need to proide the location of the file, as well as the columns of the csv file.\n", | |
"\n", | |
"---------\n", | |
"\n", | |
"Now, let's look at a full example that reads things in, finds a simple regular expression, and writes out the data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import glob\n", | |
"import csv\n", | |
"import re\n", | |
"\n", | |
"fout = open('output.csv','w')\n", | |
"fieldnames = ['file','200','404']\n", | |
"dw = csv.DictWriter(fout,fieldnames)\n", | |
"dw.writeheader()\n", | |
"\n", | |
"weblogs = glob.glob('/Users/astorer/Teaching/2015_programming/Data/weblogs/*.log*')\n", | |
"\n", | |
"\n", | |
"for w in weblogs:\n", | |
" f = open(w,'r')\n", | |
" # load the entire contents of the file into memory\n", | |
" logdata = f.read()\n", | |
" results = re.findall('HTTP/\\d+\\.\\d+\\\" (\\d+)',logdata)\n", | |
" f.close()\n", | |
" \n", | |
" countdict = dict()\n", | |
" for r in results:\n", | |
" if r in countdict:\n", | |
" countdict[r]+=1\n", | |
" else:\n", | |
" countdict[r] = 1\n", | |
"\n", | |
" row = dict()\n", | |
" row['file'] = w\n", | |
" row['200'] = countdict['200']\n", | |
" row['404'] = countdict['404']\n", | |
" dw.writerow(row)\n", | |
"fout.close()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 32 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"countdict" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"metadata": {}, | |
"output_type": "pyout", | |
"prompt_number": 41, | |
"text": [ | |
"{'103': 2,\n", | |
" '200': 98655,\n", | |
" '206': 2124,\n", | |
" '301': 4923,\n", | |
" '302': 1828,\n", | |
" '304': 1563,\n", | |
" '403': 990,\n", | |
" '404': 13708,\n", | |
" '500': 46}" | |
] | |
} | |
], | |
"prompt_number": 41 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"fout = open('output.csv','r')\n", | |
"for line in fout:\n", | |
" print line\n", | |
"fout.close()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"file,200,404\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140915,88775,10082\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140916,99968,15044\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140917,100206,14359\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140918,96831,13989\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140919,96998,13430\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140920,96628,12062\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140921,86208,5210\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140922,76706,9033\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140923,89607,11692\r\n", | |
"\n", | |
"/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140924,98655,13708\r\n", | |
"\n" | |
] | |
} | |
], | |
"prompt_number": 33 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
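The steps in the notebook above (find the files, count matches of a regular expression, write one csv row per file) can also be sketched as a small self-contained script. This is only an illustration under stated assumptions: it targets Python 3, the function names (`find_with_glob`, `find_with_walk`, `count_status_codes`, `write_counts`) are hypothetical, and the log lines are made up so that the script runs without the original weblogs folder. It also shows the difference between `glob.glob`, which here only sees one folder, and `os.walk`, which also descends into subfolders.

```python
import csv
import glob
import os
import re
import tempfile
from collections import Counter

# Same pattern as the notebook: the status code that follows the HTTP version
STATUS_RE = re.compile(r'HTTP/\d+\.\d+" (\d+)')


def find_with_glob(folder):
    """Return log files directly inside folder (no subfolders), sorted by name."""
    return sorted(glob.glob(os.path.join(folder, '*.log*')))


def find_with_walk(folder):
    """Return log files in folder and in every subfolder below it, sorted."""
    found = []
    for dirpath, dirnames, filenames in os.walk(folder):
        for name in filenames:
            if '.log' in name:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def count_status_codes(text):
    """Count how often each HTTP status code appears in the log text."""
    return Counter(STATUS_RE.findall(text))


def write_counts(paths, out_path, codes=('200', '404')):
    """Write one csv row per log file, with one column per code in codes."""
    fieldnames = ['file'] + list(codes)
    # newline='' is what the csv module documentation recommends in Python 3
    with open(out_path, 'w', newline='') as fout:
        dw = csv.DictWriter(fout, fieldnames)
        dw.writeheader()
        for path in paths:
            with open(path, 'r') as f:
                counts = count_status_codes(f.read())
            row = {'file': path}
            for code in codes:
                row[code] = counts[code]  # a Counter returns 0 for missing codes
            dw.writerow(row)


# Build a tiny made-up log tree so the sketch runs anywhere
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, 'archive'))
sample = ('1.2.3.4 - - [15/Sep/2014] "GET / HTTP/1.1" 200 512\n'
          '1.2.3.4 - - [15/Sep/2014] "GET /missing HTTP/1.1" 404 0\n'
          '1.2.3.4 - - [15/Sep/2014] "GET /a HTTP/1.1" 200 99\n')
for relpath in ['access.log-1', os.path.join('archive', 'access.log-2')]:
    with open(os.path.join(demo_dir, relpath), 'w') as f:
        f.write(sample)

top_level = find_with_glob(demo_dir)   # only the file in demo_dir itself
all_levels = find_with_walk(demo_dir)  # also the file inside archive/
out_csv = os.path.join(demo_dir, 'output.csv')
write_counts(all_levels, out_csv)

# Read the csv back as dictionaries to check what was written
with open(out_csv, newline='') as f:
    rows = list(csv.DictReader(f))
```

Using `collections.Counter` replaces the hand-written `if r in countdict` loop, and because a `Counter` returns 0 for a key it has never seen, a log file without any 404s still produces a valid row.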