Created October 3, 2014 16:39
The basic processing pipeline for Python.
"metadata": {
"name": "",
"signature": "sha256:49da2f337895f3f82db401d063148d8101a5a7a881b7a256ec368b47f70d8b47"
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"The Python \"Pipeline\"\n",
"The 'pipeline' is the way we go from raw input to processed output. In many cases, the raw input is in a series of files on your hard drive, and the output will be a csv file.\n",
"The two most common ways to query for files from within Python are `glob.glob` and `os.walk`. `glob` matches Unix shell-style wildcard patterns, much like `ls *.log`, while `os.walk` crawls a directory and all of its subdirectories for files.\n",
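As a rough sketch of the difference, the snippet below builds a small temporary directory (the file names are invented for illustration) and lists `.log` files both ways. It is written for Python 3, although the same calls exist in Python 2.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree so the example is self-contained.
# These file names are hypothetical and not part of the original data.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'archive'))
for name in ['a.log', 'b.log', 'notes.txt', os.path.join('archive', 'c.log')]:
    open(os.path.join(root, name), 'w').close()

# glob.glob matches a shell-style pattern within a single directory.
top_level_logs = sorted(glob.glob(os.path.join(root, '*.log')))

# os.walk visits the starting directory and every subdirectory beneath it.
all_logs = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        if filename.endswith('.log'):
            all_logs.append(os.path.join(dirpath, filename))
all_logs.sort()

print(len(top_level_logs))  # 2: a.log and b.log
print(len(all_logs))        # 3: also finds archive/c.log
```

If the files all sit in one folder, `glob.glob` is usually the shorter option; `os.walk` is worth the extra loop when the data is spread across nested folders.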
"My favorite way to write a csv from Python is using the `csv.DictWriter` tool. It takes a Python dictionary and treats it as a row of a csv file. Pretend that a dictionary is called `row` and the columns are `c0`, `c1`, etc. In Python, we will store data as `row['c0'] = exampledata`. To make a `DictWriter`, we will need to provide an open file object to write to, as well as the column names of the csv file.\n",
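A minimal sketch of that idea follows, assuming Python 3. The column names `c0` and `c1`, the values, and the `example.csv` file name are placeholders chosen for illustration.

```python
import csv
import os
import tempfile

# Hypothetical column names and values, for illustration only.
fieldnames = ['c0', 'c1']
rows = [
    {'c0': 'exampledata', 'c1': 1},
    {'c0': 'moredata', 'c1': 2},
]

# Write to a placeholder file inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# newline='' is what the csv module recommends for files opened in Python 3.
with open(path, 'w', newline='') as fout:
    dw = csv.DictWriter(fout, fieldnames, lineterminator='\n')
    dw.writeheader()        # writes the column names as the first line
    for row in rows:
        dw.writerow(row)    # one dictionary becomes one csv row

# Read the file back to see what was written.
with open(path) as fin:
    written = fin.read()

print(written)
```

Using `with` closes the file automatically, which guarantees the rows are flushed to disk before anything tries to read them back.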
"Now, let's look at a full example that reads things in, finds a simple regular expression, and writes out the data."
"cell_type": "code",
"collapsed": false,
"input": [
"import glob\n",
"import csv\n",
"import re\n",
"fout = open('output.csv','w')\n",
"fieldnames = ['file','200','404']\n",
"dw = csv.DictWriter(fout,fieldnames)\n",
"# write the column names as the first line of the csv\n",
"dw.writeheader()\n",
"weblogs = glob.glob('/Users/astorer/Teaching/2015_programming/Data/weblogs/*.log*')\n",
"for w in weblogs:\n",
" f = open(w,'r')\n",
" # load the entire contents of the file into memory\n",
" logdata = f.read()\n",
" results = re.findall('HTTP/\\d+\\.\\d+\\\" (\\d+)',logdata)\n",
" f.close()\n",
" \n",
" countdict = dict()\n",
" for r in results:\n",
" if r in countdict:\n",
" countdict[r]+=1\n",
" else:\n",
" countdict[r] = 1\n",
" row = dict()\n",
" row['file'] = w\n",
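" # a status code may never appear in a given file, so default to 0\n",
" row['200'] = countdict.get('200', 0)\n",
" row['404'] = countdict.get('404', 0)\n",
" dw.writerow(row)\n",
"# close the output file so that every row is flushed to disk\n",
"fout.close()"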
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 32
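To see what the regular expression in the cell above picks out, here is a small sketch run on three made-up log lines. The line format is an assumption based on the pattern itself (a quoted request ending in `HTTP/x.y"` followed by a status code). It also swaps the hand-written counting loop for `collections.Counter`, which is an alternative to the notebook's approach rather than what the notebook uses.

```python
import re
from collections import Counter

# Three invented log lines in Apache-style format, for illustration only.
logdata = (
    '1.2.3.4 - - [03/Oct/2014:16:39:00 -0400] "GET /index.html HTTP/1.1" 200 512\n'
    '1.2.3.4 - - [03/Oct/2014:16:39:01 -0400] "GET /missing HTTP/1.1" 404 128\n'
    '5.6.7.8 - - [03/Oct/2014:16:39:02 -0400] "GET /index.html HTTP/1.0" 200 512\n'
)

# Capture the digits that follow the protocol version and closing quote.
results = re.findall(r'HTTP/\d+\.\d+" (\d+)', logdata)

# Counter tallies each status code; a missing code counts as 0.
countdict = Counter(results)

print(results)           # ['200', '404', '200']
print(countdict['200'])  # 2
print(countdict['500'])  # 0
```

Because a `Counter` returns 0 for keys it has never seen, looking up a status code that does not occur in a file does not raise a `KeyError`.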
"cell_type": "code",
"collapsed": false,
"input": [
"language": "python",
"metadata": {},
"outputs": [
"metadata": {},
"output_type": "pyout",
"prompt_number": 41,
"text": [
"{'103': 2,\n",
" '200': 98655,\n",
" '206': 2124,\n",
" '301': 4923,\n",
" '302': 1828,\n",
" '304': 1563,\n",
" '403': 990,\n",
" '404': 13708,\n",
" '500': 46}"
"prompt_number": 41
"cell_type": "code",
"collapsed": false,
"input": [
"fout = open('output.csv','r')\n",
"for line in fout:\n",
" print line\n",
"language": "python",
"metadata": {},
"outputs": [
"output_type": "stream",
"stream": "stdout",
"text": [
"prompt_number": 33
"metadata": {}