drafts for making data 001 👶

making data 001

prereq to making data the datamade way for new hires/etl beginners

to-do

  • the essential bash
  • python patterns for custom transform code
  • tips of the iceberg for: postgresql, sql the language™, geospatial magic

Custom Transformations in Python

Getting started

While we strive to write as little custom code as possible, you will encounter datasets during your ETL endeavors that need filtering, mapping, or other transformations that are (1) super unreadable or (2) impossible in bash or SQL. In these instances, you'll need to write a processor – that is, "a script for a single operation, often written in Python."

Some examples include:

Converting header rows to a new column

Park,                          Park,Result,Location
Result,Location       =>       Park,Result,Location
Result,Location

Converting timestamps for machines to dates for humans

1163548800000 => '2006-11-14 00:00:00' (UTC)

Psst. Unfamiliar with epoch timestamps? Read up, they're neat.

Light analysis without a database or fancy data libraries

Park,Location,Result              
Park,Location,Result       =>       Park,Score,Number of Fixtures
Park,Location,Result
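
To make the second example concrete, here's a minimal sketch of an epoch-converting processor. It reads one epoch-in-milliseconds value per line from STDIN and writes human-readable UTC dates to STDOUT. (The script name is made up; everything else is plain standard library.)

import sys
from datetime import datetime, timezone

# one epoch-in-milliseconds value per line on STDIN
for line in sys.stdin:
    millis = int(line.strip())
    # divide by 1,000 to get seconds, then format as a UTC date
    stamp = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    sys.stdout.write(stamp.strftime('%Y-%m-%d %H:%M:%S') + '\n')

$ echo 1163548800000 | python epoch_to_date.py
2006-11-14 00:00:00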

While each of these scripts serves a different purpose, they also share some patterns. Let's enumerate them.

sys

Our guidelines stipulate that all custom Python scripts should read from STDIN (say "standard in") and write to STDOUT ("standard out"). This lets you conduct operations (like grabbing the columns you want with csvcut) and then apply your processor to the output, or run your processor and then conduct operations (like filtering the rows you want with csvgrep) on its output.

In Python you access STDIN and STDOUT with the sys module. sys.stdin and sys.stdout are "file-like" objects (you access and alter the content via read() and write() methods, respectively).

import sys

# "in" is a reserved word in Python, so call the variable something else;
# strip the trailing newline so it doesn't land at the front when reversed
raw = sys.stdin.read().strip()

# do something cool like reverse the string via extended slice
# https://docs.python.org/3/whatsnew/2.3.html#extended-slices
processed = raw[::-1]

sys.stdout.write(processed + '\n')
$ echo "this is cool" | python cool_demo.py
looc si siht

Go ahead, play around. Then get ready to never access STDIN and STDOUT this way again, because...

csv

csv is a built-in Python module that does 99% of the work of wrangling tabular data for you. It's also really well documented; check it out.

The basics

csv.reader(): Feed this a file-like object (like sys.stdin) and you get an iterable of all the rows in your CSV.

reader = csv.reader(sys.stdin)

csv.writer(): Also takes a file-like object (like sys.stdout). Use writer.writerow() to output your CSV without stressing over formatting ickies like double-quoting column values that contain commas.

writer = csv.writer(sys.stdout)

Hate your header row? Ditch it right after instantiating your reader, like so.

next(reader)

Need to alter your data?

for row in reader:
    # you can access your data by index (row[0]) or unpack it 
    # (use underscores to unpack values you don't care about)
    fish, dog, pony = row
    if fish == 'tuna salad': # tuna salad is gross
        fish = 'spicy tuna maki'
    # write it to stdout
    writer.writerow([fish, dog, pony])

Feeling dangerous yet? (If not, read up on DictReader and DictWriter objects, which you can use to decode unfriendly header and field values.)
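
For instance – and this is just a sketch, with made-up source column names like PARK_NM – DictReader and DictWriter let you swap ugly headers for friendly ones as rows pass through:

import csv
import sys

reader = csv.DictReader(sys.stdin)

# write friendlier headers on the way out
writer = csv.DictWriter(sys.stdout, fieldnames=['park', 'result'])
writer.writeheader()

for row in reader:
    writer.writerow({
        'park': row['PARK_NM'].title(),      # decode SHOUTING field values
        'result': row['INSPECTION_RESULT'],
    })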

itertools

Finally, itertools ships with Python and provides a ton of efficient ways to slice'n'dice iterables. As such, "Cool tricks in itertools" blog posts are a dime a dozen. That said, I really liked this one.

Take some time to peruse, bearing in mind that these can be applied as you iterate over the rows in your CSV – or to values in the rows themselves, since csv.reader feeds you rows as iterables.
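
For a taste, here's a hedged sketch of the "light analysis" idea from above using itertools.groupby. It assumes the park name lives in the first column, and uses nothing beyond the standard library:

import csv
import sys
from itertools import groupby
from operator import itemgetter

reader = csv.reader(sys.stdin)
next(reader)  # ditch the header

# groupby only groups *adjacent* rows, so sort on the key first
rows = sorted(reader, key=itemgetter(0))

writer = csv.writer(sys.stdout)
writer.writerow(['Park', 'Number of Fixtures'])

for park, group in groupby(rows, key=itemgetter(0)):
    writer.writerow([park, len(list(group))])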

The Essential Bash (for Working With Data)

In the beginning, there will be barriers to working with Makefiles, and 95% of those barriers will be understanding what the heck is going on.

Bash is a cryptic little language; interpreting it can feel like reading someone else's shorthand. But when you get it, it feels a little something like this.

This guide will cover a tiny subset of commands at surface level in order to get you on your way. Want more? Type man, then the command you want to research into your terminal – i.e. man ls – and away you go. (Use the up/down arrows to scroll, and q to exit.)

"The subtler arts of computing"

There's this interview entitled "The Evolutionary Argument Against Reality" in which a scientist who thinks our user interface for the world may be wildly different from the world itself provides an example that's salient to this guide:

There’s a metaphor that’s only been available to us in the past 30 or 40 years, and that’s the desktop interface. Suppose there’s a blue rectangular icon on the lower right corner of your computer’s desktop — does that mean that the file itself is blue and rectangular and lives in the lower right corner of your computer? Of course not... And yet the desktop is useful. That blue rectangular icon guides my behavior, and it hides a complex reality...

The shell removes that layer of graphical abstraction, bringing you closer to the "complex reality" of your computer. (Please excuse the logical error of assuming we can know the reality of our computers with any certainty.)

More to the point, employing bash to conduct routine transformations keeps you in compliance with our style guide, which stipulates that DataMade's data makers shall "prefer simple Unix tools to custom scripts."

Prerequisites

This walkthrough assumes you know the basics of how your files are organized and how to make your way around using bash commands like ls, cd, and pwd.

Useful trivia

Variables

To set a variable, enter the variable name (all caps by convention), an equals sign, and your desired value. Be sure not to include spaces on either side of the equals sign, and put any values that contain spaces in quotation marks.

$ MYVAR="foo foo foo"

To reference your variable later, put a dollar sign in front.

$ echo $MYVAR
foo foo foo

Subshells

Use a subshell, $(), to assign the output of some command to a variable, or to nest the output of some command inside another command.

MY_PNGS="$(ls *.png)" # set a variable equal to all the pngs in your current directory
csvstack $(ls *.csv) # stack all csvs in your directory into a single csv

Quotes

Any value containing spaces, file names included, will activate your script's gag reflex unless it's enclosed in quotation marks. Be careful which kind you choose, though: single quotes are interpreted literally, i.e., nothing inside them will be evaluated, while double quotes will expand variables and subshells.

$ echo '$(pwd)'
$(pwd)
$ echo "$(pwd)"
~/somedir/anotherdir/presentdir

Finding things

  • grep
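
A few starter invocations – the file name is a placeholder, and the flags are standard grep options:

$ grep -i humboldt results.csv    # print lines mentioning "humboldt", any case
$ grep -ic humboldt results.csv   # count matching lines instead of printing them
$ grep -r humboldt .              # search every file under the current directory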

Viewing the contents of files

  • cat
  • less
  • head
  • tail
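
Starter examples, with a placeholder file name:

$ cat results.csv           # dump the whole file to the terminal
$ less results.csv          # page through it (q to quit)
$ head -n 5 results.csv     # just the first five lines
$ head -n 1 results.csv     # peek at a CSV's header row
$ tail -n 5 results.csv     # just the last five lines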

Editing directories & files

  • touch
  • echo
  • mkdir
  • rm
  • cp
  • mv
  • perl
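
Starter examples – file and directory names are placeholders:

$ mkdir -p data                            # make a directory (-p: and any missing parents)
$ touch data/results.csv                   # create an empty file
$ echo "Park,Result" > header.csv          # write a line of text to a file
$ cp header.csv backup.csv                 # copy a file
$ mv backup.csv data/                      # move it
$ rm data/backup.csv                       # delete it (no trash can – be careful!)
$ perl -pi -e 's/Prak/Park/g' header.csv   # find-and-replace inside a file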

Chaining commands

  • redirects
  • pipes
  • semicolons
  • ampersands
  • backslashes
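
Starter examples, using the placeholder results.csv and a couple of the csvkit commands you'll meet below:

$ csvcut -c Park,Result results.csv > trimmed.csv   # > redirects STDOUT into a file
$ cat results.csv | csvgrep -c Result -m Pass       # | pipes one command into the next
$ mkdir -p data; cd data                            # ; runs commands in sequence, no matter what
$ mkdir -p data && cd data                          # && runs the second only if the first succeeds
$ csvcut -c Park,Result results.csv \
      | csvgrep -c Result -m Pass                   # \ continues a command on the next line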

Control flow
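
A taste of the two constructs you'll see most – looping and branching (note that the spaces inside the brackets are mandatory):

$ for f in *.csv; do echo "processing $f"; done     # loop over every csv in the directory

$ if [ -f results.csv ]; then echo "found it"; else echo "no results yet"; fi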

Command-line interfaces (CLIs) to know

  • wget
  • psql
  • csvkit
  • postgresql: intro to databases, intro to sql, psql the ide, psql the cli
  • postgis: basic geo queries
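
A quick, hedged taste of a few of these – the URL and database name are placeholders:

$ wget -O results.csv https://example.com/data/results.csv   # download a file
$ csvstat results.csv                                        # quick stats on every column
$ psql -d parksdb -c 'SELECT COUNT(*) FROM results;'         # run a query from the command line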