prereq to making data the datamade way for new hires/etl beginners
- the essential bash
- python patterns for custom transform code
- tips of the iceberg for: postgresql, sql the language™, geospatial magic
While we strive to write as little custom code as possible, you will encounter datasets during your ETL endeavors that need filtering, mapping, or other transformations that are (1) super unreadable or (2) impossible in bash or SQL. In these instances, you'll need to write a processor – that is, "a script for a single operation, often written in Python."
Some examples include:
Converting header rows to a new column
Park,
Result,Location    =>    Park,Result,Location
Result,Location          Park,Result,Location
Converting timestamps for machines to dates for humans
1163548800000 => '2006-11-14 18:00:00'
Psst. Unfamiliar with epoch timestamps? Read up, they're neat.
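In Python, that conversion is a couple of lines with the built-in datetime module. A quick sketch; note that the machine timestamp above is in milliseconds, and that an explicit timezone keeps the output unambiguous:

```python
from datetime import datetime, timezone

# epoch timestamps from machines are often in milliseconds;
# divide by 1,000 to get the seconds fromtimestamp() expects
stamp = 1163548800000

dt = datetime.fromtimestamp(stamp / 1000, tz=timezone.utc)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2006-11-15 00:00:00 (UTC)
```

The '2006-11-14 18:00:00' above is the same instant rendered in US Central time; pass a different tzinfo (or none, for local time) to shift the rendering.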
Light analysis without a database or fancy data libraries
Park,Location,Result
Park,Location,Result => Park,Score,Number of Fixtures
Park,Location,Result
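Here's a minimal sketch of that kind of rollup using only builtins. The data, and the definition of "Score" (passing results per park), are invented for illustration; a real processor would read csv.reader(sys.stdin) instead of the stand-in string:

```python
import csv
import io
from collections import defaultdict

# stand-in data; a real processor would read from sys.stdin
raw = """Park,Location,Result
Humboldt,NW,pass
Humboldt,SE,fail
Palmer,N,pass
"""

tallies = defaultdict(lambda: {'fixtures': 0, 'score': 0})

for row in csv.DictReader(io.StringIO(raw)):
    tallies[row['Park']]['fixtures'] += 1
    if row['Result'] == 'pass':
        tallies[row['Park']]['score'] += 1

print('Park,Score,Number of Fixtures')
for park, tally in tallies.items():
    print(f"{park},{tally['score']},{tally['fixtures']}")
```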
While each of these scripts serves a different purpose, they also share some patterns. Let's enumerate them.
Our guidelines stipulate that all custom Python scripts should read from STDIN (say "standard in") and write to STDOUT ("standard out"). This allows you to: conduct operations (like grabbing the columns you want with csvcut), then apply your processor to the output; and/or run your processor, then conduct operations (like filtering the rows you want with csvgrep) on the output.
In Python you access STDIN and STDOUT with the sys module. sys.stdin and sys.stdout are "file-like" objects (you access and alter the content via read() and write() methods, respectively).
import sys

# `in` is a reserved word in Python, so read into a different name
data = sys.stdin.read().rstrip('\n')

# do something cool, like reverse the string via extended slice
# https://docs.python.org/3/whatsnew/2.3.html#extended-slices
processed = data[::-1]

sys.stdout.write(processed)
$ echo "this is cool" | python cool_demo.py
looc si siht
Go ahead, play around. Then get ready to never access STDIN and STDOUT this way again, because...
csv is a built-in Python module that does 99% of the work of working with tabular data for you. It's also really well documented; check it out.
csv.reader(): Feed this a file-like object (like sys.stdin) and you get an iterable of all the rows in your CSV.
reader = csv.reader(sys.stdin)
csv.writer(): Also takes a file-like object (like sys.stdout). Use writer.writerow() to output your CSV without stressing over formatting ickies like double-quoting column values that contain commas.
writer = csv.writer(sys.stdout)
Hate your header row? Ditch it right after instantiating your reader, like so.
next(reader)
Need to alter your data?
for row in reader:
    # you can access your data by index (row[0]) or unpack it
    # (use underscores to unpack values you don't care about)
    fish, dog, pony = row

    if fish == 'tuna salad':  # tuna salad is gross
        fish = 'spicy tuna maki'

    # write it to stdout
    writer.writerow([fish, dog, pony])
Feeling dangerous yet? (If no, learn up on DictReader and DictWriter objects, used here to decode unfriendly header and field values.)
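If you haven't met DictReader and DictWriter, here's a minimal sketch: they let you address columns by header name instead of position. The column names are made up, and io.StringIO stands in for sys.stdin/sys.stdout so the snippet runs anywhere:

```python
import csv
import io

# stand-ins for sys.stdin and sys.stdout (invented columns)
source = io.StringIO("Fish,Dog,Pony\ntuna salad,beagle,shetland\n")
sink = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    # each row is a dict keyed by the header row
    if row['Fish'] == 'tuna salad':
        row['Fish'] = 'spicy tuna maki'
    writer.writerow(row)

print(sink.getvalue(), end='')
```

In a real processor you'd hand sys.stdin to DictReader and sys.stdout to DictWriter and skip the StringIO scaffolding.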
Finally, itertools ships with Python and provides a ton of efficient ways to slice'n'dice iterables. As such, "Cool tricks in itertools" blog posts are a dime a dozen. That said, I really liked this one.
Take some time to peruse, bearing in mind that these can be applied as you iterate over the rows in your CSV, or to values in the rows themselves, since csv.reader feeds you rows as iterables.
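One itertools staple worth singling out is groupby(), which batches consecutive items that share a key. The rows below are invented, as if they'd come from csv.reader; the gotcha to remember is that groupby only groups adjacent matches, so sort by the key first:

```python
from itertools import groupby
from operator import itemgetter

# invented rows, as if read from csv.reader
rows = [
    ['Palmer', 'pass'],
    ['Humboldt', 'fail'],
    ['Humboldt', 'pass'],
]

# groupby only batches *consecutive* matches, so sort by the key first
rows.sort(key=itemgetter(0))

counts = {park: len(list(group))
          for park, group in groupby(rows, key=itemgetter(0))}
print(counts)  # {'Humboldt': 2, 'Palmer': 1}
```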
In the beginning, there will be barriers to working with Makefiles, and 95% of those barriers will be understanding what the heck is going on.
Bash is a cryptic little language; interpreting it can feel like reading someone else's shorthand. But when you get it, it feels a little something like –
This guide will cover a tiny subset of commands at surface level in order to get you on your way. Want more? Type man, then the command you want to research, into your terminal (e.g. man ls) and away you go. (Use the up/down arrows to scroll, and q to exit.)
There's this interview entitled "The Evolutionary Argument Against Reality" in which a scientist who thinks our user interface for the world may be wildly different from the world itself provides an example that's salient to this guide:
There’s a metaphor that’s only been available to us in the past 30 or 40 years, and that’s the desktop interface. Suppose there’s a blue rectangular icon on the lower right corner of your computer’s desktop — does that mean that the file itself is blue and rectangular and lives in the lower right corner of your computer? Of course not... And yet the desktop is useful. That blue rectangular icon guides my behavior, and it hides a complex reality...
The shell removes that layer of graphical abstraction, bringing you closer to the "complex reality" of your computer. (Please excuse the logical error of assuming we can know the reality of our computers with any certainty.)
More to the point, employing bash to conduct routine transformations keeps you in compliance with our style guide, which stipulates that DataMade's data makers shall "prefer simple Unix tools to custom scripts."
This walkthrough assumes you know the basics of how your files are organized and can make your way around using bash commands like ls, cd, and pwd.
To set a variable, enter the variable name (all caps by convention), an equals sign, and your desired value. Be sure not to include spaces on either side of the equals sign, and put any values that contain spaces in quotation marks.
$ MYVAR="foo foo foo"
To reference your variable later, put a dollar sign in front.
$ echo $MYVAR
foo foo foo
Use a subshell, $(), to assign the output of some command to a variable, or to nest the output of some command inside another command.
MY_PNGS="$(ls *.png)" # set a variable equal to all the pngs in your current directory
csvstack $(ls *.csv) # stack all csvs in your directory into a single csv
Any value containing spaces, file names included, will activate your script's gag reflex unless it's enclosed in quotation marks. Be careful, though, which kind you choose: Single quotes are interpreted literally, i.e. nothing inside them will be evaluated. Double quotes will expand variables and subshells.
$ echo '$(pwd)'
$(pwd)
$ echo "$(pwd)"
~/somedir/anotherdir/presentdir