prereq to making data the datamade way for new hires/etl beginners
- the essential bash
- python patterns for custom transform code
- tips of the iceberg for: postgresql, sql the language™, geospatial magic
While we strive to write as little custom code as possible, you will encounter datasets during your ETL endeavors that need filtering, mapping, or other transformations that are (1) super unreadable or (2) impossible in bash or SQL. In these instances, you'll need to write a processor – that is, "a script for a single operation, often written in Python."
Some examples include:
Converting header rows to a new column
Park,
Result,Location    =>    Park,Result,Location
Result,Location          Park,Result,Location
Converting timestamps for machines to dates for humans
1163548800000 => '2006-11-14 18:00:00'
Psst. Unfamiliar with epoch timestamps? Read up, they're neat.
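In Python, that conversion is a couple of lines with the built-in datetime module. A quick sketch; note that the machine timestamp above is in milliseconds, and that an explicit timezone keeps the output unambiguous:

```python
from datetime import datetime, timezone

# epoch timestamps from machines are often in milliseconds;
# divide by 1,000 to get the seconds fromtimestamp() expects
stamp = 1163548800000

dt = datetime.fromtimestamp(stamp / 1000, tz=timezone.utc)
print(dt.strftime('%Y-%m-%d %H:%M:%S'))  # 2006-11-15 00:00:00 (UTC)
```

The '2006-11-14 18:00:00' above is the same instant rendered in US Central time; pass a different tzinfo (or none, for local time) to shift the rendering.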
Light analysis without a database or fancy data libraries
Park,Location,Result
Park,Location,Result => Park,Score,Number of Fixtures
Park,Location,Result
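Here's a minimal sketch of that kind of rollup using only builtins. The data, and the definition of "Score" (passing results per park), are invented for illustration; a real processor would read csv.reader(sys.stdin) instead of the stand-in string:

```python
import csv
import io
from collections import defaultdict

# stand-in data; a real processor would read from sys.stdin
raw = """Park,Location,Result
Humboldt,NW,pass
Humboldt,SE,fail
Palmer,N,pass
"""

tallies = defaultdict(lambda: {'fixtures': 0, 'score': 0})

for row in csv.DictReader(io.StringIO(raw)):
    tallies[row['Park']]['fixtures'] += 1
    if row['Result'] == 'pass':
        tallies[row['Park']]['score'] += 1

print('Park,Score,Number of Fixtures')
for park, tally in tallies.items():
    print(f"{park},{tally['score']},{tally['fixtures']}")
```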
While each of these scripts serves a different purpose, they also share some patterns. Let's enumerate them.
Our guidelines stipulate that all custom Python scripts should read from STDIN (say "standard in") and write to STDOUT ("standard out"). This allows you to: conduct operations (like grabbing the columns you want with csvcut), then apply your processor to the output; and/or run your processor, then conduct operations (like filtering the rows you want with csvgrep) on the output.
In Python you access STDIN and STDOUT with the sys module. sys.stdin and sys.stdout are "file-like" objects (you access and alter the content via read() and write() methods, respectively).
import sys

# `in` is a reserved word in Python, so read into a different name
data = sys.stdin.read().rstrip('\n')

# do something cool, like reverse the string via extended slice
# https://docs.python.org/3/whatsnew/2.3.html#extended-slices
processed = data[::-1]

sys.stdout.write(processed)
$ echo "this is cool" | python cool_demo.py
looc si siht
Go ahead, play around. Then get ready to never access STDIN and STDOUT this way again, because...
csv is a built-in Python module that does 99% of the work of working with tabular data for you. It's also really well documented; check it out.
csv.reader(): Feed this a file-like object (like sys.stdin) and you get an iterable of all the rows in your CSV.
reader = csv.reader(sys.stdin)
csv.writer(): Also takes a file-like object (like sys.stdout). Use writer.writerow() to output your CSV without stressing over formatting ickies like double-quoting column values that contain commas.
writer = csv.writer(sys.stdout)
Hate your header row? Ditch it right after instantiating your reader, like so.
next(reader)
Need to alter your data?
for row in reader:
    # you can access your data by index (row[0]) or unpack it
    # (use underscores to unpack values you don't care about)
    fish, dog, pony = row

    if fish == 'tuna salad':  # tuna salad is gross
        fish = 'spicy tuna maki'

    # write it to stdout
    writer.writerow([fish, dog, pony])
Feeling dangerous yet? (If no, learn up on DictReader and DictWriter objects, used here to decode unfriendly header and field values.)
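If you haven't met DictReader and DictWriter, here's a minimal sketch: they let you address columns by header name instead of position. The column names are made up, and io.StringIO stands in for sys.stdin/sys.stdout so the snippet runs anywhere:

```python
import csv
import io

# stand-ins for sys.stdin and sys.stdout (invented columns)
source = io.StringIO("Fish,Dog,Pony\ntuna salad,beagle,shetland\n")
sink = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
writer.writeheader()

for row in reader:
    # each row is a dict keyed by the header row
    if row['Fish'] == 'tuna salad':
        row['Fish'] = 'spicy tuna maki'
    writer.writerow(row)

print(sink.getvalue(), end='')
```

In a real processor you'd hand sys.stdin to DictReader and sys.stdout to DictWriter and skip the StringIO scaffolding.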
Finally, itertools ships with Python and provides a ton of efficient ways to slice'n'dice iterables. As such, "Cool tricks in itertools" blog posts are a dime a dozen. That said, I really liked this one.
Take some time to peruse, bearing in mind that these can be applied as you iterate over the rows in your CSV, or to values in the rows themselves, since csv.reader feeds you rows as iterables.
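One itertools staple worth singling out is groupby(), which batches consecutive items that share a key. The rows below are invented, as if they'd come from csv.reader; the gotcha to remember is that groupby only groups adjacent matches, so sort by the key first:

```python
from itertools import groupby
from operator import itemgetter

# invented rows, as if read from csv.reader
rows = [
    ['Palmer', 'pass'],
    ['Humboldt', 'fail'],
    ['Humboldt', 'pass'],
]

# groupby only batches *consecutive* matches, so sort by the key first
rows.sort(key=itemgetter(0))

counts = {park: len(list(group))
          for park, group in groupby(rows, key=itemgetter(0))}
print(counts)  # {'Humboldt': 2, 'Palmer': 1}
```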
In the beginning, there will be barriers to working with Makefiles, and 95% of those barriers will be understanding what the heck is going on.
Bash is a cryptic little language; interpreting it can feel like reading someone else's shorthand. But when you get it, it feels a little something like –
This guide will cover a tiny subset of commands at surface level in order to get you on your way. Want more? Type man, then the command you want to research, into your terminal (e.g. man ls) and away you go. (Use the up/down arrows to scroll, and q to exit.)
There's this interview entitled "The Evolutionary Argument Against Reality" in which a scientist who thinks our user interface for the world may be wildly different from the world itself provides an example that's salient to this guide:
There’s a metaphor that’s only been available to us in the past 30 or 40 years, and that’s the desktop interface. Suppose there’s a blue rectangular icon on the lower right corner of your computer’s desktop — does that mean that the file itself is blue and rectangular and lives in the lower right corner of your computer? Of course not... And yet the desktop is useful. That blue rectangular icon guides my behavior, and it hides a complex reality...
The shell removes that layer of graphical abstraction, bringing you closer to the "complex reality" of your computer. (Please excuse the logical error of assuming we can know the reality of our computers with any certainty.)
More to the point, employing bash to conduct routine transformations keeps you in compliance with our style guide, which stipulates that DataMade's data makers shall "prefer simple Unix tools to custom scripts."
This walkthrough assumes you know the basics of how your files are organized and can make your way around using bash commands like ls, cd, and pwd.
To set a variable, enter the variable name (all caps by convention), an equals sign, and your desired value. Be sure not to include spaces on either side of the equals sign, and put any values that contain spaces in quotation marks.
$ MYVAR="foo foo foo"
To reference your variable later, put a dollar sign in front.
$ echo $MYVAR
foo foo foo
Use a subshell, $(), to assign the output of some command to a variable, or to nest the output of some command inside another command.
MY_PNGS="$(ls *.png)" # set a variable equal to all the pngs in your current directory
csvstack $(ls *.csv) # stack all csvs in your directory into a single csv
Any value containing spaces, file names included, will activate your script's gag reflex unless it's enclosed in quotation marks. Be careful, though, which kind you choose: Single quotes are interpreted literally, i.e. nothing inside them will be evaluated. Double quotes will expand variables and subshells.
$ echo '$(pwd)'
$(pwd)
$ echo "$(pwd)"
~/somedir/anotherdir/presentdir