@isteves
Last active February 15, 2018 18:59
Intro to Python and the Command Line for an R User
---
title: "Intro to Python and the Command Line for an R User"
author: "Irene Steves"
date: "`r format(Sys.Date())`"
output: github_document
---
Some things are just faster on the command line. But if you're not used to it, the command line is a dark and scary place. Thankfully, I had [amoeba](https://github.com/amoeba) to help me through:
1. Copying a file to another folder
2. Unzipping a *.tar.gz file
3. Running a Python (2) script
4. Exploring the data in Python (with `len`)
5. Turning all the variables generated by a script into a dataframe
6. Saving the dataframe to a csv
***
## 1. Copy a file to another folder
Start by connecting to the datateam server from Terminal by typing `ssh USER_NAME@datateam.nceas.ucsb.edu`, followed by your password. Because the work runs on the server, it won't slow down whatever else is running on your personal computer.
Then, navigate to the folder that the file of interest is in. These commands are pretty handy for that:
- `pwd` = print working directory (R translation: `getwd()`)
- `cd` = change directory (R translation: `setwd()`)
- `ls` = list directory contents (R translation: `list.files()`; note that R's own `ls()` lists the objects in your environment, not files)
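Since the rest of this walkthrough ends up in Python, here is a rough sketch of the same three ideas using Python's standard `os` module. The scratch folder and `example.txt` are made up for illustration, so nothing real gets touched.

```python
import os
import tempfile

# Work in a throwaway directory so the sketch does not touch real files.
start_dir = os.getcwd()                  # like `pwd` / getwd()
scratch_dir = tempfile.mkdtemp()
open(os.path.join(scratch_dir, "example.txt"), "w").close()

os.chdir(scratch_dir)                    # like `cd` / setwd()
contents = os.listdir(".")               # like `ls` / list.files()
print(contents)

os.chdir(start_dir)                      # go back to where we started
```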
Since I started out in `/home/isteves`, I navigated to (and looked at) the folder of the Coopman dataset that I was interested in, like this:
```
cd /
cd home/visitor/Coopman
ls
```
Once you're in, it's easy to grab the file you want. Just use `cp {FROM} {TO}`:
```
cp DATA_PM_FlexPart.tar.gz ~/
```
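For comparison, the same `{FROM} {TO}` idea can be sketched in Python with `shutil.copy` from the standard library. The two temporary folders and the file name `example.tar.gz` are placeholders, not the real dataset.

```python
import os
import shutil
import tempfile

# Hypothetical stand-ins for the dataset folder and the home directory.
source_dir = tempfile.mkdtemp()
target_dir = tempfile.mkdtemp()

source_file = os.path.join(source_dir, "example.tar.gz")
with open(source_file, "w") as handle:
    handle.write("placeholder contents")

# shutil.copy(FROM, TO): when TO is a directory, the file keeps its name.
shutil.copy(source_file, target_dir)
copied_path = os.path.join(target_dir, "example.tar.gz")

print(os.listdir(target_dir))
```

As with `cp`, the original file stays where it was; only a copy lands in the target folder.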
## 2. Unzipping a *.tar.gz file
Navigate into the directory that the tar file was saved into using `cd` like before, and run:
```
tar xzvf DATA_PM_FlexPart.tar.gz
```
The letters after `tar` are shorthand: `tar e{x}tract g{z}ip {v}erbose {f}ile {the file}`. Some ([amoeba](https://github.com/amoeba)) call it "shorthand magic" and indeed it is.
*Note:* If you're curious about a command, you can check out the details using `man` (R translation: `?` or `help()`). For example, `man tar` tells you all the possible options you can use with `tar`. To quote [amoeba](https://github.com/amoeba): "A `man` page is like the shop manual for a car which is often overkill for [a beginner's] line of inquiry." An easier resource for deciphering command line code is [tldr](https://tldr.ostera.io/). Just search for the command to find the English translation!
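If the flags still feel opaque, it may help to see the same steps spelled out with Python's standard `tarfile` module. This is only a sketch: it first builds a tiny practice archive (`example.tar.gz` containing a made-up `notes.txt`) and then extracts it, which is roughly what `tar xzvf` does.

```python
import os
import tarfile
import tempfile

work_dir = tempfile.mkdtemp()

# Build a tiny archive to practice on (hypothetical file names).
member_path = os.path.join(work_dir, "notes.txt")
with open(member_path, "w") as handle:
    handle.write("hello from inside the archive\n")

archive_path = os.path.join(work_dir, "example.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(member_path, arcname="notes.txt")

# "r:gz" covers the x and z flags: read (extract from) a gzip-compressed archive.
extract_dir = os.path.join(work_dir, "extracted")
os.makedirs(extract_dir)
with tarfile.open(archive_path, "r:gz") as archive:
    member_names = archive.getnames()    # roughly what the v flag lists
    archive.extractall(extract_dir)

print(member_names)
```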
## 3. Running a Python (2) script
There are some details of running Python scripts that we skimmed over during our learning session (in particular, installing Python libraries), so the following assumes that you have the Python infrastructure ready to use.
Start by navigating into the folder of the script you want to check out. In my case, I used `cd DATA_TEST` to get to the folder that I had unzipped.
Then, use `ipython` to start a Python session. It should display something similar to the following:
```
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
Type "copyright", "credits" or "license" for more information.
IPython 2.4.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
```
The file I'm interested in [co-locates](https://en.wikipedia.org/wiki/Collocation_(remote_sensing)) a bunch of sensor files. In other words, it takes data from a lot of different sources and synthesizes them. Specifically, the file contains one function `Read_PMFlexpart` that will compile the accompanying txt files into 30+ variables. I can thus read the function into Python and save the variables they've defined like this:
```
from Read_PMFlexPart import Read_PMFlexpart
LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000 = Read_PMFlexpart('./')
```
Notice that, unlike R's `library()`, Python can import a single function from a package/library: `from {PACKAGE/FILE} import {FUNCTION}`. Here the "package" is just the script file, so you can think of this as the Python equivalent of R's `source()`.
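The second line in the code chunk above relies on a "multiple return": the function returns a tuple of values, and Python unpacks it into separate variables in a single assignment. Base R has no direct equivalent; an R function would typically return a list that you pick apart afterwards.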
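Here is a small sketch of both ideas on a toy scale. The function `summarize` is made up for illustration; `splitext` is a real function from Python's standard library that happens to return two values.

```python
from os.path import splitext   # import one function, not the whole module


def summarize(values):
    """Return three results at once; Python packs them into a tuple."""
    return min(values), max(values), len(values)


# Unpack the tuple into separate variables in a single assignment.
lowest, highest, count = summarize([3, 1, 4, 1, 5])
print(lowest, highest, count)

# Library functions can do the same: splitext returns (root, extension).
root, extension = splitext("DATA_PM_FlexPart.tar.gz")
print(root, extension)
```

The number of names on the left has to match the number of values returned, which is why the call to `Read_PMFlexpart` needs such a long list of variables.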
## 4. Exploring the data
Now that we have our variables, we can explore them a bit. To look at length, we can use the `len()` function (R translation: `length()`). To run `len()` on all our variables, we can use a list comprehension, which is a compact kind of for-loop:
```
# Named `variables` rather than `vars` to avoid hiding Python's built-in vars()
variables = [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000]
[len(var) for var in variables]
```
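Alternatively, you can try the Python equivalent of `*apply()`, which is `map()`. In Python 2, `map()` returns a list directly; in Python 3 you would wrap it in `list()` to see the values.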
```
map(len, [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000])
```
From either of the code chunks above, we see that all variables have 1619368 observations, except for *PZE_M*, which has 0. We'll want to take this into account for the next steps.
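A list of 33 bare numbers is hard to read, so one option is to pair each length with its variable name. The sketch below uses three tiny made-up lists in place of the real variables, and is written so that it behaves the same in Python 2 and Python 3.

```python
# Hypothetical stand-in data; the real variables hold 1619368 values each.
LIN = [1, 2, 3]
COL = [4, 5, 6]
PZE_M = []

names = ["LIN", "COL", "PZE_M"]
values = [LIN, COL, PZE_M]

# Pair each name with its length so the odd one out is easy to spot.
lengths = dict(zip(names, [len(value) for value in values]))
print(lengths)

# map() returns a list in Python 2 but a lazy iterator in Python 3,
# so wrapping it in list() gives the same result in both.
lengths_via_map = list(map(len, values))

# Names of the variables that turned out to be empty.
empty = [name for name in names if lengths[name] == 0]
print(empty)
```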
## 5. Turning all the variables generated by a script into a dataframe
Python doesn't have a built-in way of handling data frames, so that's where the Pandas package comes in. If you have it installed, you can load it with `import pandas as pd`. We use `pd` as an abbreviation for Pandas, which makes for less typing when we call functions from it. To create a data frame, for example, you can use:
```
import pandas as pd
output = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
```
If we were to translate the syntax to R, it would be something along the lines of:
```
output <- pd::DataFrame(x = c(1, 2, 3), y = c(4, 5, 6))
```
To save a single one of our variables from before to a data frame, you would want to run:
```
output = pd.DataFrame({'LIN': LIN})
```
If you put brackets around the second `LIN` (i.e., `[LIN]`), you would instead get a one-row data frame whose single cell holds the whole variable.
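To see the difference, here is a quick sketch that assumes pandas is installed and uses a short made-up list in place of the real `LIN`:

```python
import pandas as pd

# Hypothetical stand-in for one of the variables returned by the script.
LIN = [10, 20, 30]

one_column = pd.DataFrame({'LIN': LIN})     # one row per value
one_row = pd.DataFrame({'LIN': [LIN]})      # a single row holding the whole list

# .shape is (number of rows, number of columns)
print(one_column.shape)
print(one_row.shape)
```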
To try to get them all, you could manually type out all the variables using the above syntax. Alternatively, you could come up with a non-manual solution using some Python magic. Here's how we did it for our system:
```
var_names = [var for var in dir() if var.upper() == var and not var.startswith('_') and var != "PZE_M"]
var_dict = dict(zip(var_names, [eval(var) for var in var_names]))
output = pd.DataFrame(var_dict)
```
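The first line does some extra filtering to grab the relevant variables from our Python environment and exclude *PZE_M*. Note that the `var.upper() == var` test only keeps all-uppercase names, so mixed-case variables such as `Std_PZE` and `Sea_Ice` are skipped and would need to be added by hand. The second line creates a dictionary using Python tricks that I won't get into for now. Finally, the third line converts the variable dictionary to a dataframe, which is then saved to `output`.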
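If the `dir()` and `eval()` combination feels too magical, a more explicit variant might look like the sketch below. It is not what we ran: `session` is a made-up dictionary standing in for the interactive namespace (in a real session, `globals()` would play that role), and the values are tiny placeholders.

```python
# Hypothetical stand-in for the interactive session's variables.
session = {
    "LIN": [1, 2, 3],
    "Std_PZE": [0.1, 0.2, 0.3],
    "PZE_M": [],             # empty, so we leave it out
    "helper": "not data",
}

# The uppercase filter keeps LIN but skips the mixed-case Std_PZE.
filtered_names = [name for name in sorted(session)
                  if name.upper() == name
                  and not name.startswith('_')
                  and name != "PZE_M"]

# Listing the wanted names explicitly avoids that surprise, and a plain
# dictionary lookup replaces eval().
wanted_names = ["LIN", "Std_PZE"]
var_dict = {name: session[name] for name in wanted_names}

print(filtered_names)
print(sorted(var_dict))
```

The resulting `var_dict` can be handed to `pd.DataFrame()` in the same way as before. The trade-off is more typing in exchange for knowing exactly which columns end up in the data frame.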
## 6. Saving the dataframe to a csv
If you've made it this far, then the last step is a one-line breeze: `output.to_csv('output.csv')` (R translation: `write.csv()`).
Like in R, the output defaults to having numbered row names (`row.names = TRUE`). If you don't want to add them, you can use `index = False` like so: `output.to_csv("output.csv", index = False)`.
Check the folder you've been working in, and you should see it pop up in no time!
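To preview what the `index` argument changes without writing a file, you can call `to_csv()` with no file name, in which case pandas returns the CSV text as a string. A small sketch, assuming pandas is installed and using a toy data frame rather than the real one:

```python
import pandas as pd

output = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# With no file name, to_csv() returns the CSV text instead of writing a file.
with_index = output.to_csv()
without_index = output.to_csv(index=False)

print(with_index)      # first column holds the row numbers 0, 1, 2
print(without_index)   # only the x and y columns
```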