---
title: "Intro to Python and the Command Line for an R User"
author: "Irene Steves"
date: "`r format(Sys.Date())`"
output: github_document
---

Some things are just faster on the command line. But if you're not used to it, the command line is a dark and scary place.  Thankfully, I had [amoeba](https://github.com/amoeba) to help me through:

1. Copying a file to another folder
2. Unzipping a *.tar.gz file
3. Running a Python (2) script
4. Exploring the data in Python (with `len`)
5. Turning all the variables generated by a script into a dataframe
6. Saving the dataframe to a csv

***

## 1. Copying a file to another folder

Start by accessing the datateam server from Terminal by typing `ssh USER_NAME@datateam.nceas.ucsb.edu`, followed by your password. Because the work runs on the server, anything already running on your personal computer can carry on without slowing down.

Then, navigate to the folder that contains the file of interest. These commands are pretty handy for that:

- `pwd` = print working directory (R translation: `getwd()`)
- `cd` = change directory (R translation: `setwd()`)
- `ls` = list directory contents (R translation: `list.files()`; note that R's own `ls()` lists objects in your environment, not files)

Since I started out in `/home/isteves`, I navigated to (and looked at) the folder of the Coopman dataset that I was interested in, like this:

```
cd /
cd home/visitor/Coopman
ls
```

Once you're in, it's easy to grab the file you want. Just use `cp {FROM} {TO}`:

```
cp DATA_PM_FlexPart.tar.gz ~/
```

## 2. Unzipping a *.tar.gz file

Navigate into the directory that the tar file was saved into using `cd` like before, and run:

```
tar xzvf DATA_PM_FlexPart.tar.gz
```

The flags in `xzvf` are short for `tar e{x}tract g{z}ip {v}erbose {f}ile {the file}`. Some ([amoeba](https://github.com/amoeba)) call it "shorthand magic" and indeed it is.
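For comparison, the same extraction can be done from inside Python with the standard-library `tarfile` module. This is only a sketch: it builds a tiny throwaway archive (`demo.tar.gz`, a made-up name) in a temporary folder so that it can run anywhere, rather than touching the real `DATA_PM_FlexPart.tar.gz`.

```python
import os
import tarfile
import tempfile

# Build a tiny throwaway archive so the example is self-contained
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "hello.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

archive = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive, "w:gz") as tar:   # w = write, gz = gzip
    tar.add(source, arcname="hello.txt")

# Rough equivalent of `tar xzvf demo.tar.gz`
outdir = os.path.join(workdir, "extracted")
os.makedirs(outdir)
with tarfile.open(archive, "r:gz") as tar:   # r = read, gz = gzip
    names = tar.getnames()                   # the names that "v" would list
    tar.extractall(outdir)

print(names)
```

On the command line, the one-liner is still the quicker route; the Python version is mostly useful when the unzipping is one step in a longer script.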

*Note:* If you're curious about a command line tool, you can check out the details using `man` (R translation: `?` or `help()`). For example, `man tar` tells you all the possible options you can use with `tar`. To quote [amoeba](https://github.com/amoeba): "A `man` page is like the shop manual for a car which is often overkill for [a beginner's] line of inquiry." An easier resource for deciphering command line code is [tldr](https://tldr.ostera.io/). Just search for the command to find the English translation!

## 3. Running a Python (2) script

There are some details of running python scripts that we skimmed over during our learning session (in particular, installing Python libraries), so the following assumes that you have the Python infrastructure ready to use.

Start by navigating into the folder of the script you want to check out. In my case, I used `cd DATA_TEST` to get to the folder that I had unzipped.

Then, use `ipython` to start a Python session. It should display something similar to the following:

```
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
Type "copyright", "credits" or "license" for more information.

IPython 2.4.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
```

The file I'm interested in [co-locates](https://en.wikipedia.org/wiki/Collocation_(remote_sensing)) a bunch of sensor files. In other words, it takes data from a lot of different sources and synthesizes them. Specifically, the file contains one function, `Read_PMFlexpart`, that compiles the accompanying txt files into 30+ variables. I can thus read the function into Python and save the variables it returns like this:

```
from Read_PMFlexPart import Read_PMFlexpart
LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000 = Read_PMFlexpart('./')
```

Notice that unlike R, Python can import a single function from a package/library: `from {PACKAGE/FILE} import {FUNCTION}`. Otherwise, think of it as the Python equivalent of R's `source()`.

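The second line in the code chunk above is a "multiple return": the function hands back all of its results at once, and Python unpacks them into the names on the left of the `=`. Base R has no direct equivalent; there you would typically return a list and pull out its elements.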
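To make the "multiple return" idea concrete, here is a minimal sketch using a made-up function, `read_two_columns` (hypothetical, not part of the Coopman script), with invented values. A Python function returns several values by packing them into a tuple, and the assignment unpacks that tuple into separate names:

```python
def read_two_columns():
    """Hypothetical stand-in for a reader like Read_PMFlexpart."""
    lat = [70.1, 70.2, 70.3]
    lon = [-150.0, -149.9, -149.8]
    # Writing "return lat, lon" returns a single tuple: (lat, lon)
    return lat, lon

# Unpack the tuple into two names in one step
LAT_DEMO, LON_DEMO = read_two_columns()

# Or keep the tuple whole and index into it, closer to working with an R list
both = read_two_columns()
first_item = both[0]
```

The number of names on the left has to match the number of values returned; otherwise Python raises a `ValueError`. That is why the real call needs all 33 names typed out.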

## 4. Exploring the data

Now that we have our variables, we can explore them a bit. To look at length, we can use the `len()` function (R translation: `length()`). To run `len()` on all our variables, we can use a list comprehension, which is a compact kind of for-loop:

```
vars = [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000]
[len(var) for var in vars]
```

Alternatively, you can try the Python equivalent of `*apply()`, which is `map()`.

```
map(len, [LIN, COL, LAT, LON, PZE, PZE_P, PZE_M, Std_PZE, RAD, PRES_P, PRES_M, PRES_PM, ALT_PM, TAU, TMP, PX, DATE, LWP, CFC, Sea_Ice, DOM, GFED, SHIP, FLAR, IND, AWB, Q850, P0, T_500, T_700, T_850, T_925, T_1000])
```

From either of the code chunks above, we see that all variables have 1619368 observations, except for *PZE_M*, which has 0. We'll want to take this into account for the next steps.
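A bare list of 33 numbers makes it hard to tell which variable is the odd one out. One way to pair each length with its name is sketched below on small made-up lists (the `demo` dictionary and its values are placeholders, not the real data):

```python
# Made-up stand-ins for a few of the variables; the real ones are much longer
demo = {
    "LIN": [1, 2, 3],
    "COL": [4, 5, 6],
    "PZE_M": [],
}

# Build a {name: length} dictionary with a dict comprehension
lengths = {name: len(values) for name, values in demo.items()}

# Collect the names whose length is zero
empty = sorted(name for name, n in lengths.items() if n == 0)

print(lengths)
print(empty)
```

One caveat if you ever move to Python 3: there, `map()` returns a lazy iterator rather than a list, so you would write `list(map(len, ...))` to see the values.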

## 5. Turning all the variables generated by a script into a dataframe

Python doesn't have a built-in way for handling data frames, so that's where the Pandas package comes in. If you have it installed, you can load it with `import pandas as pd`. We use `pd` as an abbreviation for Pandas, which makes for less typing when we call functions from it.  To create a data frame, for example, you can use:

```
output = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
```

If we were to translate the syntax to R, it would be something along the lines of:

```
output <- pd::DataFrame(x = c(1, 2, 3), y = c(4, 5, 6))
```

To save a single one of our variables from before to a data frame, you would want to run:

```
output = pd.DataFrame({'LIN': LIN})
```

If you use brackets around the second `LIN` (i.e., `[LIN]`), you would get a one-row data frame.

To try to get them all, you could manually type out all the variables using the above syntax. Alternatively, you could come up with a non-manual solution using some Python magic. Here's how we did it for our system:

```
var_names = [var for var in dir() if var.upper() == var and not var.startswith('_') and var != "PZE_M"]
var_dict = dict(zip(var_names, [eval(var) for var in var_names]))
output = pd.DataFrame(var_dict)
```

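The first line does some extra filtering to grab the relevant variables from our Python environment and exclude *PZE_M*. Note that the `var.upper() == var` test keeps only all-uppercase names, so the mixed-case variables (*Std_PZE* and *Sea_Ice*) are left out as well. The second line creates a dictionary that maps each variable name to its value, using Python tricks that I won't get into for now. Finally, the third line converts the variable dictionary to a dataframe, which is then saved to `output`.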
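The second line can be unpacked on a toy example. The sketch below uses a hand-built dictionary, `namespace`, as a stand-in for what `dir()` and `eval()` see in the interactive session. All names and values in it are made up, and it includes one mixed-case name to show how the filter treats it.

```python
# Hypothetical stand-in for the interactive session's variables
namespace = {
    "LIN": [1, 2, 3],
    "COL": [4, 5, 6],
    "PZE_M": [],
    "Std_PZE": [0.1, 0.2, 0.3],   # mixed case
    "_private": "ignored",
    "pd": "the pandas module would live here",
}

# Same filter as above: all-uppercase, no leading underscore, not PZE_M
var_names = sorted(
    name for name in namespace
    if name.upper() == name and not name.startswith("_") and name != "PZE_M"
)

# zip() pairs each name with its value; dict() turns the pairs into a dictionary
values = [namespace[name] for name in var_names]
var_dict = dict(zip(var_names, values))

print(var_names)
print(var_dict)
```

Looking values up in a dictionary, as done here, avoids `eval()`; in a live session, `globals()[var]` would play the same role as `namespace[name]`.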

## 6. Saving the dataframe to a csv

If you've made it this far, then the last step is a one-line breeze: `output.to_csv('output.csv')` (R translation: `write.csv()`).

Like in R, the output defaults to having numbered row names (`row.names = TRUE`). If you don't want to add them, you can use `index = False` like so: `output.to_csv("output.csv", index = False)`.

Check the folder you've been working in, and you should see it pop up in no time!