Helge Bjorland helgejo

## 0_reuse_code.js
// Use Gists to store code you would like to remember later on
console.log(window); // log the "window" object to the console

## import_data.r
#import data
training <- read.csv("data/adult.data", header = FALSE, na.strings = "?")

## fonts.txt
#Go-to fonts for projects

Georgia for a sophisticated serif
Helvetica for a clean and neutral design
Lato for a friendly and "natural" look
Raleway for a more modern geometric look

## git_proxy_commands
#Command to use:
git config --global http.proxy http://proxyuser:proxypwd@proxy.server.com:8080

#change proxyuser to your proxy user
#change proxypwd to your proxy password
#change proxy.server.com to the URL of your proxy server
#change 8080 to the proxy port configured on your proxy server

#If you decide at any time to reset this proxy and work without (no proxy):
#Commands to use:

## Sampling from large data
If the data is huge and can't be loaded because of RAM issues, there's a very simple way to sample it using streaming techniques. It consists of first randomly selecting the line numbers that will go into your sample, and then extracting those lines.

You can either do a regular random sample, or a random stratified sample if you have an output variable Y and want to keep the same distribution in your stratified sample.

Random sample
1/ Count the number of lines in your big file by reading it line by line; you now have nb_lines.
2/ Generate a list of random numbers between 1 and nb_lines, called for instance selected_lines, which correspond to the IDs of the lines you will select from your big file.
3/ Go through the original big data file again, select the lines whose line numbers match selected_lines, and write them to a new file.
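
The three steps above could be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `count_lines` and `sample_lines`, the `seed` parameter, and the UTF-8 encoding are assumptions made for illustration, and it assumes the line numbers should be distinct (sampling without replacement).

```python
import random


def count_lines(path):
    """Step 1: count lines by streaming the file, never loading it whole."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


def sample_lines(src_path, dst_path, sample_size, seed=None):
    """Copy a simple random sample of `sample_size` lines to dst_path.

    Returns nb_lines, the total number of lines in src_path.
    """
    nb_lines = count_lines(src_path)
    if sample_size > nb_lines:
        raise ValueError("sample_size exceeds the number of lines in the file")

    rng = random.Random(seed)
    # Step 2: draw distinct 1-based line numbers (assumption: no replacement).
    # A set keeps the membership test in step 3 cheap.
    selected_lines = set(rng.sample(range(1, nb_lines + 1), sample_size))

    # Step 3: stream the file a second time and keep only the selected lines.
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line_number, line in enumerate(src, start=1):
            if line_number in selected_lines:
                dst.write(line)
    return nb_lines
```

Only the selected line numbers are held in memory, so the cost grows with the sample size rather than the file size. A header row, if the file has one, is counted and sampled like any other line, so it would need to be handled separately.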

Stratified sample for a discrete output variable

## hierarchical-clustering-in-r
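# Hierarchical clustering on petal length and width (hclust defaults to complete linkage)
clusters <- hclust(dist(iris[, 3:4]))
plot(clusters)

# Cut the dendrogram into 3 clusters
clusterCut <- cutree(clusters, 3)

# Compare cluster assignments with the actual species
table(clusterCut, iris$Species)

# Repeat with average linkage
clusters <- hclust(dist(iris[, 3:4]), method = 'average')
plot(clusters)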

## gist:6346592b2f86fce6a91ef5fda7d87a6b
Useful data wrangling snippets: R to Python

The dplyr package in R makes data wrangling significantly easier.
The beauty of dplyr is that, by design, the options available are limited.
Specifically, a set of key verbs form the core of the package.
Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe.
While transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R.
The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).

dplyr is organised around six key verbs