jknowles/MWEblogpost.Rmd

How to Ask for Help using R
========================================================

The key to getting good help with an R problem is to provide a minimal working
reproducible example (MWRE). Making an MWRE is easy in R, and it helps those
assisting you identify the source of the error and, ideally, send back corrected
code instead of sending you hunting for code that works. An MWRE needs the
following items:

- a minimal dataset that produces the error
- the minimal runnable code necessary to reproduce the error, run on the dataset
provided
- the necessary information on the packages used, the R version, and the system
- a `seed` value (set with `set.seed()`), if random properties are part of the code

Let's look at the tools available in R to help us create each of these components
quickly and easily.
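
The seed item is the easiest to forget, so here is a minimal sketch of what it buys you. The seed value `2024` and the names `first_draw` and `second_draw` are arbitrary choices for illustration, not part of any particular problem.

```{r seedsketch, comment=NA}
# Fix the random number generator so the draws below are repeatable
set.seed(2024)  # arbitrary value, chosen for illustration
first_draw <- rnorm(5, mean = 400, sd = 60)

# Resetting the same seed reproduces exactly the same values
set.seed(2024)
second_draw <- rnorm(5, mean = 400, sd = 60)

identical(first_draw, second_draw)
```

Anyone who runs your code with the same seed will see the same random numbers you did, which matters when the error only shows up for particular values.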

### Producing a Minimal Dataset

There are three distinct options here:

1. Use a built in R dataset
2. Create a new vector / data.frame from scratch
3. Output the data you are currently working on in a shareable way

Let's look at each of these in turn and see the tools R has to help us do this.

#### Built in Datasets

There are a few canonical built-in R datasets that are really attractive for use in
help requests.

- mtcars
- diamonds (from ggplot2)
- iris

To see all the available datasets in R, simply type: `data()`. To load any of
these datasets, simply use the following:

```{r, comment=NA}
data(mtcars)
head(mtcars) # to look at the data
```

This option works great for a problem where you know you are having trouble with
a command in R. It is not a great option if you are having trouble understanding
why a command you are familiar with won't work on your data.
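
As a rough sketch of what such a request could look like, the chunk below poses a made-up question about `tapply()` using nothing but `mtcars`. The question and the name `mpg_by_cyl` are hypothetical.

```{r builtinsketch, comment=NA}
# mtcars ships with R, so anyone can run this without extra files
data(mtcars)

# Hypothetical question: "How do I get the mean mpg for each number of cylinders?"
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
mpg_by_cyl
```

Because the data are already on every helper's machine, the whole example fits in a few lines.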

Note that for education data that is fairly "realistic", there are built in
simulated datasets in the `eeptools` package, created by Jared Knowles.

```{r eeptoolsdemo, message=FALSE, warning=FALSE, comment=NA}
library(eeptools)
data(stulevel)
names(stulevel)
```

#### Create Your Own Data

Inputting data into R and sharing it back out with others is really easy. Part of
the power of R is the ability to create diverse data structures very easily.
Let's create a simulated data frame of student test scores and demographics.

```{r createdata, comment=NA}
Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)

head(Data)
```

And, just like that, we have simulated student data. This is a great way to
evaluate problems with plotting data or with large datasets, since we can ask
R to generate a random dataset that is incredibly large if necessary. Before
relying on it, though, let's look at the relationship between our variables
using a quick plot:

```{r evalsimmeddata}
library(ggplot2)
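# qplot() is deprecated in recent ggplot2 releases, so use ggplot() directly
ggplot(Data, aes(x = mathSS, y = readSS, color = race)) + geom_point() + theme_bw()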
```

It looks like race is pretty evenly distributed and there is no relationship
between `mathSS` and `readSS`. For some applications this data is sufficient, but
for others we may wish for data that is more realistic.

```{r evalsimmeddata2, comment=NA}
table(Data$race)
cor(Data$mathSS, Data$readSS)
```
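
If the problem depends on the two scores being related, one possible way to get there is to build the reading score from the math score plus independent noise. This is only a sketch: the target correlation `rho = 0.7`, the seed, and the names `math_sim`, `read_sim`, `z_math`, and `z_read` are all assumptions made for illustration, not values taken from real student data.

```{r correlatedsketch, comment=NA}
set.seed(42)  # arbitrary seed so the example is repeatable
n <- 1000

# Draw math scores with the same mean and sd used above
math_sim <- rnorm(n, mean = 400, sd = 60)

# Mix standardized math scores with independent noise;
# rho is an assumed target correlation, not an estimate from real data
rho    <- 0.7
z_math <- (math_sim - 400) / 60
z_read <- rho * z_math + sqrt(1 - rho^2) * rnorm(n)

# Put the reading scores back on their own scale
read_sim <- 370 + 58.3 * z_read

cor(math_sim, read_sim)
```

The sample correlation will land near `rho` rather than exactly on it, since the noise is random.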


#### Output Your Current Data

Sometimes you just want to show others the data you are using and find out why
your code won't work on it. The best practice here is to make a subset of the data
you are working on, and then output it using the `dput` command.

```{r dataoutput, comment=NA}
dput(head(stulevel, 5))

```

The resulting code can be copied and pasted into an R terminal and it will
automatically rebuild the dataset exactly as described. Note that in the above
example, it might have been better if I had first cut out all the variables
unnecessary for my problem before I executed the `dput` command. The goal is to
share only the data necessary to reproduce your problem.
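
Here is a small sketch of that trimming step. It uses `mtcars` so that it runs without any extra packages; the choice of columns and the names `small` and `rebuilt` are just for illustration.

```{r dputsubsetsketch, comment=NA}
# Keep only the columns and rows the problem actually needs
small <- head(mtcars[, c("mpg", "cyl", "wt")], 5)

# Print R code that rebuilds `small`; this output is what goes in the help request
dput(small)

# Sanity check: evaluating the deparsed code should give back the same data
rebuilt <- eval(parse(text = paste(deparse(small), collapse = "\n")))
all.equal(rebuilt, small)
```

The same pattern should carry over to your own data: select the relevant columns first, then pass the result to `dput`.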

Also note that we never send **student level** data from LDS over e-mail,
as this is not secure. For work on student level data, it is better to either
simulate the data or to use the built in simulated data from the `eeptools`
package to run your examples.

#### Anonymizing Your Data

It may also be the case that you want to `dput` your data, but you want to keep
the contents of your data anonymous. A Google search came up with a decent
looking function to carry this out:

```{r anonymizedata, comment=NA}
# Replace column names and values with generic stand-ins
anonym <- function(df) {
    # Generic column names: A..Z, then AA..ZZ, AAA..ZZZ, and so on
    idx <- seq_along(df)
    names(df) <- strrep(LETTERS[(idx - 1) %% 26 + 1], (idx - 1) %/% 26 + 1)

    # Recode one column according to its type
    level.id <- function(i) {
        column <- df[[i]]
        if (is.factor(column) || is.character(column)) {
            # Categories become "<column name>.<level number>"
            paste(names(df)[i], as.numeric(as.factor(column)), sep = ".")
        } else if (is.numeric(column)) {
            # Numbers are rescaled by the column mean
            column / mean(column, na.rm = TRUE)
        } else {
            column
        }
    }

    # lapply() keeps each column's own type, unlike sapply()
    df[] <- lapply(idx, level.id)
    df
}

test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
```

That looks pretty generic and anonymized to me!

#### Notes

- Most of these approaches produce data with no missing values (NAs), yet NAs
are often the source of problems in R. That limits their usefulness.
- So always check whether your real data contain NA values, and include some in
your example if they do.
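
One way to act on these notes is to knock out a few values in the simulated data on purpose and then count what is missing. The sketch below assumes ten missing scores out of one hundred; that number, the seed, and the names `scores` and `missing_rows` are hypothetical.

```{r nasketch, comment=NA}
set.seed(99)  # arbitrary seed for repeatability
scores <- data.frame(
    id     = seq(1, 100),
    mathSS = rnorm(100, mean = 400, sd = 60)
)

# Blank out 10 randomly chosen scores to mimic missing data
missing_rows <- sample(nrow(scores), 10)
scores$mathSS[missing_rows] <- NA

colSums(is.na(scores))             # count of NAs in each column
mean(scores$mathSS)                # NA, because missing values propagate
mean(scores$mathSS, na.rm = TRUE)  # mean of the observed scores only
```

If your error disappears on complete data but comes back once NAs are present, you have already narrowed down the cause.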

### Creating the Example

Once we have our minimal dataset, we need to reproduce our error on *that dataset.*
This part is critical. If the error goes away when you apply your code to the
minimal dataset, then it will be very hard for others to diagnose the problem
remotely, and it might be time to get some "at your desk" help.

Let's look at an example where we have an error aggregating data. Let's assume
I am creating a new data frame for my example, and trying to aggregate that data
by race.

```{r aggregationproblems, comment=NA, error=TRUE}
Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)

myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
head(myAgg)
```

Why do I get an error? If you sent the above code to someone, they could
quickly run it and spot the mistake, provided they knew you were attempting to
use the `data.table` package: the `by =` syntax inside `[ ]` belongs to
`data.table`, and `Data` is still a plain data.frame.

```{r aggregationsolution, comment=NA, warning=FALSE}
library(data.table)
Data <- data.frame(
    id     = seq(1, 1000),
    gender = sample(c("male", "female"), 1000, replace = TRUE),
    mathSS = rnorm(1000, mean = 400, sd = 60),
    readSS = rnorm(1000, mean= 370, sd = 58.3),
    race   = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)

Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
head(myAgg)
```

### Session Info

However, they might not know this, so we need to provide one final piece of
information. This is known as the `sessionInfo` for our R session. To diagnose
the error it is necessary to know what system you are running on, what packages
are loaded in your workspace, and what version of R and of each package you are
using.

Thankfully, R makes this incredibly easy. Just tack on the output from the
`sessionInfo()` function. This is easy enough to copy and paste or include in
a `knitr` document.

```{r sessioninfo, comment=NA}
sessionInfo()
```
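
When the full output is more than a reader needs, a few base R calls can pull out the key facts individually. This is a sketch only; `"stats"` stands in for whichever package your question is actually about.

```{r versionsketch, comment=NA}
R.version.string         # the R release in use
packageVersion("stats")  # version of one package; swap in the one you are asking about
.Platform$OS.type        # broad operating system family
```

The complete `sessionInfo()` output is still the safer thing to attach, since it also lists every loaded package.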


### Resources

For more information, visit:

- [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
- [https://github.com/hadley/devtools/wiki/Reproducibility](https://github.com/hadley/devtools/wiki/Reproducibility)
- [http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688](http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688)