R Markdown of blog post on R minimal working examples (MWE).
How to Ask for Help using R | |
======================================================== | |
The key to getting good help with an R problem is to provide a minimal working
reproducible example (MWRE). Making an MWRE is easy in R, and it helps ensure
that those helping you can identify the source of the error and, ideally, send
you back corrected code instead of leaving you to hunt for code that works. An
MWRE needs the following items:
- a minimal dataset that produces the error
- the minimal runnable code necessary to reproduce the error, run on the dataset
provided
- the necessary information on the packages used, the R version, and the system
- a `seed` value (set with `set.seed()`), if random properties are part of the code
Let's look at the tools available in R to help us create each of these components | |
quickly and easily. | |
### Producing a Minimal Dataset | |
There are three distinct options here: | |
1. Use a built in R dataset | |
2. Create a new vector / data.frame from scratch | |
3. Output the data you are currently working on in a shareable way | |
Let's look at each of these in turn and see the tools R has to help us do this. | |
#### Built in Datasets | |
There are a few canonical built-in R datasets that are especially attractive for
use in help requests:
- mtcars | |
- diamonds (from ggplot2) | |
- iris | |
To see all the available datasets in R, simply type: `data()`. To load any of | |
these datasets, simply use the following: | |
```{r, comment=NA} | |
data(mtcars) | |
head(mtcars) # to look at the data | |
``` | |
This option works great for a problem where you know you are having trouble with | |
a command in R. It is not a great option if you are having trouble understanding | |
why a command you are familiar with won't work on your data. | |
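As a small sketch of how to explore what is available, the built-in datasets can also be listed for a single package and inspected before you pick one. The object name `available` is just an illustrative choice.

```{r listdatasets, comment=NA}
# List the datasets shipped with the base "datasets" package
available <- data(package = "datasets")$results[, "Item"]
head(available)
# Confirm that mtcars is among them, then inspect its structure
"mtcars" %in% available
str(mtcars)
```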
Note that for education data that is fairly "realistic", there are built in | |
simulated datasets in the `eeptools` package, created by Jared Knowles. | |
```{r eeptoolsdemo, message=FALSE, warning=FALSE, comment=NA} | |
library(eeptools) | |
data(stulevel) | |
names(stulevel) | |
``` | |
#### Create Your Own Data | |
Inputting data into R and sharing it back out with others is really easy. Part of
the power of R is the ability to create diverse data structures very easily. | |
Let's create a simulated data frame of student test scores and demographics. | |
```{r createdata, comment=NA}
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
head(Data)
```
And, just like that, we have simulated student data. This is a great way to | |
evaluate problems with plotting data or with large datasets, since we can ask | |
R to generate a random dataset that is incredibly large if necessary. However, | |
let's look at the relationship among our variables using a quick plot: | |
```{r evalsimmeddata} | |
library(ggplot2) | |
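# qplot() is deprecated in recent ggplot2 releases, so build the plot with ggplot()
ggplot(Data, aes(x = mathSS, y = readSS, color = race)) + geom_point() + theme_bw()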
``` | |
It looks like race is pretty evenly distributed and there is no relationship
between `mathSS` and `readSS`, which is expected because each variable was drawn
independently. For some applications this data is sufficient, but for others we
may wish for data that is more realistic.
```{r evalsimmeddata2, comment=NA} | |
table(Data$race) | |
cor(Data$mathSS, Data$readSS) | |
``` | |
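If more realistic data is needed, one option is to build the relationship in by hand. The sketch below is an illustration under stated assumptions rather than a calibrated model: the seed value, the slope of 0.6, and the noise level are arbitrary choices, and `SimData` is a name introduced here. Setting the seed first means anyone who runs the code gets the same random draws.

```{r correlateddata, comment=NA}
set.seed(1234)  # arbitrary seed so the random draws are reproducible
n <- 1000
mathSS <- rnorm(n, mean = 400, sd = 60)
# Make reading scores depend on math scores, plus independent noise
readSS <- 370 + 0.6 * (mathSS - 400) + rnorm(n, mean = 0, sd = 45)
SimData <- data.frame(id = seq_len(n), mathSS = mathSS, readSS = readSS)
# The correlation should now be clearly positive rather than near zero
cor(SimData$mathSS, SimData$readSS)
```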
#### Output Your Current Data | |
Sometimes you just want to show others the data you are using and find out why
your code won't work on it. The best practice here is to make a subset of the
data you are working on, and then output it using the `dput` command.
```{r dataoutput, comment=NA} | |
dput(head(stulevel, 5)) | |
``` | |
The resulting code can be copied and pasted into an R terminal, and it will
rebuild the dataset exactly as described. Note that in the above example, it
would have been better to first cut out all the variables unnecessary for my
problem before executing the `dput` command. The goal is to share only the data
necessary to reproduce your problem.
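As a sketch of that advice, the example below keeps only a few rows and columns before calling `dput`, using the built-in `mtcars` data so it runs without any extra packages. The choice of columns and the names `small` and `rebuilt` are illustrative.

```{r dputsubset, comment=NA}
# Keep only the rows and columns needed to show the problem
small <- head(mtcars[, c("mpg", "cyl", "wt")], 3)
dput(small)
# Check that the printed code rebuilds an equivalent data frame
rebuilt <- eval(parse(text = capture.output(dput(small))))
all.equal(small, rebuilt)
```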
Also note that we never send **student level** data from LDS over e-mail,
as this is insecure. For work on student level data, it is better either to
simulate the data or to use the built-in simulated data from the `eeptools`
package to run your examples.
#### Anonymizing Your Data | |
It may also be the case that you want to `dput` your data, but you want to keep
the contents of your data anonymous. A Google search came up with a decent
looking function to carry this out, lightly adapted here:
```{r anonymizedata, comment=NA}
anonym <- function(df) {
  # Replace column names with letters: A..Z, then AA..ZZ, AAA..ZZZ, and so on
  n <- length(df)
  reps <- ceiling(n / 26)
  labels <- unlist(lapply(seq_len(reps), function(k) strrep(LETTERS, k)))
  names(df) <- labels[seq_len(n)]
  # Mask the contents of a single column
  level.id <- function(i) {
    x <- df[[i]]
    if (is.factor(x) || is.character(x)) {
      # Replace each distinct value with "<column>.<level number>"
      paste(names(df)[i], as.numeric(as.factor(x)), sep = ".")
    } else if (is.numeric(x)) {
      # Rescale numeric values by the column mean
      x / mean(x, na.rm = TRUE)
    } else {
      x
    }
  }
  # lapply (rather than sapply) keeps numeric columns numeric
  DF <- data.frame(lapply(seq_along(df), level.id), stringsAsFactors = FALSE)
  names(DF) <- names(df)
  DF
}
test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
```
That looks pretty generic and anonymized to me! | |
#### Notes | |
- Most of these solutions do not include missing data (NAs), which are often the
source of problems in R. That limits their usefulness.
- So always check whether your real data contains NA values, and include some in
your example if it does.
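One way to cover that gap is to add missing values to a simulated dataset on purpose and confirm that your example still behaves the way your real data does. This is a minimal sketch; the seed, the number of missing values, and the name `DataNA` are arbitrary choices made here.

```{r injectna, comment=NA}
set.seed(42)  # arbitrary seed for reproducibility
DataNA <- data.frame(
  id = seq(1, 100),
  mathSS = rnorm(100, mean = 400, sd = 60)
)
# Blank out 5 randomly chosen scores
DataNA$mathSS[sample(nrow(DataNA), 5)] <- NA
# Count the missing values in each column
colSums(is.na(DataNA))
# mean() returns NA unless missing values are dropped explicitly
mean(DataNA$mathSS)
mean(DataNA$mathSS, na.rm = TRUE)
```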
### Creating the Example | |
Once we have our minimal dataset, we need to reproduce our error on *that dataset.* | |
This part is critical. If the error goes away when you apply your code to the | |
minimal dataset, then it will be very hard for others to diagnose the problem | |
remotely, and it might be time to get some "at your desk" help. | |
Let's look at an example where we have an error aggregating data. Let's assume | |
I am creating a new data frame for my example, and trying to aggregate that data | |
by race. | |
```{r aggregationproblems, comment=NA, error=TRUE}
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
head(myAgg)
```
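Why do I get an error? The `Data[, list(...), by = race]` syntax belongs to the
`data.table` package, but `Data` is a plain `data.frame` and `data.table` was
never loaded. If you sent the above code to someone, they could quickly run it
and spot the mistake, provided they knew you were attempting to use the
`data.table` package.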
```{r aggregationsolution, comment=NA, warning=FALSE}
library(data.table)
Data <- data.frame(
  id = seq(1, 1000),
  gender = sample(c("male", "female"), 1000, replace = TRUE),
  mathSS = rnorm(1000, mean = 400, sd = 60),
  readSS = rnorm(1000, mean = 370, sd = 58.3),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by = race]
head(myAgg)
```
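For comparison, the same kind of aggregation can be done without `data.table`. The sketch below uses base R's `aggregate` on a plain data frame; the seed and the names `PlainData` and `baseAgg` are choices made for this illustration.

```{r aggregationbase, comment=NA}
set.seed(2023)  # arbitrary seed for reproducibility
PlainData <- data.frame(
  mathSS = rnorm(1000, mean = 400, sd = 60),
  race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
# Mean math score by race, using the formula interface
baseAgg <- aggregate(mathSS ~ race, data = PlainData, FUN = mean)
baseAgg
```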
### Session Info | |
However, they might not know which packages you intended to use, so we need to
provide one final piece of information: the `sessionInfo` for our R session. To
diagnose the error, it is necessary to know what system you are running on, what
packages are loaded in your workspace, and what versions of R and of each
package you are using.
Thankfully, R makes this incredibly easy. Just tack on the output from the | |
`sessionInfo()` function. This is easy enough to copy and paste or include in | |
a `knitr` document. | |
```{r sessioninfo, comment=NA} | |
sessionInfo() | |
``` | |
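If the full output is more than you need, the individual pieces can be pulled out as well. A short sketch using base R only; `info_text` is a name introduced here.

```{r sessioninfopieces, comment=NA}
# R version as a single string
R.version.string
# Version of one specific package (stats ships with R)
packageVersion("stats")
# Capture the full sessionInfo() printout as text for pasting
info_text <- capture.output(sessionInfo())
head(info_text, 3)
```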
### Resources | |
For more information, visit: | |
- [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) | |
- [https://github.com/hadley/devtools/wiki/Reproducibility](https://github.com/hadley/devtools/wiki/Reproducibility) | |
- [http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688](http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688) |