@jknowles
Created May 27, 2013 22:30
R Markdown of blog post on R minimal working examples (MWE).
How to Ask for Help using R
========================================================
The key to getting good help with an R problem is to provide a minimal working
reproducible example (MWRE). Making an MWRE is easy in R, and it helps ensure
that those helping you can identify the source of the error and, ideally, send
back corrected code instead of sending you hunting for code that works. An MWRE
needs the following items:
- a minimal dataset that produces the error
- the minimal runnable code necessary to reproduce the error, run on the dataset
provided
- the necessary information on the packages used, the R version, and the system
- a `seed` value, if random properties are part of the code
Let's look at the tools available in R to help us create each of these components
quickly and easily.
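The `seed` item in the list above can be sketched as follows. This is a minimal illustration in which the seed value 123 is arbitrary (any integer works): calling `set.seed()` before any random draws means a helper sees the same numbers you did.

```{r seeddemo, comment=NA}
# The seed value 123 is arbitrary; any integer gives a repeatable stream
set.seed(123)
first <- rnorm(5, mean = 400, sd = 60)
# Resetting the same seed reproduces exactly the same draws
set.seed(123)
second <- rnorm(5, mean = 400, sd = 60)
identical(first, second)
```

Put the `set.seed()` call at the top of the code you share, before the first call to `sample()`, `rnorm()`, or any other random function.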
### Producing a Minimal Dataset
There are three distinct options here:
1. Use a built in R dataset
2. Create a new vector / data.frame from scratch
3. Output the data you are currently working on in a shareable way
Let's look at each of these in turn and see the tools R has to help us do this.
#### Built in Datasets
There are a few canonical built-in R datasets that are really attractive for use
in help requests.
- mtcars
- diamonds (from ggplot2)
- iris
To see all the available datasets in R, simply type: `data()`. To load any of
these datasets, simply use the following:
```{r, comment=NA}
data(mtcars)
head(mtcars) # to look at the data
```
This option works great for a problem where you know you are having trouble with
a command in R. It is not a great option if you are having trouble understanding
why a command you are familiar with won't work on your data.
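To browse the built-in datasets programmatically rather than reading the `data()` listing, something like the following sketch may help. It assumes only the `datasets` package that ships with R; the variable name `available` is made up for illustration.

```{r listdatasets, comment=NA}
# data() returns an object whose `results` matrix has one row per dataset
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])
# Confirm a dataset is available before building an example around it
"mtcars" %in% available[, "Item"]
```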
Note that for education data that is fairly "realistic", there are built in
simulated datasets in the `eeptools` package, created by Jared Knowles.
```{r eeptoolsdemo, message=FALSE, warning=FALSE, comment=NA}
library(eeptools)
data(stulevel)
names(stulevel)
```
#### Create Your Own Data
Inputting data into R and sharing it back out with others is really easy. Part of
the power of R is the ability to create diverse data structures very easily.
Let's create a simulated data frame of student test scores and demographics.
```{r createdata, comment=NA}
Data <- data.frame(
id = seq(1, 1000),
gender = sample(c("male", "female"), 1000, replace = TRUE),
mathSS = rnorm(1000, mean = 400, sd = 60),
readSS = rnorm(1000, mean= 370, sd = 58.3),
race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
head(Data)
```
And, just like that, we have simulated student data. This is a great way to
evaluate problems with plotting data or with large datasets, since we can ask
R to generate a random dataset that is incredibly large if necessary. However,
let's look at the relationship among our variables using a quick plot:
```{r evalsimmeddata}
library(ggplot2)
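# qplot() is deprecated as of ggplot2 3.4.0; ggplot() draws the same scatterplot
ggplot(Data, aes(x = mathSS, y = readSS, color = race)) + geom_point() + theme_bw()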
```
It looks like race is pretty evenly distributed and there is no relationship
between `mathSS` and `readSS`, which is expected, since each variable was drawn
independently. For some applications this data is sufficient, but for others we
may wish for data that is more realistic.
```{r evalsimmeddata2, comment=NA}
table(Data$race)
cor(Data$mathSS, Data$readSS)
```
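If a more realistic relationship between the two scores is needed, one option is to build the second score partly from the first. The sketch below is only an illustration: the target correlation `rho = 0.7`, the seed, and the name `DataCor` are assumptions chosen for the example, not values from any real data.

```{r correlateddata, comment=NA}
set.seed(42)   # arbitrary seed so the example is repeatable
n   <- 1000
rho <- 0.7     # assumed target correlation, chosen for illustration
mathSS <- rnorm(n, mean = 400, sd = 60)
# Mix the standardized math score with fresh noise so that the two
# scores correlate at roughly rho, then rescale to the reading scale
z <- rho * (mathSS - 400) / 60 + sqrt(1 - rho^2) * rnorm(n)
readSS <- 370 + 58.3 * z
DataCor <- data.frame(id = seq_len(n), mathSS = mathSS, readSS = readSS)
cor(DataCor$mathSS, DataCor$readSS)
```

The sample correlation will not equal `rho` exactly, but with 1000 rows it should land close to it.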
#### Output Your Current Data
Sometimes you just want to show others the data you are using and ask why
your code won't work on it. The best practice here is to make a subset of the
data you are working on, and then output it using the `dput` command.
```{r dataoutput, comment=NA}
dput(head(stulevel, 5))
```
The resulting code can be copied and pasted into an R terminal and it will
automatically rebuild the dataset exactly as described. Note that in the above
example, it might have been better if I had first cut out all the variables
unnecessary for my problem before I executed the `dput` command. The goal is to
share only the data necessary to reproduce your problem.
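As a sketch of trimming the data before calling `dput`, the example below uses the built-in `mtcars` data so that it runs without any extra package. The names `small`, `txt`, and `rebuilt` are made up for illustration.

```{r dputroundtrip, comment=NA}
# Keep only the rows and columns needed to show the problem
small <- head(mtcars[, c("mpg", "cyl")], 5)
dput(small)
# Simulate what a helper does: turn the pasted text back into an object
txt <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
all.equal(small, rebuilt)
```

The `eval(parse())` step only stands in for the helper pasting your `dput` output into their own session.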
Also note that we never send **student level** data from LDS over e-mail,
as this is insecure. For work on student level data, it is better to either
simulate the data or to use the built-in simulated data from the `eeptools`
package to run your examples.
#### Anonymizing Your Data
It may also be the case that you want to `dput` your data, but you want to keep
the contents of your data anonymous. A Google search came up with a decent
looking function to carry this out:
```{r anonymizedata, comment=NA}
anonym <- function(df) {
  # Replace column names with generic labels: A-Z, then AA-ZZ, AAA-ZZZ, ...
  n <- length(df)
  names(df) <- strrep(LETTERS[(seq_len(n) - 1) %% 26 + 1],
                      (seq_len(n) - 1) %/% 26 + 1)
  # Recode one column: categories become "<name>.<level number>", numeric
  # columns are divided by their mean, and other types pass through unchanged
  level.id <- function(i) {
    x <- df[[i]]
    if (is.factor(x) || is.character(x)) {
      paste(names(df)[i], as.numeric(as.factor(x)), sep = ".")
    } else if (is.numeric(x)) {
      x / mean(x, na.rm = TRUE)
    } else {
      x
    }
  }
  # lapply (not sapply) keeps numeric columns numeric instead of
  # coercing every column to character
  cols <- lapply(seq_along(df), level.id)
  names(cols) <- names(df)
  as.data.frame(cols, stringsAsFactors = FALSE)
}
test <- anonym(stulevel)
head(test[, c(2:6, 28:32)])
```
That looks pretty generic and anonymized to me!
#### Notes
- Most of these solutions do not include missing data (NAs), which are often the
source of problems in R. That limits their usefulness.
- So, always check your own data for NA values.
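
A quick way to check for NA values, sketched on a small made-up data frame (the name `scores` and the positions of the missing values are invented for illustration):

```{r checkmissing, comment=NA}
set.seed(1)  # arbitrary seed
scores <- data.frame(id = 1:10, mathSS = rnorm(10, mean = 400, sd = 60))
# Introduce two missing values by hand
scores$mathSS[c(3, 7)] <- NA
# Is there any missing value at all, and how many per column?
anyNA(scores)
colSums(is.na(scores))
# Many functions return NA unless told to drop missing values
mean(scores$mathSS)
mean(scores$mathSS, na.rm = TRUE)
```

If your real data contains NAs, it is worth putting a few into your example as well, so that the example behaves the way your data does.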
### Creating the Example
Once we have our minimal dataset, we need to reproduce our error on *that dataset.*
This part is critical. If the error goes away when you apply your code to the
minimal dataset, then it will be very hard for others to diagnose the problem
remotely, and it might be time to get some "at your desk" help.
Let's look at an example where we have an error aggregating data. Let's assume
I am creating a new data frame for my example, and trying to aggregate that data
by race.
```{r aggregationproblems, comment=NA, error=TRUE}
Data <- data.frame(
id = seq(1, 1000),
gender = sample(c("male", "female"), 1000, replace = TRUE),
mathSS = rnorm(1000, mean = 400, sd = 60),
readSS = rnorm(1000, mean= 370, sd = 58.3),
race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
head(myAgg)
```
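Why do I get an error? `Data` is a plain `data.frame`, and the `by =` argument
inside `[ ]` is `data.table` syntax that a `data.frame` does not understand. If
you sent the above code to someone, they could quickly run it and spot the
mistake, provided they knew you were attempting to use the `data.table` package.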
```{r aggregationsolution, comment=NA, warning=FALSE}
library(data.table)
Data <- data.frame(
id = seq(1, 1000),
gender = sample(c("male", "female"), 1000, replace = TRUE),
mathSS = rnorm(1000, mean = 400, sd = 60),
readSS = rnorm(1000, mean= 370, sd = 58.3),
race = sample(c("H", "B", "W", "I", "A"), 1000, replace = TRUE)
)
Data <- data.table(Data)
myAgg <- Data[, list(meanM = mean(mathSS)), by= race]
head(myAgg)
```
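A helper (or you) can confirm this kind of diagnosis with a couple of checks. The sketch below uses a small made-up data frame called `toy` and base R only; the `aggregate()` call is offered as an alternative route, not as the fix used above.

```{r diagnoseclass, comment=NA}
set.seed(10)  # arbitrary seed
toy <- data.frame(
  race   = rep(c("H", "B", "W", "I", "A"), times = 4),
  mathSS = rnorm(20, mean = 400, sd = 60)
)
# A plain data.frame is not a data.table, so `by =` is not understood
inherits(toy, "data.table")
# Capture the error message instead of stopping, so it can be pasted
# into a help request
msg <- tryCatch(
  toy[, list(meanM = mean(mathSS)), by = race],
  error = function(e) conditionMessage(e)
)
msg
# Base R alternative that needs no extra package
baseAgg <- aggregate(mathSS ~ race, data = toy, FUN = mean)
baseAgg
```

Including the exact error message in your help request saves the reader from having to run the code just to see it.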
### Session Info
However, they might not know this, so we need to provide one final piece of
information. This is known as the `sessionInfo` for our R session. To diagnose
the error it is necessary to know what system you are running on, what packages
are loaded in your workspace, and what version of R and of each package you are
using.
Thankfully, R makes this incredibly easy. Just tack on the output from the
`sessionInfo()` function. This is easy enough to copy and paste or include in
a `knitr` document.
```{r sessioninfo, comment=NA}
sessionInfo()
```
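When the full `sessionInfo()` output is more than you need, a few base R calls give the same facts piecemeal. This is a small sketch; the name `info` is made up, and `"stats"` is used only because it ships with every R installation.

```{r versioninfo, comment=NA}
# One-line summary of the R version in use
R.version.string
# Version of a single package; "stats" ships with R, so this always works
packageVersion("stats")
# Capture the full report as text, ready to paste below your question
info <- capture.output(sessionInfo())
head(info, 3)
```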
### Resources
For more information, visit:
- [http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)
- [https://github.com/hadley/devtools/wiki/Reproducibility](https://github.com/hadley/devtools/wiki/Reproducibility)
- [http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688](http://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l/10458688#10458688)