vankesteren/debugdata.md

## debugdata.md

      
Debug Dataset

Description

debug.csv is the test data file for JASP. It contains 100 observations on a wide range of variables, designed to test functions in JASP.

Columns


9 continuous variables

Normal: Normally distributed data
Gamma: Gamma distributed data (only positive!)
Binomial: random 0/1 data
Expon: Exponentiated normal data (wide interval)
Wide: a very wide interval on the order of 10^100
Narrow: a very narrow interval on the order of 10^-99
Outlier: normal data with several outliers
Cor1&2: Correlated normal data


5 factor variables

Gender: m/f coded
Experim: Experimental/Control coded
Five: Factor with 5 levels
Fifty: Factor with 50 levels
Outlier: Factor with 4 levels, 2 of which have only one observation (one of these has a very long level name)


15 debug variables

String: Random lowercase letters a-z
Miss1: Normal data with 1 missing value
Miss30: Normal data with 30 missing values
Miss80: Normal data with 80 missing values
Miss99: Normal data with 99 missing values
BinMiss20: Binomial data with 20 missing values
NaN: Full column with NaN
NaN10: Normal data with 10 NaN
Inf: Full column with Inf
Collin1&2&3: Three collinear normal variables
Equal1&2: Two normal variables with exactly the same values
Same: Column with all "12.3" values (no variance)


How to add variables

To recreate the dataset, open the R file testData.R, which is available as a gist. Additional columns can be added there. Once the dataset has been generated, save it as a CSV file, open it in a spreadsheet editor, and replace every cell containing the value NA with an empty cell.
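The manual replacement step could probably be skipped by writing the file with the `na` argument of `write.csv`. The following is a minimal sketch, not part of testData.R: the data frame `dat`, the column `contUniform`, the seed, and the temporary output path are all hypothetical stand-ins.

```r
# Hypothetical stand-in for the generated dataset; in testData.R this would be
# the full data frame built from the continuous, factor and debug columns.
set.seed(1)  # assumption: the original script does not set a seed
dat <- data.frame(contNormal = rnorm(5),
                  debMiss1   = c(1.5, NA, 3.2, 4.8, 0.7))

# Add a new (hypothetical) column in the same style as the existing ones
dat$contUniform <- runif(nrow(dat))

# na = "" writes missing values as empty cells instead of the string NA
out <- tempfile(fileext = ".csv")
write.csv(dat, out, row.names = FALSE, na = "")

csv_lines <- readLines(out)
print(csv_lines)
```

Whether JASP treats the resulting file identically to one edited by hand is not verified here, so it is worth opening the result once to check.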
The code

Continuous

library(MASS) # Provides mvrnorm()

s <- matrix(c(1, 0.68, 0.68, 1), nrow = 2) # Covariance matrix with correlation 0.68
mvn <- mvrnorm(100, c(0, 0), s)

cont <- data.frame(contNormal = rnorm(100), # Standard Normal
                   contGamma = rgamma(100,2), # Gamma Distributed
                   contBinom = rbinom(100, 1, 0.4), # Bernoulli trials
                   contExpon = exp(rnorm(100, sd = 50)), # Exponentiated normal
                   contWide = runif(100,-9e99,9e99), # Very wide interval
                   contNarrow = runif(100,-1e-99,1e-99), # Very narrow
                   contOutlier = sample(c(rnorm(95), # With outliers
                                          c(12,-23,4.5,5.7,-3.12)),100),
                   contcor1 = mvn[,1], # Multivariate normal with cor 0.68
                   contcor2 = mvn[,2])
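Note that 0.68 is the population correlation of the bivariate normal, so the sample correlation between contcor1 and contcor2 will vary from run to run. If an exact sample correlation were wanted, one option is the `empirical` argument of `mvrnorm`. The sketch below compares the two; the seed and the variable names `mvn_random`, `mvn_exact`, `r_random` and `r_exact` are illustrative assumptions, not part of the original script.

```r
library(MASS)  # provides mvrnorm()

set.seed(123)  # hypothetical seed, only to make this sketch repeatable
s <- matrix(c(1, 0.68, 0.68, 1), nrow = 2)

# Default: 0.68 is the population correlation, so the sample value varies
mvn_random <- mvrnorm(100, c(0, 0), s)
r_random <- cor(mvn_random[, 1], mvn_random[, 2])

# empirical = TRUE rescales the draw so the sample moments match mu and Sigma
mvn_exact <- mvrnorm(100, c(0, 0), s, empirical = TRUE)
r_exact <- cor(mvn_exact[, 1], mvn_exact[, 2])

print(c(random = r_random, exact = r_exact))
```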

Factors

fac <- data.frame(facGender = factor(sample(rep(c("m", "f"), 50), replace = FALSE)),
                  facExperim = factor(rep(c("control", "experimental"), 50)),
                  facFive = factor(rep(1:5, 20)),
                  facFifty = factor(c(1:50,1:50)),
                  facOutlier = factor(c(rep(c("f1","f2"),49), "f3",
                                        "totallyridiculoussuperlongfactorname")))

Debug

col <- rbeta(100, 23, 12)
eq <- rnorm(100,10,2.5) * rgamma(100,1)

deb <- data.frame(debString = sample(letters, 100, TRUE), # Random lowercase letters
                  debMiss1 = sample(c(rnorm(99,10,25), NA)), # Varying numbers of missing values
                  debMiss30 = sample(c(rnorm(70,10,25), rep(NA,30))),
                  debMiss80 = sample(c(rnorm(20,10,25), rep(NA,80))),
                  debMiss99 = sample(c(rnorm(1,10,25), rep(NA,99))),
                  debBinMiss20 = sample(c(rbinom(80,1,0.6), rep(NA, 20))),
                  debNaN = rep(NaN, 100), # All NaN
                  debNaN10 = sample(c(rnorm(90,10,25), rep(NaN,10))), # 10 NaN
                  debInf = rep(Inf, 100), # All Inf values
                  debCollin1 = col, # Three multicollinear variables
                  debCollin2 = col + 2,
                  debCollin3 = col * 2,
                  debEqual1 = eq, # Two exactly equal variables
                  debEqual2 = eq,
                  debSame = rep(12.3,100)) # Exactly the same value 100 times
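After generating the data, a quick sanity check can confirm that the debug columns behave as described. The sketch below rebuilds a subset of the debug columns and counts missing values; the seed and the names `n_missing`, `n_nan` and `r_collin` are assumptions made for illustration. One detail worth knowing is that `is.na()` returns TRUE for both NA and NaN, whereas `is.nan()` returns TRUE only for NaN.

```r
set.seed(42)  # hypothetical seed, only to make this sketch repeatable
col <- rbeta(100, 23, 12)
deb <- data.frame(debMiss30  = sample(c(rnorm(70, 10, 25), rep(NA, 30))),
                  debNaN10   = sample(c(rnorm(90, 10, 25), rep(NaN, 10))),
                  debCollin1 = col,
                  debCollin2 = col + 2,
                  debCollin3 = col * 2)

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
n_missing <- colSums(is.na(deb))
n_nan     <- sapply(deb, function(x) sum(is.nan(x)))

# The collinear columns are exact linear transforms, so they correlate at 1
r_collin <- cor(deb$debCollin1, deb$debCollin3)

print(rbind(n_missing, n_nan))
print(r_collin)
```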