Skip to content

Instantly share code, notes, and snippets.

@mrdwab
Last active March 3, 2020 06:10
Show Gist options
  • Save mrdwab/6123681 to your computer and use it in GitHub Desktop.
Save mrdwab/6123681 to your computer and use it in GitHub Desktop.
`reshape()` for "unbalanced" datasets.
uReshape <- function(data, id.vars, var.stubs, sep) {
# vectorized version of grep
vGrep <- Vectorize(grep, "pattern", SIMPLIFY = FALSE)
# Isolate the columns starting with the var.stubs
temp <- names(data)[names(data) %in% unlist(vGrep(var.stubs, names(data), value = TRUE))]
# Split the vector and reasemble into a data.frame
x <- do.call(rbind.data.frame, strsplit(temp, split = sep))
names(x) <- c("VAR", paste(".time", 1:(length(x)-1), sep = "_"))
# Prep to decide whether normal reshape or unbalanced reshape
xS <- split(x$.time_1, x$VAR)
xL <- unique(unlist(xS))
if (isTRUE(all(sapply(xS, function(x) all(xL %in% x))))) {
# Everything looks ok for normal `reshape` to work
reshape(data, direction = "long", idvar = id.vars,
varying = lapply(vGrep(var.stubs, names(data), value = TRUE), sort),
sep = sep, v.names = var.stubs)
} else {
# Padding required to "balance" the data
# Find out which variables need to be padded
newVars <- unlist(lapply(names(xS), function(y) {
temp <- xL[!xL %in% xS[[y]]]
if (length(temp) == 0) {
temp <- NULL
} else {
paste(y, temp, sep = sep)
}
}))
# Create matrix of NAs
myMat <- setNames(data.frame(matrix(NA, nrow = nrow(data), ncol = length(newVars))), newVars)
# Bind with original data.frame
out <- cbind(data, myMat)
# Use `reshape` as normal
reshape(out, direction = "long", idvar = id.vars,
varying = lapply(vGrep(var.stubs, names(out),
value = TRUE), sort),
sep = sep, v.names = var.stubs)
}
}
@mrdwab
Copy link
Author

mrdwab commented Jul 31, 2013

R's reshape() function was written with longitudinal datasets in mind. Perhaps because of this, R is unhappy when you try to reshape "unbalanced" wide data--datasets where data for time-varying variables are missing a "time".

Here's an example of such a dataset. Note that there are three "times" for varA, two "times" for varB, and one "time" for varC.

set.seed(1)
mydf <- data.frame(id_1 = 1:6, id_2 = c("A", "B"), varA_1 = sample(letters, 6), 
                   varA_2 = sample(letters, 6), varA_3 = sample(letters, 6),
                   varB_2 = sample(10, 6), varB_3 = sample(10, 6),
                   varC_3 = rnorm(6), other = c("a1", "a2", "a3"),
                   variables = c(1, 2, 3, 1, 1, 1))
mydf
#   id_1 id_2 varA_1 varA_2 varA_3 varB_2 varB_3      varC_3 other variables
#1    1    A      g      y      r      4      3 -0.04493361    a1         1
#2    2    B      j      q      j      7      4 -0.01619026    a2         2
#3    3    A      n      p      s      8      1  0.94383621    a3         3
#4    4    B      u      b      l      2     10  0.82122120    a1         1
#5    5    A      e      e      p     10      6  0.59390132    a2         1
#6    6    B      s      d      u      1      2  0.91897737    a3         1

Here's the error you get if you try to use reshape() on it:

reshape(mydf, direction = "long", idvar = c("id_1", "id_2"), 
        varying = 3:8, sep = "_")
# Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
#   'varying' arguments must be the same length

A couple of ideas came to mind for dealing with such data when I came across a few datasets structured this way:

  • stack and merge -- likely to be slow and inefficient
  • Create columns of NA values to "balance" the dataset, and then use reshape

To that end, I wrote a function called uReshape() which does its job quite nicely for the data that I designed it to work with, but I'm not sure how "generally applicable" it is.

uReshape(mydf, id.vars=c("id_1", "id_2"), 
         var.stubs = c("varA", "varB", "varC"), 
         sep = "_")
#       id_1 id_2 other variables time varA varB        varC
#1.A.1    1    A    a1         1    1    g   NA          NA
#2.B.1    2    B    a2         2    1    j   NA          NA
#3.A.1    3    A    a3         3    1    n   NA          NA
#4.B.1    4    B    a1         1    1    u   NA          NA
#5.A.1    5    A    a2         1    1    e   NA          NA
#6.B.1    6    B    a3         1    1    s   NA          NA
#1.A.2    1    A    a1         1    2    y    4          NA
#2.B.2    2    B    a2         2    2    q    7          NA
#3.A.2    3    A    a3         3    2    p    8          NA
#4.B.2    4    B    a1         1    2    b    2          NA
#5.A.2    5    A    a2         1    2    e   10          NA
#6.B.2    6    B    a3         1    2    d    1          NA
#1.A.3    1    A    a1         1    3    r    3 -0.04493361
#2.B.3    2    B    a2         2    3    j    4 -0.01619026
#3.A.3    3    A    a3         3    3    s    1  0.94383621
#4.B.3    4    B    a1         1    3    l   10  0.82122120
#5.A.3    5    A    a2         1    3    p    6  0.59390132
#6.B.3    6    B    a3         1    3    u    2  0.91897737

@corynissen
Copy link

Thanks man, just what I needed.

@simrinm
Copy link

simrinm commented Mar 14, 2017

This is great. How could I modify this code to convert unbalanced long data wide?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment