Last active
January 13, 2016 17:10
-
-
Save garyfeng/27e7f8e406192a8cb33a to your computer and use it in GitHub Desktop.
A function to back fill NAs in a vector with the last non-NA value
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# A function to back fill NAs in a vector with the last non-NA value | |
# https://gist.github.com/garyfeng/27e7f8e406192a8cb33a | |
backFillNA<- function (x) { | |
nas<- which(is.na(x)) | |
# trick from http://stackoverflow.com/questions/24837401/find-consecutive-values-in-vector-in-r | |
naList<-split(nas, cumsum(c(1, diff(nas) != 1))) | |
# get the backfill values | |
valueList<-lapply(naList, function(nalist) { | |
prev<- nalist[1]-1 | |
#print(nalist) | |
if (prev<=0) prev <-1 | |
x[nalist]<-x[prev] | |
return(x[nalist]) | |
}) | |
# now back fill | |
# can't use unlist(), as it converts everything to a common class. | |
# see http://stackoverflow.com/questions/13859905/returning-a-vector-of-class-posixct-with-vapply | |
#x[unlist(naList)] <- unlist(valueList) | |
x[unlist(naList)] <- do.call(c, valueList) | |
return (x) | |
} | |
x<-c("A","B",NA,NA,"C",NA,NA,NA,NA,"D",NA,NA); | |
#> x | |
# [1] "A" "B" NA NA "C" NA NA NA NA "D" NA NA | |
#> backFillNA(x) | |
# [1] "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D" | |
######################## | |
# The following recursive solution is taken from: | |
# https://stat.ethz.ch/pipermail/r-help/2008-July/169199.html | |
# it uses Recall(), R's method for recursive calling, to do back filling | |
# It works well for short vectors, but a test on a vector of 20k elements with many NAs | |
# and the recursion failed after a long while. | |
rna <- function(z) { | |
y <- c(NA, head(z, -1)) | |
z <- ifelse(is.na(z), y, z) | |
if (any(is.na(z))) Recall(z) else z | |
} | |
#> x | |
# [1] "A" "B" NA NA "C" NA NA NA NA "D" NA NA | |
#> rna(x) | |
# [1] "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D" "D" |
Compared the non-recursive version with a simple for
loop based method:
loopFillNA <- function(x) {
lastV <- x[1]
for (i in 1:length(x)) {
if(is.na(x[i])) x[i] <- lastV else lastV <- x[i]
}
x
}
We have a large numeric vector t0
with mostly NAs.
Testing results:
> system.time(loopFillNA(t0[1:10000]))
user system elapsed
1.366 0.977 2.344
> system.time(rna(t0[1:10000]))
user system elapsed
0.961 0.238 1.198
> system.time(backFillNA(t0[1:10000]))
user system elapsed
0.014 0.001 0.015
hmm... the performance of the non-recursive version may be a function of how many NAs
there are -- more precisely, how many NA
runs there are in the data. We need to test this.
c.f. the Pandas implementation of forward and backward fillNA at http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-reindexing and http://pandas.pydata.org/pandas-docs/stable/missing_data.html#cleaning-filling-missing-data
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Tried this with a vector with 29K elements, and recursion was too deep.
Thinking a non-recursive approach: