Skip to content

Instantly share code, notes, and snippets.

@robjhyndman
Last active July 14, 2022 10:00
Show Gist options
  • Save robjhyndman/d9eb5568a78dbc79f7acc49e22553e96 to your computer and use it in GitHub Desktop.
Save robjhyndman/d9eb5568a78dbc79f7acc49e22553e96 to your computer and use it in GitHub Desktop.
tsCV with xreg
library(forecast)
fc <- function(y, h, xreg, newxreg) {
fit <- auto.arima(y, xreg=xreg)
forecast(fit, xreg=newxreg, h=h)
}
y <- ts(rnorm(100))
x <- matrix(ts(rnorm(100)),ncol=1)
tsCV(y, fc, xreg=x, h=1)
@robjhyndman
Copy link
Author

You have used Y.train and X.forc, but created y and x. The code works if you use tsCV(y, fc, xreg=x, h=1).

@mvgarcia07
Copy link

So sorry for my previews typo and thanks for your support, if I run this code, I get the same error.

y  <- ts(rnorm(50), start=(1999),fr=4)
x  <- ts(matrix(rnorm(100),ncol=2),start=(1999),fr=4)
Y.train <- window(y,start = c(1999,2), end=c(2010,1))
X.forc  <- window(x,start = c(1999,2), end=c(2011,2))
e = tsCV(Y.train, fc, xreg=X.forc, h=1)

@robjhyndman
Copy link
Author

Your xreg needs to be of the same length as the training data. Change the end argument to match.

@mvgarcia07
Copy link

Thanks for the support, it worked.

@balajikesavan90
Copy link

what is the significance of this piece of code if xreg needs to be the same length as training data? wouldn't the IF condition always return FALSE?
if(NROW(xreg) < length(y) + h)
stop("Not enough xreg data for forecasting")

@robjhyndman
Copy link
Author

xreg should contain data for both the training and test periods.

@local-maxima
Copy link

Thanks. This is very useful. I have another question. I use tsCV in the forecast package to evaluate forecasts of different models over different forecasting horizons.
However, when I run the following multivariate mlp, e_mlp consists out of NA NA NA.
I would very much appreciate any help.
See below the minimal reproducible example.

library(forecast)
library(RSNNS)
xreg = as.matrix(cbind(1:120, 121:240),ncol=2)
y <- as.ts(1:120)
f <- function(xreg, y,h) {
  X <- xreg[1:length(y)]
  newX <- xreg[1:(length(y)+h)]
  fit <- mlp(y,xreg=matrix(X))
  forecast(fit, xreg=matrix(newX), h=h)
}
e_mlp <- tsCV(xreg = xreg, y= y, f, h= 15)
mse_mlp <- (colMeans(e_mlp^2, na.rm=TRUE))

when I leave out the cross-validation it works:

xreg = as.matrix(cbind(1:120, 121:240),ncol=2)
y <- as.ts(1:120)
h = 15
X <- xreg[1:length(y)]
newX <- xreg[1:(length(y)+h)]
fit <- mlp(y,xreg=matrix(X))
forecast(fit, xreg=matrix(newX), h=h)

@robjhyndman
Copy link
Author

The code without cross-validation doesn't work. mlp() does not have an xreg argument or a forecast method. You can fit the model using

fit <- mlp(x=xreg[1:length(y),], y=y)

but to make predictions, you will need to use the predict.rsnns() function.

@karlindberg
Copy link

Hello Rob! I have a question about how to properly use the tsCV function in R. I've made up some numbers in order to try to create a scenario as close to the data that I have as possible, which is why I've used an Arima(1,1,0) below, although it might not make much sense with the made up data. Let's say that I have the following:

Y <- c(10,12,16,18,22,21,22,19,18,15,12,13,15,16,14,16,18,19,22,NA)

X1 <- c(13,15,19,21,20,23,20,16,14,10,12,14,15,14,16,17,20,22,24,20)

X2 <- c(27,31,40,41,40,45,42,35,32,27,25,27,31,31,29,33,37,41,42,56)

X_all <- cbind(X1,X2)

m1 <- Arima(Y, order = c(1,1,0), xreg = X_all, method = "ML")
forecasts <- forecast(m1, xreg = X_all)

i.e. that I have 20 observations of X1 and X2, and 19 of Y. I want to forecast the 20th observation of Y based on the Arima(1,1,0) model along with the X1 and X2 values, and I do so using the forecast functions and get a value.

Now, let's assume that I want to compare how good this model (m1) is at forecasting values of Y compared to another hypothetical model. I want to cross-validate this model using the tsCV function, and compare rmse-values between m1 and other models.

I have tried many different ways of writing the tsCV-code, but I think that I don't quite understand how to use it.

What I'm doing right now is the following:

tsCV(Y, forecasts, h = 1)

but I can only seem to generate a bunch of NA.

Any help is highly appreciated!

@robjhyndman
Copy link
Author

This gist is out-of-date. See the help file for tsCV() where there is an example of exactly what you are asking.

library(forecast)
Y <- c(10, 12, 16, 18, 22, 21, 22, 19, 18, 15, 12, 13, 15, 16, 14, 16, 18, 19, 22, NA)
X1 <- c(13, 15, 19, 21, 20, 23, 20, 16, 14, 10, 12, 14, 15, 14, 16, 17, 20, 22, 24, 20)
X2 <- c(27, 31, 40, 41, 40, 45, 42, 35, 32, 27, 25, 27, 31, 31, 29, 33, 37, 41, 42, 56)
X_all <- cbind(X1, X2)

arima_xreg <- function(x, h, xreg, newxreg) {
  forecast(Arima(x, order = c(1, 1, 0), xreg = xreg), xreg = newxreg)
}

tsCV(Y, arima_xreg, xreg = X_all)
#> Time Series:
#> Start = 1 
#> End = 20 
#> Frequency = 1 
#>  [1]            NA            NA -9.136603e-06            NA -4.493514e+00
#>  [6] -2.205889e+00 -3.136130e+00  8.759787e-01 -3.802611e+00 -2.867779e+00
#> [11]  5.554371e-01 -1.070903e-01  5.036845e-01  4.452301e-01 -1.525549e-01
#> [16]  6.553010e-01 -6.584875e-01  3.012134e+00            NA            NA

Created on 2021-05-11 by the reprex package (v2.0.0)

@karlindberg
Copy link

Thank you very much for your quick reply! Much appreciated!

I see, that makes sense. One more question however - how come that observation 4 of the tsCV output is NA, while observation 3 and 5 is not? I'm having trouble understanding the logic.

Also, on that topic:
I assume that the first 2 observations of tsCV are NA since it cannot forecast these values with so little data, and that the last observation (20) is NA since it cannot compare the forecast to any true value?
How come that observation 19 of tsCV is NA? Shouldnt it be possible to forecast it and compare to the real value?

@robjhyndman
Copy link
Author

When fitting a model to a tiny data set, sometimes it will return some coefficients and sometimes it won't due to numerical difficulties. Observation 4 will be NA because it was unable to fit a model to the training data. Observation 19 is NA because the last element of Y is NA so it can't compute the difference between the forecast and the actual.

@karlindberg
Copy link

Okay, I see! Thank you very much for your explanations Rob!

@jvdp99
Copy link

jvdp99 commented May 14, 2021

Hello Rob, I have tried running your example

> fc <- function(y, h, xreg)
> {
>   ncol <- NCOL(xreg)
>   X <- matrix(xreg[1:length(y), ], ncol = ncol)
>   if(NROW(xreg) < length(y) + h)
>     stop("Not enough xreg data for forecasting")
>   newX <- matrix(xreg[length(y) + (1:h), ], ncol = ncol)
>   fit <- auto.arima(y, xreg=X)
>   forecast(fit, xreg = newX, h=h)
> }
> 
> y <- ts(rnorm(50))
> x <- ts(matrix(rnorm(100),ncol=2))
> 
> tsCV(y, fc, xreg=x, h=1)
>

but all I get is NA's. When I check the fc() function separately, I get the stop message "Not enough xreg data for forecasting". It seems inherent in the construction of y and x that NROW(xreg) < length(y) + h always holds. Am I doing something wrong here? Thank you.

EDIT: I read that this gist is out-of-date. When using the example from the help file, it works fine.

far2_xreg <- function(x, h, xreg, newxreg) {
  forecast(Arima(x, order=c(2,0,0), xreg=xreg), xreg=newxreg)
}

y <- ts(rnorm(50))
xreg <- matrix(rnorm(100),ncol=2)
e <- tsCV(y, far2_xreg, h=3, xreg=xreg)

But when I use auto.arima instead of Arima, it suddenly does not work anymore.

far2_xreg <- function(x, h, xreg, newxreg) {
  forecast(auto.arima(x, xreg=xreg), xreg=newxreg)
}

y <- ts(rnorm(50))
xreg <- matrix(rnorm(100),ncol=2)
e <- tsCV(y, far2_xreg, h=3, xreg=xreg)

I get the error message:

Error in NextMethod("[<-") : 
  number of items to replace is not a multiple of replacement length
In addition: Warning message:
In `-.default`(y[i + (1:h)], fc$mean) :
  longer object length is not a multiple of shorter object length

Any ideas why that is the case? Thanks!

@robjhyndman
Copy link
Author

You will need to provide the forecast horizon to the forecast function:

far2_xreg <- function(x, h, xreg, newxreg) {
  forecast(auto.arima(x, xreg=xreg), h=h, xreg=newxreg)
}

@jvdp99
Copy link

jvdp99 commented May 18, 2021

Thank you Rob, that works wonders!

Another question that comes to mind is how to implement a rolling window in this matter. I have read your other posts about time series cross validation, I understand the concept, but it seems like adding a window command in the forecast function leads to NA results again.

More specifically, I have time series of varying length (100 to 800 observations) and I use 11 dummy variables for months. If I understand correctly, if window=null, tsCV just uses an expanding window. Instead, I would like to use a rolling window forecast as I suspect this will perform better in my data because there are structural breaks.

Hence, my question is two-fold. Firstly, how should one specify the use of a rolling window correctly in the following

far2_xreg <- function(x, h, xreg, newxreg) {
  forecast(auto.arima(x, xreg=xreg), xreg=newxreg)
}

y <- ts(rnorm(50))
xreg <- matrix(rnorm(100),ncol=2)
e <- tsCV(y, far2_xreg, h=3, xreg=xreg, window = )

Secondly, how does one determine the optimal window length? Ideally, it should be as small as possible. Is there a function available in forecast package that uses cross-validation to determine an optimal value for this as well?

Thank you in advance.

@robjhyndman
Copy link
Author

  1. Using a window works with this code. Just set window=20 (or whatever value you like).
  2. Why should it be small? The window is the amount of data in each training set. If it is too small you won't have enough data to fit a model. I normally do not use a window as I want to use as much training data as possible. You would have to define what you mean by optimal. It is not obvious how to define optimal in this context.

@jvdp99
Copy link

jvdp99 commented May 19, 2021

Thanks for your fast response once again Rob, greatly appreciated!

The rolling window does indeed work, it seems that I set my value too low which caused the NA's to occur.

To give more context to the window size, my objective is to get my forecast accuracy as high as possible. I am afraid that if I take my window size too large, the model fit will become increasingly good but the future forecasts might get worse. In my case, the definition of an optimal window would be one that optimizes the forecast accuracy.

For context, I am modelling the number of hospital appointments on a weekly basis and my goal is to predict with high accuracy the amount of appointments in the next week(s).

Kind regards

@robjhyndman
Copy link
Author

Increased training data does not lead to overfitting.

@pgbutler
Copy link

Hi Rob,

For these 2 functions:

arima_xreg <- function(x, h, xreg, newxreg) {
forecast(Arima(x, order = c(1, 1, 0), xreg = xreg), xreg = newxreg)
}

tsCV(Y, arima_xreg, xreg = X_all)

The call to arima_xreg in tsCV has a xreg variable but no newxreg variable. Is the tsCV function automatically splitting the xreg data (when populated) in proportion to how Y is split and it automatically sends this split data as xreg, newxreg into the arima_xreg function?

Complicated but clever if so!

Appreciate the help.

@robjhyndman
Copy link
Author

Yes, the data splitting is all handled within tsCV()

@pgbutler
Copy link

Thanks for the quick response!

@markolucic97
Copy link

Hi Rob,

I have 2 data sets, one from Intraday electricity market(dependent variable) and one for day-ahead electricity market(explanatory variable). I also noticed that there is seasonality so I included fourier series to deal with it. To find the appropriate ARIMA model I used the auto.arima function. I have also split my data set into training and test. The training sets contain around 14k observation and test set around 3.5k observations. The results from the code below suggest using regression with ARIMA(5,0,2) errors and Fourier series with K = 7.
bestfit <- list(aicc=Inf) for(i in 1:25) { fit <- auto.arima(elbas_training, xreg=cbind(as.matrix(ts_elspot),fourier(gas, K=i)), seasonal=FALSE) print(i) if(fit$aicc < bestfit$aicc) bestfit <- fit else break; }

My question is the following code. What time series should I include in the tsCV function for "y". Should it be the whole data set(training + test), the time series that I used for training the model or the test set if I want to forecast for the next 3 periods?

`arima_xreg <- function(x, h, xreg, newxreg) {
forecast(Arima(x, order = c(5, 0, 2), xreg = xreg), xreg = newxreg)
}

tsCV(Y, arima_xreg, xreg = X_all, h=3)`

@robjhyndman
Copy link
Author

robjhyndman commented Aug 11, 2021

First, here is how to do it with auto.arima(), to answer the previous question now deleted.

library(forecast)

fc <- function(y, h, xreg, newxreg) {
  fit <- auto.arima(y, xreg=xreg)
  forecast(fit, xreg=newxreg, h=h)
}

y <- ts(rnorm(100))
x <- matrix(ts(rnorm(100)),ncol=1)
tsCV(y, fc, xreg=x, h=1)

@robjhyndman
Copy link
Author

Second, tsCV() does its own training and test splits. If you want to do a specific training and test split, don't use tsCV(). See https://otexts.com/fpp2/accuracy.html

@khmahmood
Copy link

Thanks a lot for your efforts to provide this excellent function. I have tried to apply tscv for multiple regression (one response and a number of predictors, but unfortunately, it didn't work out, I just got "NA". Could you please help me? Thanks in advance.

@robjhyndman
Copy link
Author

Please post a reproducible example at http://rstd.io/forecast-package

@khmahmood
Copy link

Thank you so much for the quick response. Sorry, something went wrong when I tried to use the RStudio Community as this first time to use it. The problem is simply I used your code for the Example with exogenous predictors, but with lm function instead of ARIMA. I got NA.

x1 <- c(23, 43, 42, 65, 20, 23, 20, 16, 14, 72, 12, 14, 22, 14, 98, 11, 20, 22, 24, 44)
X2 <- c(27, 31, 40, 41, 40, 45, 41, 35, 36, 27, 25, 27, 37, 31, 29, 90, 37, 41, 24, 56)
y <- c(13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 66, 44, 76, 88, 99, 55, 33)
X_matrix <- cbind(X1, X2)

fu <- function(x, h, xreg, newxreg) {
forecast(lm(formula,data, xreg = xreg), xreg = newxreg)
}

e<-tsCV(y, fu, xreg = X_matrix)

@robjhyndman
Copy link
Author

For a linear model, use tslm, not lm.

@Niloojan
Copy link

Niloojan commented Jul 1, 2022

Dear Rob,
I am doing multi step ahead ex-ante forecast with a model with explanatory variables, for instance if its a 7 days ahead forecast of Y, I need to have 1 to 7 days forecast of X to be used recursively in 1 to seven days forecast of Y. I use a modified version of tscv that allows to have xreg_actual and xreg_forecast, then I use these codes:

far <- function(x, h, xreg, newxreg) {
forecast(Arima(x, order=c(0,0,0), xreg=xreg), xreg=newxreg)
}

xreg_actual <- data.frame(X1=X1_actual,)
xreg_forecast <- data.frame(X1=X1_onestep, X1=X1_twostep,....., X1=X7_sevenstep)

So xreg_actual includes the actual values of X1 and xreg_forecast includes X1_onestep to X7_sevenstep, which are forecasted values of X1 for the validation and test period, and are obtained from another tscv forecast. What I expect is that, each of the elements of xreg_forecast dataframe to be used in forecasting y from one to seven step ahead. for example if h=7, the tscv starts from h=1 and then recursively reaches to h=7, and I want X1_onestep to be used in h=1, X2_twostep to be used in h=2 and so on....
So, I made the following loop:

for (i in 1:ncol(xreg_forecast) {
error[[i]]<- data.frame(mytsCV(y, far, h=7, window = 365, initial =10, xreg_actual=xreg_actual,
xreg_forecast=xreg_forecast[[i]]))
}

the loop is generating the errors as it should be, however, I am not sure if the loop is atually doing what I aimed to.
Also Im not sure how can I extend the loop for models with more than one explanatory variables.
Any help is highly appriciated.
Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment