Yes, if you wanted to estimate the mean outside the scan (for example with a random walk), you would pass it as a sequence. You never know epsilon, though; it is by definition unobserved. If you observe something, it's some other exogenous regressor, not epsilon.
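As a rough sketch of what that could look like (the names here, like `step`, `data`, the initial values, and the GARCH parameters, are placeholders in the spirit of the examples below):

```python
import pymc as pm
import pytensor

with pm.Model():
    # Time-varying mean estimated outside the scan, e.g. as a random walk
    mu_t = pm.GaussianRandomWalk("mu_t", sigma=0.1, shape=n_timesteps)

    # Because mu_t is passed as a sequence, the step function receives mu_t[t]
    # at each iteration, alongside data[t]
    outputs, updates = pytensor.scan(
        fn=step,
        sequences=[data, mu_t],
        outputs_info=[sigma_sq_init, epsilon_init],
        non_sequences=[omega, alpha, beta],
        strict=True,
    )
```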
Yes, given your formulation, I do not know epsilon. I was following the exposition from the Stan User's Guide here, which basically replaces your squared epsilon with the squared difference between the return and the mean at the previous time step. This (I think?) is a more general statement than the ARMA(1,1) formulation you derived above, correct?
Their model is the same as my model. I would say mine is more general because I add a statement about the distribution of the innovations, whereas they tacitly assume they are normally distributed. Anyway, if you look at how sigma is updated:
```stan
sigma[t] = sqrt(alpha0
                + alpha1 * pow(r[t - 1] - mu, 2)
                + beta1 * pow(sigma[t - 1], 2));
```
In this Stan code, `mu` is a time-invariant scalar value (exactly the same as my first example), so `r[t-1] - mu` is equal to my `epsilon_tm1`, because at each time step I compute `epsilon = y_true - mu` (in Stan syntax, `y[t] - mu`), then scan passes this value along to the next time step (because it is an `outputs_info` variable), where it becomes `epsilon_tm1`.

Note that if `mu` were time varying in the Stan example, they would have needed to index it as `mu[t-1]` as well, so you'd end up with the `epsilon_tm1` from my second example.
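Stripped down to just that mechanism, a minimal sketch (names are illustrative):

```python
def step(y_true, sigma_sq_tm1, epsilon_tm1, omega, alpha, beta, mu):
    # GARCH update uses the previous innovation, handed back by scan via outputs_info
    sigma_sq = omega + alpha * epsilon_tm1 ** 2 + beta * sigma_sq_tm1

    # This period's innovation; because it is an outputs_info variable, scan passes
    # it along to the next iteration, where it arrives as epsilon_tm1
    epsilon = y_true - mu
    return sigma_sq, epsilon
```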
Let's be clear that "epsilon" is a statistical fiction. It's the residual variation in the data that isn't explained by your mean model. It has nothing to do with my formulation (I'm not that special, sadly) -- by definition, you never observe the error (or "innovations", in the time series literature) in statistical models.
Ah yes, thank you for the clarification. This is much clearer in my mind now. Your second comment is exactly why I stated we would "know" the epsilons when we compute a time-varying `mu` outside of the scan. Philosophically, I agree that you do not observe the error.
Practically though, if I compute:

```python
import numpy as np

# observed: a vector of length num_dates
# mu: a time-varying estimate of the observed mean (also of length num_dates)
std = np.empty(num_dates)
std[0] = 1.0
for t in range(1, num_dates):
    # Heteroscedastic variance update
    std[t] = np.sqrt(omega + alpha1 * np.square(observed[t - 1] - mu[t - 1])
                     + beta1 * np.square(std[t - 1]))
```
Then `std` is capturing the volatility of the model (i.e. the errors that `mu` is not absorbing when regressing `observed`).
Does this interpretation make sense?
In particular, if I use some type of logistic regression (or something else) to produce `mu`, then I would like `std` to capture those times when my features do not explain `observed`. To your point, there may be some underlying baseline error that is not captured at all by the features, so that would be a lower bound on the size of `std`.
Maybe this is an abuse of the GARCH model formulation to think of it like this?
No, you're spot on. I see where we're differing in nomenclature now. The second example is very close to what you're thinking about: some kind of regression model with GARCH errors. The model there was ARIMA, but there's no reason why it couldn't be anything you like. For example, we could have some exogenous indicators and use a linear regression for the mean model:
```python
def step(*args):
    # It's not necessary to do this *args thing, but we will need to pass all of these to
    # collect_default_updates in the end, so it makes it a bit more compact.
    y_true, X_exog, sigma_sq_tm1, epsilon_tm1, beta_exog, ω, α, β, μ = args

    # E[y] model as a linear regression
    μ = μ + X_exog @ beta_exog

    # GARCH process
    σ2 = ω + α * epsilon_tm1 ** 2 + β * sigma_sq_tm1

    # y = E[y] + epsilon implies y ~ N(E[y], sigma)
    y = pm.Normal.dist(mu=μ, sigma=pt.sqrt(σ2))

    # back out the value of epsilon from y and mu
    epsilon = y_true - μ

    return (σ2, epsilon, y), collect_default_updates(args, [y])


[sigmas, epsilons, y_hat], updates = pytensor.scan(
    fn=step,
    # <-- note the new data. Pay attention to timing though: here I'm saying the exog_data
    # operates on y contemporaneously (i.e. E[y_t] = mu + beta X_t), as opposed to with a
    # lag. That's a modeling choice!
    sequences=[data, exog_data],
    outputs_info=[sigma_sq_init, epsilon_init, None],
    # <-- note the inclusion of a new vector of parameters for the linear mean model
    non_sequences=[beta_exog, omega, alpha, beta, mu],
    strict=True,
    mode=get_mode(None),  # use get_mode("JAX") if you try to sample with numpyro
)
```
In principle I don't see any reason why you couldn't write anything you want for the mean model, including adding a link function for e.g. logistic regression. You could put a whole neural network there if you were really feeling frisky.
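For instance, here's a sketch with a logistic link on the mean; only the mean line changes relative to the step function above (`pm.math.sigmoid` is the link, and everything else is assumed to be set up exactly as before):

```python
def step(*args):
    y_true, X_exog, sigma_sq_tm1, epsilon_tm1, beta_exog, ω, α, β, μ = args

    # Mean model with a logistic link, squashing E[y] into (0, 1)
    mean_t = pm.math.sigmoid(μ + X_exog @ beta_exog)

    # GARCH process, unchanged
    σ2 = ω + α * epsilon_tm1 ** 2 + β * sigma_sq_tm1

    y = pm.Normal.dist(mu=mean_t, sigma=pt.sqrt(σ2))
    epsilon = y_true - mean_t
    return (σ2, epsilon, y), collect_default_updates(args, [y])
```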
Another practical question I think you're asking is: can I just do the whole mean regression first and then run the GARCH model directly on the residuals of that regression, or must I do everything together in steps? There might be an algebraic equivalence between the two; you would have to sit down and carefully look at it. I know that there are subtleties in this area; see this note on how to write a SARIMAX regression for some idea of what can "go wrong".
Personally, I recommend doing it the way I've presented, because I know 100% that it's right. But if you experiment and find a way that is faster or more clear, please report back so I can do it too!
If you did do the regression first, the mean of the residuals will be zero by construction, so you could just omit the mean model from inside the scan entirely and replace `mu` with `0` everywhere. You wouldn't need the 0th tap for `y_true`; you could just use the -1 tap as `epsilon_tm1` directly. There would be some other subtleties with setting up the likelihood too, because you can't use the residuals from the first regression as observed in the second regression. You would need to provide the actual data as observed, with `mu=regression_mu` and `sigma=GARCH_sigmas`. Not positive, you'll have to play around with it.
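If you do experiment with that two-stage route, a sketch of how it might look (`residuals`, `regression_mu`, and `obs_data` are placeholders, and note the off-by-one that the -1 tap introduces):

```python
def step(epsilon_tm1, sigma_sq_tm1, omega, alpha, beta):
    # No mean model inside the scan: the residuals already have mean zero
    return omega + alpha * epsilon_tm1 ** 2 + beta * sigma_sq_tm1


sigmas, updates = pytensor.scan(
    fn=step,
    # residuals[t-1] is epsilon_tm1 directly; no 0th tap for y_true needed
    sequences=[{"input": residuals, "taps": [-1]}],
    outputs_info=[sigma_sq_init],
    non_sequences=[omega, alpha, beta],
    strict=True,
)

# The likelihood is on the actual data, not the residuals. With taps=[-1] the scan
# produces one fewer sigma than there are observations, hence the [1:] slicing.
lik = pm.Normal("lik", mu=regression_mu[1:], sigma=pt.sqrt(sigmas), observed=obs_data[1:])
```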
Yes, the idea is to model both the regression and the GARCH sigmas simultaneously. Then the model "ideally" attempts to explain as much of the observed data as possible with the regression, and then (hopefully) uses the GARCH `sigma` to absorb any unexplained variance.
I have had good success with this type of model (in particular for, say, modeling sales at a retailer): prior to COVID19, the GARCH sigmas are really small, since the regression has the "right" features to explain the observed data, but during COVID19 (or at least the start of it) you see a large increase in the GARCH `std`, because the model "doesn't know" how to explain the observed data (the features don't explain it).
I just want to ensure that I am proceeding on sound footing here. (The reasoning behind a lot of this is that I am in the process of converting my Stan model into PyMC, and want to replicate what I have as loops in Stan with `scan` in the PyMC code.)

That said, I really appreciate your comments!
Ah ok! Yeah, it definitely makes sense to have some kind of COVID adjustment. I saw a paper on LinkedIn that did this kind of adjustment in a Bayesian VAR (I dug around for the post but couldn't find it). Basically they just used an indicator variable for COVID/not COVID, then modeled the residuals of that regression with a normal VAR. You would basically be doing the same thing in a GARCH framework by having your `exog_data` be a COVID indicator?
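As a sketch, the indicator could be built something like this (`dates` and `other_features` are hypothetical, and the cutoff dates are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly date index aligned with the observed data
dates = pd.date_range("2018-01-01", periods=num_dates, freq="W")

# 1.0 during the COVID period, 0.0 otherwise
covid = ((dates >= "2020-03-01") & (dates < "2021-06-01")).astype(float)

# Stack it alongside the other exogenous features before passing it to scan
exog_data = np.column_stack([other_features, covid])
```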
So we had no indicator originally. And COVID is a really simple example (we absolutely knew what was going on), but generically, if you see large `std` in certain time regimes, it is a motivator to go see why the `std` is growing (i.e. why aren't the features capturing the variance?). Then you can apply the COVID indicator (for example) and (hopefully) see the variance diminish during the same time period (when retrained). Then you can be confident that the uncaptured variance was in fact due to COVID (and not a bunch of other things).
Alright, sorry to keep pestering you @jessegrabowski, but one last question (maybe):
Here is my implementation, but I don't think things are correct, for two reasons: when I try to run with the JAX backend, it complains about numpy generators, and if I don't run with JAX, it uses a bunch of samplers other than NUTS for different parameters (which should all be sampleable by NUTS, since they are continuous).

Anyway, here is the relevant part of the model (I dropped the part that computes `obs_mean`):
```python
# GARCH parameters
# Initial values for scan
sigma_sq_init = pm.Exponential('sigma_sq_init', 1)
epsilon_init = pm.Normal('epsilon_init')
omega = pm.Uniform('omega', 0, 1)
alpha1 = pm.Uniform('alpha1', 0, 1)
beta1 = pm.Uniform('beta1', 0, (1 - alpha1))


def step(*args):
    epsilon_tm1, sigma_sq_tm1, omega, alpha1, beta1 = args
    # GARCH process
    sig2 = omega + alpha1 * at.square(epsilon_tm1) + beta1 * sigma_sq_tm1
    return sig2, collect_default_updates(args, [sig2])


sigmas, updates = pytensor.scan(
    fn=step,
    sequences=[{'input': at.concatenate([epsilon_init[None], (obs_data - obs_mean)]), 'taps': [-1]}],
    outputs_info=[sigma_sq_init],
    non_sequences=[omega, alpha1, beta1],
    strict=True,
    mode=get_mode(None),  # use get_mode("JAX") if you try to sample with numpyro
)

sig = pm.Deterministic('sig', at.sqrt(sigmas))
lik = pm.Normal(name='lik', mu=obs_mean, sigma=sig, observed=obs_data, shape=obs_data.shape)
```
where `obs_data` is the observed data and `obs_mean` is the calculated mean from the model (same shape as `obs_data`).
Can you see anything amiss here?
The way you've written it, I don't think you need to collect updates in the scan function at all. This is because you're not actually making any random variables inside the scan, so you don't need to do anything special with the underlying random number generator (which is what `collect_default_updates` does).
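Concretely, a sketch of the simplified version (using the same names as your snippet):

```python
# No random variables are created inside step, so there are no updates to collect
def step(epsilon_tm1, sigma_sq_tm1, omega, alpha1, beta1):
    return omega + alpha1 * at.square(epsilon_tm1) + beta1 * sigma_sq_tm1


sigmas, updates = pytensor.scan(
    fn=step,
    sequences=[{'input': at.concatenate([epsilon_init[None], obs_data - obs_mean]), 'taps': [-1]}],
    outputs_info=[sigma_sq_init],
    non_sequences=[omega, alpha1, beta1],
    strict=True,
)
# updates will just be an empty dict here
```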
Oh, I guess the way to phrase this is that I know epsilon for all of the time steps, and just need to compute sigma from the scan.