@jessegrabowski
Last active May 14, 2023 11:46
ARMA-GARCH in PyMC
@mathDR

mathDR commented May 11, 2023

This is great! I have a question though: how would you change the ARMA(1,1)-GARCH(1,1) example to use a mean that is calculated over the full time series?

In particular, think of computing mu as a full vector over all time steps that you then want to pass as a sequence into the scan function (along with the observed values). Do I just pass two sequences and their respective taps?

@mathDR

mathDR commented May 11, 2023

Oh, I guess the way to phrase this is that I know epsilon for all of the time steps and just need to compute sigma from the scan.

@jessegrabowski
Author

Yes, if you wanted to estimate the mean outside the scan (for example with a random walk), you would pass it in as a sequence. You never know epsilon, though; it is by definition unobserved. If you observe something, it's some other exogenous regressor, not epsilon.
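
Very roughly, something like this (just a sketch, not tested; mu here is a placeholder for whatever full-length mean vector you build outside the scan):

    def step(y_t, mu_t, sigma_sq_tm1, epsilon_tm1, ω, α, β):
        # GARCH variance update driven by last period's residual
        σ2 = ω + α * epsilon_tm1 ** 2 + β * sigma_sq_tm1

        # residual of the externally-computed mean, carried forward via outputs_info
        epsilon = y_t - mu_t

        return σ2, epsilon

    [sigmas, epsilons], updates = pytensor.scan(
        fn=step,
        sequences=[data, mu],  # <-- mu goes in as a second sequence alongside the data
        outputs_info=[sigma_sq_init, epsilon_init],
        non_sequences=[omega, alpha, beta],
        strict=True,
        mode=get_mode(None)
    )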

@mathDR

mathDR commented May 11, 2023

Yes, given your formulation, I do not know epsilon. I was following the exposition from the Stan User's Guide here, which basically substitutes your epsilon with the squared difference between the previous return and the mean at the previous time step. This is (I think?) a more general statement than the ARMA(1,1) formulation you derived above, correct?

@jessegrabowski
Author

Their model is the same as my model. I would say mine is more general, because I add a statement about the distribution of the innovations, whereas they tacitly assume the innovations are normally distributed. Anyway, look at how sigma is updated:

  sigma[t] = sqrt(alpha0
                   + alpha1 * pow(r[t - 1] - mu, 2)
                   + beta1 * pow(sigma[t - 1], 2));

In this Stan code, mu is a time-invariant scalar value (exactly the same as in my first example), so r[t - 1] - mu is equal to my epsilon_tm1: at each time step I compute epsilon = y_true - mu (in Stan syntax, y[t] - mu), and scan then passes this value along to the next time step (because it is an outputs_info variable), where it becomes epsilon_tm1.

Note that if mu were time-varying in the Stan example, they would have needed to index it as mu[t - 1] as well, and you'd end up with the epsilon_tm1 from my second example.
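
In scan terms, one way to get that mu[t - 1] indexing is to give both sequences a -1 tap and rebuild the lagged residual inside the step (a sketch only; given the same initial conditions it lands in the same place as carrying epsilon forward with outputs_info):

    def step(y_tm1, mu_tm1, sigma_sq_tm1, ω, α, β):
        # build last period's residual from the lagged taps instead of passing it forward
        epsilon_tm1 = y_tm1 - mu_tm1
        σ2 = ω + α * epsilon_tm1 ** 2 + β * sigma_sq_tm1
        return σ2

    sigmas, updates = pytensor.scan(
        fn=step,
        sequences=[
            {'input': data, 'taps': [-1]},
            {'input': mu, 'taps': [-1]},
        ],
        outputs_info=[sigma_sq_init],
        non_sequences=[omega, alpha, beta],
        strict=True,
    )
    # note: with -1 taps the scan starts at t = 1, so sigmas has one fewer entry than data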

Let's be clear that "epsilon" is a statistical fiction. It's the residual variation in the data that isn't explained by your mean model. It has nothing to do with my formulation (I'm not that special, sadly) -- by definition, you never observe the errors (or "innovations", in the time series literature) in a statistical model.

@mathDR

mathDR commented May 11, 2023

Ah yes, thank you for the clarification. This is much clearer in my mind now. Your second comment is exactly why I said we would "know" the epsilons when we compute a time-varying mu outside of the scan. Philosophically, I agree that you do not observe the error.

Practically though, if I compute

# Heteroscedastic variance
# observed: a vector of length num_dates
# mu: a time-varying estimate of the observed mean (also of length num_dates)
std = np.empty(num_dates)
std[0] = 1.0
for t in range(1, num_dates):
    std[t] = np.sqrt(omega + alpha1 * np.square(observed[t - 1] - mu[t - 1]) + beta1 * np.square(std[t - 1]))

Then std captures the volatility of the model (i.e., the errors that mu does not absorb when regressing on observed).

Does this interpretation make sense?

In particular, if I use some type of logistic regression (or something else) to produce mu, then I would like std to capture the times when my features do not explain observed. To your point, there may be some underlying baseline error that is not captured at all by the features, so that would be a lower bound on the size of std.

Maybe this is an abuse of the GARCH model formulation to think of it like this?

@jessegrabowski
Author

No, you're spot on. I see where we're differing in nomenclature now. The second example is very close to what you're thinking about: some kind of regression model with GARCH errors. The model was ARIMA there, but there's no reason why it couldn't be anything you like. For example, we could have some exogenous indicators and use a linear regression for the mean model:

    def step(*args):
        # It's not necessary to do this args thing, but we will need to pass all of these to 
        # collect_default_updates in the end, so it makes it a bit more compact.
        y_true, X_exog, sigma_sq_tm1, epsilon_tm1, beta_exog, ω, α, β, μ = args
        
        # E[y] model as a linear regression
        μ = μ + X_exog @ beta_exog
        
        # GARCH process
        σ2 = ω + α * epsilon_tm1 ** 2 + β * sigma_sq_tm1        
        
        # y = E[y] + epsilon implies y ~ N(E[y], sigma)
        y = pm.Normal.dist(mu=μ, sigma=pt.sqrt(σ2))
        
        # back out the value of epsilon from y and mu
        epsilon = y_true - μ
        
        return (σ2, epsilon, y), collect_default_updates(args, [y])

    [sigmas, epsilons, y_hat], updates = pytensor.scan(
        fn=step,
        sequences=[data, exog_data], # <-- note the new data. Pay attention to timing though; here I'm saying the exog_data operates on y contemporaneously (i.e. E[y_t] = mu + beta X_t) as opposed to with a lag. That's a modeling choice!
        outputs_info=[sigma_sq_init, epsilon_init, None],
        non_sequences=[beta_exog, omega, alpha, beta, mu], #<-- Note the inclusion of a new vector of parameters for the linear mean model
        strict=True,
        mode = get_mode(None) # use get_mode("JAX") if you try to sample with numpyro
    )

In principle I don't see any reason why you couldn't write anything you want for the mean model, including adding a link function for e.g. logistic regression. You could put a whole neural network there if you were really feeling frisky.
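
For instance (a sketch only; whether this makes sense depends on what scale your observed data lives on), the mean line inside step could become:

        # hypothetical: pass the linear predictor through a logistic link,
        # so E[y] is constrained to (0, 1)
        μ = pt.sigmoid(μ + X_exog @ beta_exog)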

Another practical question I think you're asking is: can I just do the whole mean regression first and then run the GARCH model directly on the residuals of that regression, or must I do everything together inside the scan? There might be an algebraic equivalence between the two; you would have to sit down and look at it carefully. I know there are subtleties in this area; see this note on how to write a SARIMAX regression for some idea of what can "go wrong".

Personally, I recommend doing it the way I've presented, because I know 100% that it's right. But if you experiment and find a way that is faster or clearer, please report back so I can do it too!

@jessegrabowski
Author

jessegrabowski commented May 11, 2023

If you did do the regression first, the mean of the residuals would be zero by construction, so you could omit the mean model from inside the scan entirely and replace mu with 0 everywhere. You wouldn't need the 0th tap for y_true; you could just use the -1 tap as epsilon_tm1 directly. There would be some other subtleties with setting up the likelihood too, because you can't use the residuals from the first regression as observed in the second regression. You would need to provide the actual data as observed, with mu=regression_mu and sigma=GARCH_sigmas. I'm not positive, though; you'll have to play around with it.
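
Schematically it might look like this (untested sketch; regression_mu and garch_sigmas are placeholder names for your first-stage mean and the scan output):

    # first stage: whatever mean model you like, built outside the scan
    regression_mu = intercept + X @ beta_exog  # placeholder mean model

    # second stage: GARCH scan run on (data - regression_mu), with mu = 0 inside the
    # step, producing garch_sigmas

    # likelihood: observe the *raw* data, not the residuals
    lik = pm.Normal("lik", mu=regression_mu, sigma=garch_sigmas, observed=data)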

@mathDR

mathDR commented May 11, 2023

Yes, the idea is to model both the regression and the GARCH sigmas simultaneously. Then the model "ideally" attempts to explain as much of the observed data as possible with the regression, and then (hopefully) uses the GARCH sigma to absorb any unexplained variance.

I have had good success with this type of model (in particular for, say, modeling sales at a retailer): prior to COVID-19, the GARCH sigmas are really small, since the regression has the "right" features to explain the observed data, but during COVID-19 (or at least the start of it) you see a large increase in the GARCH std because the model "doesn't know" how to explain the observed data (the features don't explain it).

I just want to ensure that I am proceeding on sound footing here. (The reasoning behind a lot of this is that I am in the process of converting my Stan model into PyMC, and I want to replicate what I have as loops in Stan as scan in the PyMC code.)

That said, I really appreciate your comments!

@jessegrabowski
Author

jessegrabowski commented May 11, 2023

Ah ok! Yeah it definitely makes sense to have some kind of COVID adjustment. I saw a paper on LinkedIn that did this kind of adjustment in a Bayesian VAR (I dug around for the post but couldn't find it). Basically they just used an indicator variable for COVID/not COVID, then modeled the residuals of that regression with a normal VAR. You would basically be doing the same thing in a GARCH framework by having your exog_data be a COVID indicator?

@mathDR

mathDR commented May 11, 2023

So we had no indicator originally. And COVID is a really simple example (we absolutely knew what was going on), but generically, if you see a large std in certain time regimes, it is a motivator to go see why the std is growing (i.e. why aren't the features capturing the variance?). Then you can apply the COVID indicator (for example) and (hopefully) see the variance diminish during the same time period (when retrained). Then you can be confident that the uncaptured variance was in fact due to COVID (and not a bunch of other things).

@mathDR

mathDR commented May 12, 2023

Alright, sorry to keep pestering you @jessegrabowski, but one last question (maybe):
Here is my implementation, but I don't think things are correct, for two reasons: when I try to run with the JAX backend, it complains about numpy generators, and if I don't run with JAX, it uses a bunch of samplers other than NUTS for several parameters (which should all be sampleable by NUTS, since they are continuous).

Anyway, here is the relevant part of the model (I dropped the part that computes obs_mean):

    # GARCH parameters
    # Initial values for scan
    sigma_sq_init = pm.Exponential('sigma_sq_init', 1)
    epsilon_init = pm.Normal('epsilon_init')

    omega = pm.Uniform('omega', 0, 1)
    alpha1 = pm.Uniform('alpha1', 0, 1)
    beta1 = pm.Uniform('beta1', 0, (1 - alpha1))

    def step(*args):
        epsilon_tm1, sigma_sq_tm1, omega, alpha1, beta1 = args

        # GARCH process
        sig2 = omega + alpha1 * at.square(epsilon_tm1) + beta1 * sigma_sq_tm1
        
        return sig2, collect_default_updates(args, [sig2])

    sigmas, updates = pytensor.scan(
        fn=step,
        sequences=[{'input':at.concatenate([epsilon_init[None], (obs_data-obs_mean)]), 'taps':[-1]}],
        outputs_info=[sigma_sq_init],
        non_sequences=[omega, alpha1, beta1],
        strict=True,
        mode = get_mode(None) # use get_mode("JAX") if you try to sample with numpyro
    )
    sig = pm.Deterministic('sig',at.sqrt(sigmas))
    lik = pm.Normal(name='lik', mu=obs_mean, sigma=sig, observed=obs_data, shape=obs_data.shape)

where obs_data is the observed data and obs_mean is the calculated mean from the model (same shape as obs_data).

Can you see anything amiss here?

@jessegrabowski
Author

The way you've written it, I don't think you need to collect updates in the scan function at all. This is because you're not actually making any random variables inside the scan, so you don't need to do anything special with the underlying random number generator (which is what collect_default_updates handles).
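
That is, a stripped-down version of your step (just a sketch) would be:

    def step(epsilon_tm1, sigma_sq_tm1, omega, alpha1, beta1):
        # purely deterministic GARCH variance update -- no RVs are created here,
        # so there are no RNG updates to collect
        sig2 = omega + alpha1 * at.square(epsilon_tm1) + beta1 * sigma_sq_tm1
        return sig2

    sigmas, updates = pytensor.scan(
        fn=step,
        sequences=[{'input': at.concatenate([epsilon_init[None], (obs_data - obs_mean)]), 'taps': [-1]}],
        outputs_info=[sigma_sq_init],
        non_sequences=[omega, alpha1, beta1],
        strict=True,
        mode=get_mode(None),
    )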
