This proposal outlines a method for augmenting an autoregressive Transformer (e.g., GPT) with multi-horizon probabilistic priors derived from external Markov models or a similar statistical system. Rather than modifying the architecture, the method adds auxiliary layer-wise losses that align each layer's internal representation with a synthetic embedding derived from the Markov transition probabilities.
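As a minimal sketch of this alignment mechanism, the snippet below assumes the synthetic embedding for a position is the transition-probability-weighted average of next-token embeddings, and that the auxiliary loss is an MSE between a learned projection of the layer's hidden state and that target. The names (`prior_embedding`, `layer_alignment_loss`, `proj`) and the choice of MSE are illustrative assumptions, not part of the proposal itself.

```python
import torch.nn.functional as F

def prior_embedding(transition_probs, token_ids, token_embeddings):
    """Synthetic target embedding per position: the Markov-prior-weighted
    average of next-token embeddings.

    transition_probs: (V, V) row-stochastic Markov transition matrix.
    token_ids:        (B, T) input token ids.
    token_embeddings: (V, D) embedding table (e.g. the model's own).
    """
    # Row of the transition matrix for each observed token: (B, T, V)
    next_dist = transition_probs[token_ids]
    # Expected next-token embedding under the Markov prior: (B, T, D)
    return next_dist @ token_embeddings

def layer_alignment_loss(hidden_state, target_embedding, proj):
    """Auxiliary loss pulling one layer's hidden state toward the prior embedding.

    proj is a learned head (e.g. nn.Linear) from hidden size to embedding size.
    """
    return F.mse_loss(proj(hidden_state), target_embedding)
```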
The goal is to teach the model to exploit this prior knowledge when predicting likely futures at multiple temporal horizons, localizing each horizon's prior to the relevant layers while remaining fully compatible with standard next-token training; a sketch of the combined objective follows.
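The sketch below illustrates one way the multi-horizon priors and the combined objective could look, under two assumptions not specified above: the h-step prior is the matrix power T^h of the one-step transition matrix, and each layer is assigned a single horizon via a `layer_to_horizon` mapping. All function names and the weighting scheme (`aux_weight`) are hypothetical.

```python
import torch.nn.functional as F

def multi_horizon_priors(transition_probs, max_horizon):
    """h-step transition matrices T^h for h = 1..max_horizon (each V x V)."""
    priors, T_h = [], transition_probs
    for _ in range(max_horizon):
        priors.append(T_h)
        T_h = T_h @ transition_probs
    return priors

def combined_loss(logits, labels, hidden_states, projections,
                  priors, token_ids, token_embeddings,
                  layer_to_horizon, aux_weight=0.1):
    """Standard next-token cross-entropy plus layer-wise alignment terms.

    hidden_states:    list of (B, T, D) tensors, one per layer.
    projections:      matching list of learned projection heads.
    layer_to_horizon: maps layer index -> horizon h (1-based).
    """
    # Unmodified next-token objective keeps standard training intact.
    lm_loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    aux = 0.0
    for l, (h_state, proj) in enumerate(zip(hidden_states, projections)):
        h = layer_to_horizon[l]
        # Prior embedding for tokens h steps ahead, under the Markov prior.
        target = priors[h - 1][token_ids] @ token_embeddings  # (B, T, D)
        aux = aux + F.mse_loss(proj(h_state), target)
    return lm_loss + aux_weight * aux
```

Because the auxiliary terms only add per-layer regression targets, the base language-modeling loss and architecture are untouched; setting `aux_weight` to zero recovers ordinary next-token training.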