@myleott
Created April 2, 2020 22:23
text: a b c </s> d e f g </s>
Suppose the model is trained with a context length of 4.
Then the most favorable way to evaluate your model's perplexity is:
batch 1: a b c </s>
         |--------| <-- count perplexity of this
batch 2: b c </s> d
                 |-| <-- count perplexity of this
batch 3: c </s> d e
                 |-| <-- count perplexity of this
batch 4: </s> d e f
                 |-| <-- count perplexity of this
batch 5: d e f g
              |-| <-- count perplexity of this
batch 6: e f g </s>
               |--| <-- count perplexity of this
myleott commented Apr 2, 2020

To see why this is correct, consider that we typically decompose the probability of a sequence y autoregressively:

p(y) = \prod_i p(y_i | y_{i-1}, ..., y_0)

But in practice we often evaluate the model over fixed-length windows with a stride equal to the context length (i.e., non-overlapping chunks), so you get something like:

p(y_7|y_6,y_5,y_4) * p(y_6|y_5,y_4) * p(y_5|y_4) * p(y_4) * p(y_3|y_2,y_1,y_0) * p(y_2|y_1,y_0) * p(y_1|y_0) * p(y_0)

This is unfavorable since tokens near the start of each window are predicted with little or no context. With a stride of 1, every token after the first window is conditioned on the full context length, so you'll get more favorable (lower) perplexities.
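
To make the contrast concrete, here is a small self-contained sketch. The `log_prob(token, context)` call is a hypothetical stand-in for querying the real model (the toy definition below exists only so the snippet runs); it compares the non-overlapping-chunk context policy against the stride-1 policy:

```python
import math

def perplexity(tokens, log_prob, context_fn):
    """exp of the average negative log-likelihood, scoring each token exactly once.
    `log_prob(token, context)` is a hypothetical stand-in for a real model call
    returning log p(token | context); `context_fn(i)` picks token i's context."""
    nll = -sum(log_prob(tok, context_fn(i)) for i, tok in enumerate(tokens))
    return math.exp(nll / len(tokens))

tokens = ["a", "b", "c", "</s>", "d", "e", "f", "g", "</s>"]
L = 4  # context length

def chunked_context(i):
    # Non-overlapping windows (stride = L): context resets at every chunk
    # boundary, so e.g. "d" at position 4 is predicted with no context at all.
    return tokens[(i // L) * L : i]

def stride1_context(i):
    # Stride 1: every token beyond the first window is conditioned on the
    # full L - 1 preceding tokens.
    return tokens[max(0, i - L + 1) : i]

def toy_log_prob(token, context):
    # Toy stand-in, purely so the sketch runs: pretend longer context always
    # helps a little. A real evaluation would query the trained model here.
    return math.log(0.1 + 0.02 * len(context))

print(perplexity(tokens, toy_log_prob, chunked_context))   # higher (worse)
print(perplexity(tokens, toy_log_prob, stride1_context))   # lower (better)
```

The trade-off is cost: stride 1 needs one forward pass per token, whereas non-overlapping chunks need only one pass per context_length tokens, which is typically why the chunked setup is used in the first place.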
