Created April 2, 2020 22:23
text: a b c </s> d e f g </s>

Suppose the model is trained with a context length of 4.
Then the most favorable way to evaluate your model's perplexity is:

batch 1: a b c </s>
         |----------| <-- count perplexity of this
batch 2: b c </s> d
                 |-| <-- count perplexity of this
batch 3: c </s> d e
                 |-| <-- count perplexity of this
batch 4: </s> d e f
                 |-| <-- count perplexity of this
batch 5: d e f g
              |-| <-- count perplexity of this
batch 6: e f g </s>
              |----| <-- count perplexity of this
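The windowing above can be sketched as a small helper that enumerates the stride-1 evaluation batches and which positions in each batch are scored. (The function name `sliding_windows` is illustrative, not from the gist; the first window scores every token, and each later window scores only its final token, since the earlier ones were already scored with at least as much context.)

```python
def sliding_windows(tokens, context_len):
    """Enumerate stride-1 evaluation windows.

    Returns (window, scored_positions_within_window) pairs:
    the first window scores all of its tokens; every later
    window scores only its last token.
    """
    windows = []
    # batch 1: score all context_len tokens
    windows.append((tokens[:context_len], list(range(context_len))))
    # batches 2..N: slide by 1 token, score only the final position
    for start in range(1, len(tokens) - context_len + 1):
        window = tokens[start:start + context_len]
        windows.append((window, [context_len - 1]))
    return windows

tokens = ["a", "b", "c", "</s>", "d", "e", "f", "g", "</s>"]
for window, scored in sliding_windows(tokens, 4):
    print(window, "-> score positions", scored)
```

Running this reproduces the six batches in the diagram above.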
To see why this is correct, consider that we typically decompose the probability of a sequence y autoregressively:

p(y) = \prod_i p(y_i | y_{i-1}, ..., y_0)
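With a finite context length k, the exact decomposition above is out of reach; the best achievable approximation truncates each conditioning set to the k-1 most recent tokens:

p(y) \approx \prod_i p(y_i | y_{i-1}, ..., y_{max(0, i-k+1)})

This is exactly what stride-1 (sliding-window) evaluation computes.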
But in practice we often evaluate the model with non-overlapping windows (i.e., a stride equal to the context length), so you get something like:

p(y_7|y_6,y_5,y_4) * p(y_6|y_5,y_4) * p(y_5|y_4) * p(y_4) * p(y_3|y_2,y_1,y_0) * p(y_2|y_1,y_0) * p(y_1|y_0) * p(y_0)
This is unfavorable, since most tokens are scored with less context than the model could use (e.g., y_4 above is scored with no context at all). By using a stride of 1, every token past the first window is conditioned on the full context length, so you'll get lower, more favorable perplexities.
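The gap can be demonstrated numerically. Below is a toy sketch in which a stand-in "model" simply assigns higher probability when it sees more context (purely illustrative; `toy_logprob` is not a real language model and none of these names come from the gist). Chunked evaluation then yields a higher perplexity than stride-1 evaluation on the same tokens:

```python
import math

def toy_logprob(token, context):
    # Toy stand-in for a language model: probability grows with the
    # number of visible context tokens (capped below 1). Illustrative
    # only; a real model would be called here instead.
    return math.log(min(0.9, 0.5 + 0.1 * len(context)))

def chunked_ppl(tokens, context_len):
    # Non-overlapping windows: each token conditions only on the
    # tokens inside its own chunk, so chunk-initial tokens get
    # little or no context.
    lp = 0.0
    for start in range(0, len(tokens), context_len):
        chunk = tokens[start:start + context_len]
        for i, tok in enumerate(chunk):
            lp += toy_logprob(tok, chunk[:i])
    return math.exp(-lp / len(tokens))

def stride1_ppl(tokens, context_len):
    # Sliding window with stride 1: every token past the first chunk
    # conditions on the full context_len - 1 preceding tokens.
    lp = 0.0
    for i, tok in enumerate(tokens):
        ctx = tokens[max(0, i - context_len + 1):i]
        lp += toy_logprob(tok, ctx)
    return math.exp(-lp / len(tokens))

tokens = ["a", "b", "c", "</s>", "d", "e", "f", "g", "</s>"]
print("chunked :", chunked_ppl(tokens, 4))   # higher perplexity
print("stride-1:", stride1_ppl(tokens, 4))   # lower perplexity
```

In a real evaluation the trade-off is compute: stride-1 needs one forward pass per token, whereas chunked evaluation needs one pass per window.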