$$
A' = \begin{bmatrix} \cos(p\phi) & -\sin(p\phi) \\ \sin(p\phi) & \cos(p\phi) \end{bmatrix} \begin{bmatrix} A_{2m} \\ A_{2m+1} \end{bmatrix}, \qquad \phi = B^{-2m/d}
$$

- $\phi$ : sine/cosine frequency
- $p$ : token position
- $B$ : the base constant, 10000 by default
- $m$ : dimension pair index ($2m$ for even and $2m+1$ for odd)
- $d$ : dimension size of the model or one attention head
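A minimal NumPy sketch of this rotation, vectorized over all positions and dimension pairs at once (the function name `rope_rotate` and the array layout are illustrative assumptions, not from the source):

```python
import numpy as np

def rope_rotate(x: np.ndarray, B: float = 10000.0) -> np.ndarray:
    """Apply the rotary position embedding above to x of shape (seq_len, d).

    Each even/odd pair (2m, 2m+1) is rotated by the angle p * phi,
    where phi = B ** (-2m / d) and p is the token position.
    """
    seq_len, d = x.shape
    p = np.arange(seq_len)[:, None]            # token positions, shape (seq_len, 1)
    m = np.arange(d // 2)[None, :]             # pair indices, shape (1, d/2)
    phi = B ** (-2.0 * m / d)                  # one frequency per dimension pair
    angle = p * phi                            # rotation angles, shape (seq_len, d/2)
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]     # split into (2m, 2m+1) pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin  # rotated even dimensions
    out[:, 1::2] = x_even * sin + x_odd * cos  # rotated odd dimensions
    return out
```

Because the angle depends only on $p$ and the pair index, the dot product of two rotated vectors depends only on their relative positions, which is what makes this rotation useful in attention.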
- Q: query matrix of shape (batch_size, seq_len, $d_q$)
- K: key matrix of shape (batch_size, seq_len, $d_k$)
- V: value matrix of shape (batch_size, seq_len, $d_v$)
- M: mask matrix of shape (seq_len, seq_len), 0 for masked positions and 1 for allowed positions
- D: dropout matrix of shape (seq_len, seq_len); with dropout probability p, each element $x_i$ is set to 0 with probability p and kept with probability 1-p, scaled up by 1/(1-p) to compensate for the removed units and preserve the expected sum
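Putting these symbols together, a minimal NumPy sketch of masked scaled dot-product attention with this dropout (the function `masked_attention` and its signature are illustrative assumptions; it takes $d_q = d_k$ so that $QK^\top$ is defined):

```python
import numpy as np

def masked_attention(Q, K, V, M, p_drop=0.1, rng=None):
    """softmax(Q K^T / sqrt(d_k)) with mask M and dropout D, then times V.

    Shapes: Q (batch, seq, d_q), K (batch, seq, d_k), V (batch, seq, d_v),
    M (seq, seq) with 1 = allowed and 0 = masked; requires d_q == d_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    d_k = K.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq)
    scores = np.where(M == 1, scores, -1e9)           # masked positions -> ~0 weight
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Dropout D: each weight is zeroed with probability p_drop,
    # survivors are scaled by 1/(1 - p_drop) to preserve the expected sum.
    D = (rng.random(weights.shape) >= p_drop) / (1.0 - p_drop)
    return (weights * D) @ V                          # (batch, seq, d_v)
```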
- source: Meta's Llama 3.1 paper, "The Llama 3 Herd of Models"
- Benchmark details generated by ChatGPT using the GPT-4o model
| Category | Benchmark | Full Name | Authors/Institution | Description | Example |
|---|---|---|---|---|---|
| Reading Comprehension | SQuAD V2 (2018) | Stanford Question Answering Dataset 2.0 | Pranav Rajpurkar et al., Stanford University | Combines 100,000 questions from SQuAD 1.1 with 50,000 unanswerable questions. | "When were the Normans in Normandy?" Answer: "10th and 11th centuries". |