$$
A' = \begin{bmatrix} \cos(p\phi) & -\sin(p\phi) \\ \sin(p\phi) & \cos(p\phi) \end{bmatrix} \begin{bmatrix} A_{2m} \\ A_{2m+1} \end{bmatrix}, \qquad \phi = B^{-2m/d}
$$

- $\phi$ : sine/cosine frequency
- $p$ : token position
- $B$ : the base constant, 10000 by default
- $m$ : dimension pair index ($2m$ for even and $2m+1$ for odd)
- $d$ : dimension size of the model or one attention head
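A minimal NumPy sketch of this rotation, vectorized over all positions and dimension pairs at once (the function name `rope_rotate` and the array layout are illustrative assumptions, not from the source):

```python
import numpy as np

def rope_rotate(x: np.ndarray, B: float = 10000.0) -> np.ndarray:
    """Apply the rotary position embedding above to x of shape (seq_len, d).

    Each even/odd pair (2m, 2m+1) is rotated by the angle p * phi,
    where phi = B ** (-2m / d) and p is the token position.
    """
    seq_len, d = x.shape
    p = np.arange(seq_len)[:, None]            # token positions, shape (seq_len, 1)
    m = np.arange(d // 2)[None, :]             # pair indices, shape (1, d/2)
    phi = B ** (-2.0 * m / d)                  # one frequency per dimension pair
    angle = p * phi                            # rotation angles, shape (seq_len, d/2)
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]     # split into (2m, 2m+1) pairs
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin  # rotated even dimensions
    out[:, 1::2] = x_even * sin + x_odd * cos  # rotated odd dimensions
    return out
```

Because the angle depends only on $p$ and the pair index, the dot product of two rotated vectors depends only on their relative positions, which is what makes this rotation useful in attention.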
- Q: query matrix of shape (batch_size, seq_len, $d_q$)
- K: key matrix of shape (batch_size, seq_len, $d_k$)
- V: value matrix of shape (batch_size, seq_len, $d_v$)
- M: mask matrix of shape (seq_len, seq_len), 0 for masked positions and 1 for allowed positions
- D: dropout matrix of shape (seq_len, seq_len); with dropout probability p, each element $x_i$ is set to 0 with probability p and kept with probability 1-p, scaled up by 1/(1-p) to compensate for the removed units and preserve the expected sum
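Putting these symbols together, a minimal NumPy sketch of masked scaled dot-product attention with this dropout (the function `masked_attention` and its signature are illustrative assumptions; it takes $d_q = d_k$ so that $QK^\top$ is defined):

```python
import numpy as np

def masked_attention(Q, K, V, M, p_drop=0.1, rng=None):
    """softmax(Q K^T / sqrt(d_k)) with mask M and dropout D, then times V.

    Shapes: Q (batch, seq, d_q), K (batch, seq, d_k), V (batch, seq, d_v),
    M (seq, seq) with 1 = allowed and 0 = masked; requires d_q == d_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    d_k = K.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (batch, seq, seq)
    scores = np.where(M == 1, scores, -1e9)           # masked positions -> ~0 weight
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Dropout D: each weight is zeroed with probability p_drop,
    # survivors are scaled by 1/(1 - p_drop) to preserve the expected sum.
    D = (rng.random(weights.shape) >= p_drop) / (1.0 - p_drop)
    return (weights * D) @ V                          # (batch, seq, d_v)
```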
- source: Meta's Llama 3.1 paper, "The Llama 3 Herd of Models"
- Benchmark details generated by ChatGPT using the GPT-4o model
| Category | Benchmark | Full Name | Authors/Institution | Description | Example |
|---|---|---|---|---|---|
| Reading Comprehension | SQuAD V2 (2018) | Stanford Question Answering Dataset 2.0 | Pranav Rajpurkar et al., Stanford University | Combines 100,000 questions from SQuAD 1.1 with 50,000 unanswerable questions. | "When were the Normans in Normandy?" Answer: "10th and 11th centuries". |