llm.c-fineweb.md
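
nvidia-smi snapshot of the training box: 4x RTX A6000 (48 GB each), driver 535.129.03, CUDA 12.2. It appears to have been captured mid-run, with all four GPUs at 100% utilization and ~44 GB in use.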

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:05:00.0 Off |                  Off |
| 42%   72C    P2             296W / 300W |  44883MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:06:00.0 Off |                  Off |
| 32%   64C    P2             295W / 300W |  44883MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:07:00.0 Off |                  Off |
| 42%   72C    P2             297W / 300W |  44883MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:08:00.0 Off |                  Off |
| 36%   67C    P2             297W / 300W |  44883MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+


# install miniconda
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh
~/miniconda3/bin/conda init bash
source ~/.bashrc
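
# (optional, not part of the original recipe) quick check that conda and the CUDA toolkit are visible
conda --version
nvcc --version     # should report CUDA 12.x to match the cuDNN package installed below
nvidia-smi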

# pytorch nightly (optional) https://pytorch.org/get-started/locally/
# conda install --yes pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch-nightly -c nvidia

# pip installs so we can tokenize the FineWeb dataset
yes | pip install tqdm tiktoken requests datasets
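
# (optional) verify the tokenizer deps import in the same environment pip installed into
python -c "import tqdm, tiktoken, requests, datasets; print('deps ok')"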

# install cudnn so we can use FlashAttention and run fast (optional)
# https://developer.nvidia.com/cudnn-downloads
# for me: CUDA 12 (check with `nvcc --version`), Linux x86_64, Ubuntu 22.04
wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo dpkg -i cudnn-local-repo-ubuntu2204-9.1.1_1.0-1_amd64.deb
sudo cp /var/cudnn-local-repo-ubuntu2204-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cudnn-cuda-12
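
# (optional) confirm the cuDNN packages actually landed
dpkg -l | grep cudnn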

# "install" cudnn-frontend to ~/
git clone https://github.com/NVIDIA/cudnn-frontend.git

# https://gist.github.com/bigsnarfdude/6ac9e0a2dd320a22cdfb5ad34f3dd2eb

# install MPI (optional, if you intend to use multiple GPUs)
sudo apt install openmpi-bin openmpi-doc libopenmpi-dev
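
# (optional) check that MPI is on PATH
mpirun --version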

# nccl (needed for the multi-GPU build; package names are from NVIDIA's Ubuntu repo)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install libnccl2 libnccl-dev

# tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?)
# writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B
# and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb
git clone https://github.com/karpathy/llm.c.git
cd llm.c
python dev/data/fineweb.py --version 10B
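
# (optional) sanity check: the script writes sharded .bin token files under dev/data/fineweb10B
# (exact file names may vary with the llm.c version)
ls -lh dev/data/fineweb10B | head
du -sh dev/data/fineweb10B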

# compile llm.c (mixed precision, with cuDNN flash-attention)
# first compilation is ~1 minute, mostly due to cuDNN
make train_gpt2cu USE_CUDNN=1
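
# if the cuDNN setup gives you trouble, the plain build (no flash-attention) should also work, just slower:
# make train_gpt2cu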

# train on a single GPU
./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
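
# flag cheat-sheet (my reading of the flags, matched against the config table the run prints further below):
#   -i / -j : train / val data file patterns     -o : output log dir
#   -e d12  : model config (12 layers, ~124M)    -b : micro batch size B, -t : sequence length T
#   -d      : total batch size in tokens         -r : recompute activations (1 = on)
#   -z      : ZeRO stage for optimizer state     -c : weight decay
#   -l      : learning rate                      -q : final LR fraction
#   -u      : warmup iterations                  -n : checkpoint every N steps
#   -v      : eval val loss every N steps        -s : sample/generate every N steps
#   -h 1    : run the HellaSwag eval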

# if you have multiple GPUs (e.g. 8), simply prepend the mpirun command, e.g.:
# mpirun -np 8 ./train_gpt2cu \ ... (the rest of the args stay the same)
# below: the actual run on this box, across all 4 A6000s

mpirun -np 4 ./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 64 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1

val loss 11.010830
allocating 237 MiB for parameter gradients
allocating 2016 MiB for activation gradients
allocating 118 MiB for AdamW optimizer state m
allocating 118 MiB for AdamW optimizer state v
allocating 118 MiB for master copy of params
step    1/18865 | train loss 11.011719 | norm 15.4259 | lr 8.57e-07 | 1769.62 ms | 20.4% A100 fp16 MFU | 296272 tok/s
step    2/18865 | train loss 10.958822 | norm 15.6849 | lr 1.71e-06 | 1764.43 ms | 20.5% A100 fp16 MFU | 297143 tok/s
step    3/18865 | train loss 10.855148 | norm 14.7419 | lr 2.57e-06 | 1758.23 ms | 20.6% A100 fp16 MFU | 297681 tok/s
step    4/18865 | train loss 10.716377 | norm 13.0236 | lr 3.43e-06 | 1762.10 ms | 20.5% A100 fp16 MFU | 297630 tok/s
step    5/18865 | train loss 10.569969 | norm 10.5271 | lr 4.29e-06 | 1765.19 ms | 20.5% A100 fp16 MFU | 297464 tok/s
step    6/18865 | train loss 10.429670 | norm 8.1774 | lr 5.14e-06 | 1759.93 ms | 20.5% A100 fp16 MFU | 297561 tok/s
step    7/18865 | train loss 10.305890 | norm 7.2229 | lr 6.00e-06 | 1763.80 ms | 20.5% A100 fp16 MFU | 297502 tok/s
step    8/18865 | train loss 10.199904 | norm 6.4897 | lr 6.86e-06 | 1759.03 ms | 20.5% A100 fp16 MFU | 297594 tok/s
step    9/18865 | train loss 10.100016 | norm 5.5020 | lr 7.71e-06 | 1763.90 ms | 20.5% A100 fp16 MFU | 297540 tok/s
step   10/18865 | train loss 9.996247 | norm 4.5547 | lr 8.57e-06 | 1761.65 ms | 20.5% A100 fp16 MFU | 297550 tok/s
step   11/18865 | train loss 9.919165 | norm 3.9649 | lr 9.43e-06 | 1762.47 ms | 20.5% A100 fp16 MFU | 297540 tok/s
step   12/18865 | train loss 9.858116 | norm 3.3737 | lr 1.03e-05 | 1764.57 ms | 20.5% A100 fp16 MFU | 297492 tok/s
step   13/18865 | train loss 9.837772 | norm 3.0338 | lr 1.11e-05 | 1762.02 ms | 20.5% A100 fp16 MFU | 297498 tok/s
step   14/18865 | train loss 9.751990 | norm 2.9455 | lr 1.20e-05 | 1763.94 ms | 20.5% A100 fp16 MFU | 297470 tok/s
step   15/18865 | train loss 9.725657 | norm 2.5369 | lr 1.29e-05 | 1762.26 ms | 20.5% A100 fp16 MFU | 297474 tok/s
...
step  745/18865 | train loss 4.826040 | norm 0.8670 | lr 6.00e-04 | 1787.94 ms | 20.2% A100 fp16 MFU | 293293 tok/s
step  746/18865 | train loss 4.752148 | norm 0.7692 | lr 6.00e-04 | 1787.43 ms | 20.2% A100 fp16 MFU | 293294 tok/s
step  747/18865 | train loss 4.724916 | norm 0.7083 | lr 6.00e-04 | 1787.75 ms | 20.2% A100 fp16 MFU | 293293 tok/s
step  748/18865 | train loss 4.771627 | norm 0.7137 | lr 6.00e-04 | 1787.93 ms | 20.2% A100 fp16 MFU | 293290 tok/s
step  749/18865 | train loss 4.737040 | norm 0.9367 | lr 6.00e-04 | 1787.76 ms | 20.2% A100 fp16 MFU | 293289 tok/s
step  750/18865 | train loss 4.713876 | norm 0.9464 | lr 6.00e-04 | 1787.21 ms | 20.2% A100 fp16 MFU | 293292 tok/s
val loss 4.728277
HellaSwag: 2454/10042 = 0.244374
step  751/18865 | train loss 4.728142 | norm 0.7450 | lr 6.00e-04 | 1792.46 ms | 20.2% A100 fp16 MFU | 293252 tok/s
step  752/18865 | train loss 4.717860 | norm 0.7278 | lr 6.00e-04 | 1789.22 ms | 20.2% A100 fp16 MFU | 293241 tok/s
step  753/18865 | train loss 4.682795 | norm 0.8325 | lr 6.00e-04 | 1788.70 ms | 20.2% A100 fp16 MFU | 293235 tok/s
step  754/18865 | train loss 4.803375 | norm 0.7246 | lr 6.00e-04 | 1788.11 ms | 20.2% A100 fp16 MFU | 293233 tok/s
step  755/18865 | train loss 4.926927 | norm 0.7095 | lr 6.00e-04 | 1790.59 ms | 20.2% A100 fp16 MFU | 293212 tok/s
step  756/18865 | train loss 4.723626 | norm 0.7115 | lr 6.00e-04 | 1790.01 ms | 20.2% A100 fp16 MFU | 293196 tok/s
step  757/18865 | train loss 4.791690 | norm 0.7453 | lr 6.00e-04 | 1790.27 ms | 20.2% A100 fp16 MFU | 293179 tok/s
step  758/18865 | train loss 4.775302 | norm 0.8514 | lr 6.00e-04 | 1789.35 ms | 20.2% A100 fp16 MFU | 293170 tok/s
...
step 1496/18865 | train loss 4.063164 | norm 0.4488 | lr 5.97e-04 | 1792.64 ms | 20.2% A100 fp16 MFU | 293118 tok/s
step 1497/18865 | train loss 4.000433 | norm 0.4875 | lr 5.97e-04 | 1788.32 ms | 20.2% A100 fp16 MFU | 293120 tok/s
step 1498/18865 | train loss 3.953322 | norm 0.5040 | lr 5.97e-04 | 1789.49 ms | 20.2% A100 fp16 MFU | 293114 tok/s
step 1499/18865 | train loss 3.958956 | norm 0.5696 | lr 5.97e-04 | 1790.11 ms | 20.2% A100 fp16 MFU | 293102 tok/s
step 1500/18865 | train loss 4.075638 | norm 0.4518 | lr 5.97e-04 | 1789.13 ms | 20.2% A100 fp16 MFU | 293099 tok/s
val loss 4.011482
HellaSwag: 2587/10042 = 0.257618
step 1501/18865 | train loss 3.976398 | norm 0.4597 | lr 5.97e-04 | 1790.77 ms | 20.2% A100 fp16 MFU | 293082 tok/s
step 1502/18865 | train loss 3.988251 | norm 0.4175 | lr 5.97e-04 | 1786.17 ms | 20.2% A100 fp16 MFU | 293105 tok/s
step 1503/18865 | train loss 3.992924 | norm 0.3919 | lr 5.97e-04 | 1787.54 ms | 20.2% A100 fp16 MFU | 293114 tok/s
step 1504/18865 | train loss 3.979673 | norm 0.3806 | lr 5.97e-04 | 1788.48 ms | 20.2% A100 fp16 MFU | 293116 tok/s
...
step 1745/18865 | train loss 3.980585 | norm 0.4336 | lr 5.95e-04 | 1789.78 ms | 20.2% A100 fp16 MFU | 293138 tok/s
step 1746/18865 | train loss 3.940175 | norm 0.4115 | lr 5.95e-04 | 1789.47 ms | 20.2% A100 fp16 MFU | 293130 tok/s
step 1747/18865 | train loss 3.909281 | norm 0.3669 | lr 5.95e-04 | 1787.57 ms | 20.2% A100 fp16 MFU | 293139 tok/s
step 1748/18865 | train loss 3.880344 | norm 0.3804 | lr 5.95e-04 | 1787.91 ms | 20.2% A100 fp16 MFU | 293144 tok/s
step 1749/18865 | train loss 3.918893 | norm 0.3939 | lr 5.95e-04 | 1786.39 ms | 20.2% A100 fp16 MFU | 293161 tok/s
step 1750/18865 | train loss 3.893871 | norm 0.3602 | lr 5.95e-04 | 1788.52 ms | 20.2% A100 fp16 MFU | 293160 tok/s
val loss 3.924333
HellaSwag: 2630/10042 = 0.261900
step 1751/18865 | train loss 3.925871 | norm 0.3626 | lr 5.95e-04 | 1793.60 ms | 20.1% A100 fp16 MFU | 293117 tok/s
step 1752/18865 | train loss 3.919524 | norm 0.3941 | lr 5.95e-04 | 1788.32 ms | 20.2% A100 fp16 MFU | 293120 tok/s
step 1753/18865 | train loss 3.999953 | norm 0.4085 | lr 5.95e-04 | 1788.87 ms | 20.2% A100 fp16 MFU | 293118 tok/s
step 1754/18865 | train loss 4.027669 | norm 0.3918 | lr 5.95e-04 | 1788.79 ms | 20.2% A100 fp16 MFU | 293117 tok/s
step 1755/18865 | train loss 3.896527 | norm 0.3536 | lr 5.95e-04 | 1790.27 ms | 20.2% A100 fp16 MFU | 293104 tok/s
step 1756/18865 | train loss 3.930465 | norm 0.3642 | lr 5.95e-04 | 1789.14 ms | 20.2% A100 fp16 MFU | 293101 tok/s
...
step 18861/18865 | train loss 3.276321 | norm 0.2545 | lr 1.07e-10 | 6091.70 ms | 23.7% A100 fp16 MFU | 86071 tok/s
step 18862/18865 | train loss 3.295762 | norm 0.2116 | lr 7.15e-11 | 6093.72 ms | 23.7% A100 fp16 MFU | 86070 tok/s
step 18863/18865 | train loss 3.283065 | norm 0.2151 | lr 3.58e-11 | 6087.44 ms | 23.7% A100 fp16 MFU | 86073 tok/s
step 18864/18865 | train loss 3.275241 | norm 0.2181 | lr 1.79e-11 | 6094.07 ms | 23.7% A100 fp16 MFU | 86071 tok/s
step 18865/18865 | train loss 3.298655 | norm 0.2109 | lr 0.00e+00 | 6092.45 ms | 23.7% A100 fp16 MFU | 86070 tok/s
val loss 3.296501
HellaSwag: 3013/10042 = 0.300040
generating:
---
The She was blown away in the bar and front bar
View All My Photos
Learn more about her. Read about the director in today's Denon Walters-Lippes' . The incident occurred before her wedding. Very poor judgment on the part of the baker and proof that a woman killed her fiancé
---
Writing model to log124M/model_00018865.bin
Writing state to log124M/state_00018865_00000.bin
total average iteration time: 6086.300854 ms

bigsnarfdude commented May 30, 2024 (same recipe, single GPU: an RTX 4070 Ti SUPER with micro batch size 16):


./train_gpt2cu \
    -i "dev/data/fineweb10B/fineweb_train_*.bin" \
    -j "dev/data/fineweb10B/fineweb_val_*.bin" \
    -o log124M \
    -e "d12" \
    -b 16 -t 1024 \
    -d 524288 \
    -r 1 \
    -z 1 \
    -c 0.1 \
    -l 0.0006 \
    -q 0.0 \
    -u 700 \
    -n 5000 \
    -v 250 -s 20000 \
    -h 1
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/fineweb10B/fineweb_train_*.bin            |
| val data pattern      | dev/data/fineweb10B/fineweb_val_*.bin              |
| output log dir        | log124M                                            |
| checkpoint_every      | 5000                                               |
| resume                | 0                                                  |
| micro batch size B    | 16                                                 |
| sequence length T     | 1024                                               |
| total batch size      | 524288                                             |
| learning rate (LR)    | 6.000000e-04                                       |
| warmup iterations     | 700                                                |
| final LR fraction     | 0.000000e+00                                       |
| weight decay          | 1.000000e-01                                       |
| max_steps             | -1                                                 |
| val_loss_every        | 250                                                |
| val_max_steps         | 20                                                 |
| sample_every          | 20000                                              |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA GeForce RTX 4070 Ti SUPER                   |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| load_filename         | d12                                                |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 18865                                              |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | no                                                 |
+-----------------------+----------------------------------------------------+
| Zero Stage1 is enabled                                                     |
| num_processes         | 1                                                  |
| zero_stage            | 1                                                  |
+-----------------------+----------------------------------------------------+
HellaSwag eval not found at dev/data/hellaswag/hellaswag_val.bin, skipping its evaluation
You can run `python dev/data/hellaswag.py` to export and use it with `-h 1`.
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=16 * seq_len T=1024 * num_processes=1 and total_batch_size=524288
=> setting grad_accum_steps=32
allocating 5758 MiB for activations
val loss 11.016012
allocating 237 MiB for parameter gradients
allocating 120 MiB for activation gradients
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
step    1/18865 | train loss 11.011686 | norm 15.3362 | lr 8.57e-07 | 6317.08 ms | 22.9% A100 fp16 MFU | 82995 tok/s
step    2/18865 | train loss 10.958843 | norm 15.1343 | lr 1.71e-06 | 6095.28 ms | 23.7% A100 fp16 MFU | 86015 tok/s
step    3/18865 | train loss 10.855253 | norm 14.6903 | lr 2.57e-06 | 6094.81 ms | 23.7% A100 fp16 MFU | 86019 tok/s
step    4/18865 | train loss 10.715992 | norm 13.0509 | lr 3.43e-06 | 6095.01 ms | 23.7% A100 fp16 MFU | 86019 tok/s
step    5/18865 | train loss 10.569792 | norm 10.4649 | lr 4.29e-06 | 6094.56 ms | 23.7% A100 fp16 MFU | 86021 tok/s
step    6/18865 | train loss 10.429631 | norm 8.3931 | lr 5.14e-06 | 6094.32 ms | 23.7% A100 fp16 MFU | 86023 tok/s
step    7/18865 | train loss 10.305791 | norm 7.1424 | lr 6.00e-06 | 6092.85 ms | 23.7% A100 fp16 MFU | 86028 tok/s
step    8/18865 | train loss 10.199792 | norm 6.1662 | lr 6.86e-06 | 6093.35 ms | 23.7% A100 fp16 MFU | 86030 tok/s
step    9/18865 | train loss 10.100173 | norm 5.2634 | lr 7.71e-06 | 6092.20 ms | 23.7% A100 fp16 MFU | 86034 tok/s
step   10/18865 | train loss 9.996456 | norm 4.5312 | lr 8.57e-06 | 6094.89 ms | 23.7% A100 fp16 MFU | 86033 tok/s
step   11/18865 | train loss 9.919384 | norm 3.8864 | lr 9.43e-06 | 6092.99 ms | 23.7% A100 fp16 MFU | 86034 tok/s
step   12/18865 | train loss 9.858330 | norm 3.3644 | lr 1.03e-05 | 6092.98 ms | 23.7% A100 fp16 MFU | 86036 tok/s
step   13/18865 | train loss 9.837819 | norm 2.9010 | lr 1.11e-05 | 6093.16 ms | 23.7% A100 fp16 MFU | 86037 tok/s
step   14/18865 | train loss 9.751850 | norm 2.7066 | lr 1.20e-05 | 6093.68 ms | 23.7% A100 fp16 MFU | 86037 tok/s
step   15/18865 | train loss 9.725508 | norm 2.4608 | lr 1.29e-05 | 6093.15 ms | 23.7% A100 fp16 MFU | 86038 tok/s
...
step 7991/18865 | train loss 3.446177 | norm 0.2582 | lr 3.92e-04 | 6089.58 ms | 23.7% A100 fp16 MFU | 86119 tok/s
step 7992/18865 | train loss 3.384956 | norm 0.2484 | lr 3.91e-04 | 6087.13 ms | 23.7% A100 fp16 MFU | 86119 tok/s
step 7993/18865 | train loss 3.498838 | norm 0.2632 | lr 3.91e-04 | 6089.31 ms | 23.7% A100 fp16 MFU | 86118 tok/s
step 7994/18865 | train loss 3.418818 | norm 0.2563 | lr 3.91e-04 | 6088.99 ms | 23.7% A100 fp16 MFU | 86118 tok/s
step 7995/18865 | train loss 3.402076 | norm 0.2621 | lr 3.91e-04 | 6087.39 ms | 23.7% A100 fp16 MFU | 86118 tok/s
step 7996/18865 | train loss 3.441767 | norm 0.2511 | lr 3.91e-04 | 6086.70 ms | 23.7% A100 fp16 MFU | 86119 tok/s
step 7997/18865 | train loss 3.492267 | norm 0.2781 | lr 3.91e-04 | 6088.95 ms | 23.7% A100 fp16 MFU | 86118 tok/s
step 7998/18865 | train loss 3.452062 | norm 0.2811 | lr 3.91e-04 | 6086.69 ms | 23.7% A100 fp16 MFU | 86119 tok/s
step 7999/18865 | train loss 3.473521 | norm 0.2644 | lr 3.91e-04 | 6088.95 ms | 23.7% A100 fp16 MFU | 86119 tok/s
step 8000/18865 | train loss 3.379415 | norm 0.2507 | lr 3.91e-04 | 6086.93 ms | 23.7% A100 fp16 MFU | 86119 tok/s
val loss 3.453447
HellaSwag: 2865/10042 = 0.285302
step 8001/18865 | train loss 3.452341 | norm 0.2759 | lr 3.91e-04 | 6093.24 ms | 23.7% A100 fp16 MFU | 86116 tok/s
step 8002/18865 | train loss 3.526994 | norm 0.2955 | lr 3.91e-04 | 6088.90 ms | 23.7% A100 fp16 MFU | 86115 tok/s
step 8003/18865 | train loss 3.495205 | norm 0.2835 | lr 3.91e-04 | 6089.39 ms | 23.7% A100 fp16 MFU | 86114 tok/s
step 8004/18865 | train loss 3.401451 | norm 0.2680 | lr 3.91e-04 | 6086.20 ms | 23.7% A100 fp16 MFU | 86116 tok/s
step 8005/18865 | train loss 3.365933 | norm 0.2794 | lr 3.91e-04 | 6088.50 ms | 23.7% A100 fp16 MFU | 86115 tok/s
step 8006/18865 | train loss 3.408287 | norm 0.2893 | lr 3.91e-04 | 6087.59 ms | 23.7% A100 fp16 MFU | 86116 tok/s
step 8007/18865 | train loss 3.434451 | norm 0.3275 | lr 3.91e-04 | 6089.67 ms | 23.7% A100 fp16 MFU | 86115 tok/s
step 8008/18865 | train loss 3.482856 | norm 0.3121 | lr 3.91e-04 | 6086.01 ms | 23.7% A100 fp16 MFU | 86116 tok/s
step 8009/18865 | train loss 3.559215 | norm 0.3293 | lr 3.91e-04 | 6089.70 ms | 23.7% A100 fp16 MFU | 86115 tok/s
step 8010/18865 | train loss 3.424921 | norm 0.3412 | lr 3.91e-04 | 6087.94 ms | 23.7% A100 fp16 MFU | 86116 tok/s
step 8011/18865 | train loss 3.361625 | norm 0.3038 | lr 3.91e-04 | 6088.61 ms | 23.7% A100 fp16 MFU | 86115 tok/s
step 8012/18865 | train loss 3.390574 | norm 0.3079 | lr 3.90e-04 | 6087.31 ms | 23.7% A100 fp16 MFU | 86116 tok/s
...
step 18861/18865 | train loss 3.276321 | norm 0.2545 | lr 1.07e-10 | 6091.70 ms | 23.7% A100 fp16 MFU | 86071 tok/s
step 18862/18865 | train loss 3.295762 | norm 0.2116 | lr 7.15e-11 | 6093.72 ms | 23.7% A100 fp16 MFU | 86070 tok/s
step 18863/18865 | train loss 3.283065 | norm 0.2151 | lr 3.58e-11 | 6087.44 ms | 23.7% A100 fp16 MFU | 86073 tok/s
step 18864/18865 | train loss 3.275241 | norm 0.2181 | lr 1.79e-11 | 6094.07 ms | 23.7% A100 fp16 MFU | 86071 tok/s
step 18865/18865 | train loss 3.298655 | norm 0.2109 | lr 0.00e+00 | 6092.45 ms | 23.7% A100 fp16 MFU | 86070 tok/s
val loss 3.296501
HellaSwag: 3013/10042 = 0.300040
generating:
---
The She was blown away in the bar and front bar
View All My Photos
Learn more about her. Read about the director in today's Denon Walters-Lippes' . The incident occurred before her wedding. Very poor judgment on the part of the baker and proof that a woman killed her fiancé
---
Writing model to log124M/model_00018865.bin
Writing state to log124M/state_00018865_00000.bin
total average iteration time: 6086.300854 ms
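
Rough accounting from the logs above (my arithmetic, not printed by the program): with -b 16, -t 1024 and one process, grad_accum_steps = 524288 / (16 * 1024) = 32, matching the log line. Total training tokens = 18865 steps * 524288 tokens/step, roughly 9.9B, i.e. one pass over the FineWeb 10B sample. Wall clock is about 18865 * 6.09 s, roughly 32 hours on the single 4070 Ti SUPER, versus about 18865 * 1.79 s, roughly 9.4 hours on the 4x A6000 box.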
