train_qinco.py training log (gist by @mdouze, November 28, 2023)
$ python -u train_qinco.py --db bigann1B --M 8 --L 2 --h 256 --lr 0.001 --ngpu 4 --model models/test_model.pt
args: Namespace(todo=['train_rq', 'train', 'train_ivf'], db='bigann1B', training_data='', nt=500000, nval=10000, db_scale=-1, ivf=False, M=8, L=2, K=256, h=256, rq_beam_size=1, ngpu=4, lr=0.001, max_epochs=1000, batch_size=1024, RQ_filename='', IVF_filename='', model='models/test_model.pt', checkpoint='')
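For reference, a minimal argparse sketch that would produce a Namespace like the one above; the flag names and values are read off the log, while the types and defaults are assumptions:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--db", default="bigann1B")        # dataset to train on
    parser.add_argument("--M", type=int, default=8)        # codes per vector (quantization steps)
    parser.add_argument("--L", type=int, default=2)        # residual blocks per step
    parser.add_argument("--K", type=int, default=256)      # codebook size per step
    parser.add_argument("--h", type=int, default=256)      # hidden dim of the residual blocks
    parser.add_argument("--lr", type=float, default=1e-3)  # initial learning rate
    parser.add_argument("--ngpu", type=int, default=4)     # GPUs used for data-parallel training
    parser.add_argument("--max_epochs", type=int, default=1000)
    parser.add_argument("--batch_size", type=int, default=1024)
    parser.add_argument("--model", default="")             # path for the best checkpoint
    args = parser.parse_args()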
nb processors 80
model name : Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Tue Nov 28 07:57:28 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           On  | 00000000:06:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0              43W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           On  | 00000000:0A:00.0 Off |                    0 |
| N/A   33C    P0              42W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           On  | 00000000:85:00.0 Off |                    0 |
| N/A   31C    P0              43W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Loading dataset bigann1B
dataset in dimension 128, with metric L2, size: Q 10000 B 1000000000 T 100000000
Training set: (500000, 128), validation: (10000, 128)
====================== residual quantizer training
training RQ 8x8, beam_size=1
[14.87 s] training done
train set MSE=26038 validation MSE=26406.3
RQ centroids size (8, 256, 128)
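The RQ baseline above (8 codebooks of 256 centroids each, greedy encoding) is standard residual quantization; a sketch of how to reproduce it with Faiss directly, assuming xt and xval hold the (500000, 128) training and (10000, 128) validation float32 arrays:

    import faiss

    d, M, nbits = 128, 8, 8                  # 8 codebooks of 2**8 = 256 centroids each
    rq = faiss.ResidualQuantizer(d, M, nbits)
    rq.max_beam_size = 1                     # greedy encoding, matching beam_size=1
    rq.train(xt)                             # train on the 500k training vectors
    codes = rq.compute_codes(xval)           # 8 bytes per vector
    recons = rq.decode(codes)
    mse = ((xval - recons) ** 2).sum(axis=1).mean()  # compare with the validation MSE above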
====================== training
Initializing model from RQ
QINCo(
  (codebook0): Embedding(256, 128)
  (step1): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step2): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step3): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step4): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step5): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step6): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
  (step7): QINCoStep(
    (codebook): Embedding(256, 128)
    (MLPconcat): Linear(in_features=256, out_features=128, bias=True)
    (residual_block0): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
    (residual_block1): Sequential(
      (0): Linear(in_features=128, out_features=256, bias=False)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=128, bias=False)
    )
  )
)
nb trainable parameters 1409920
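The printed structure pins down the shapes (and the parameter count checks out: 256*128 + 7*(256*128 + 256*128+128 + 2*2*128*256) = 1409920) but not the wiring. Below is a PyTorch sketch of one plausible QINCoStep forward pass; the concatenation order and the residual connections are assumptions, not read from train_qinco.py:

    import torch
    from torch import nn

    class QINCoStep(nn.Module):
        """One decoding step: refine the partial reconstruction given a code."""

        def __init__(self, d=128, K=256, h=256, L=2):
            super().__init__()
            self.codebook = nn.Embedding(K, d)
            # input: codeword and current reconstruction concatenated, 2*d = 256
            self.MLPconcat = nn.Linear(2 * d, d)
            for i in range(L):  # named to match the printed repr
                self.add_module(f"residual_block{i}", nn.Sequential(
                    nn.Linear(d, h, bias=False),
                    nn.ReLU(),
                    nn.Linear(h, d, bias=False),
                ))
            self.L = L

        def forward(self, xhat, code):
            # condition the fixed codeword on what has been reconstructed so far
            c = self.MLPconcat(torch.cat([self.codebook(code), xhat], dim=-1))
            for i in range(self.L):
                c = c + getattr(self, f"residual_block{i}")(c)  # residual connection (assumed)
            return xhat + c  # updated reconstruction after this step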
Setting scaling factor to 246.0
Running on 4 GPUs
Start train_job rank=0
Setting up distributed data parallel bs=1024
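"Distributed data parallel" here is presumably torch.nn.parallel.DistributedDataParallel with one process per GPU; a minimal per-rank setup sketch under single-node assumptions (the actual launcher in train_qinco.py may differ):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(rank, world_size, model):
        # assumes MASTER_ADDR / MASTER_PORT are already set in the environment
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # each of the 4 ranks then trains on its own shard with batch size 1024
        return DDP(model.cuda(rank), device_ids=[rank])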
Before optimization: val MSE=568655
[3.18 s] epoch 0 lr=0.001
End of epoch 0 train loss 0.0474199 val MSE=24048.6
Best validation loss so far, storing models/test_model.pt
[21.16 s] epoch 1 lr=0.001
End of epoch 1 train loss 0.0418424 val MSE=22044.5
Best validation loss so far, storing models/test_model.pt
[38.31 s] epoch 2 lr=0.001
End of epoch 2 train loss 0.0403962 val MSE=20648.4
Best validation loss so far, storing models/test_model.pt
[55.48 s] epoch 3 lr=0.001
End of epoch 3 train loss 0.039486 val MSE=19536.5
Best validation loss so far, storing models/test_model.pt
[72.58 s] epoch 4 lr=0.001
End of epoch 4 train loss 0.0388234 val MSE=18717.2
Best validation loss so far, storing models/test_model.pt
[89.75 s] epoch 5 lr=0.001
End of epoch 5 train loss 0.038346 val MSE=18023.5
Best validation loss so far, storing models/test_model.pt
[106.88 s] epoch 6 lr=0.001
End of epoch 6 train loss 0.0379748 val MSE=17650.8
Best validation loss so far, storing models/test_model.pt
[124.07 s] epoch 7 lr=0.001
End of epoch 7 train loss 0.0376687 val MSE=17181.9
Best validation loss so far, storing models/test_model.pt
[141.17 s] epoch 8 lr=0.001
End of epoch 8 train loss 0.0374314 val MSE=16818.1
Best validation loss so far, storing models/test_model.pt
[158.35 s] epoch 9 lr=0.001
End of epoch 9 train loss 0.0372263 val MSE=16460.2
Best validation loss so far, storing models/test_model.pt
[175.42 s] epoch 10 lr=0.001
End of epoch 10 train loss 0.0370482 val MSE=16194.4
Best validation loss so far, storing models/test_model.pt
[192.62 s] epoch 11 lr=0.001
End of epoch 11 train loss 0.0369036 val MSE=16079
Best validation loss so far, storing models/test_model.pt
[209.69 s] epoch 12 lr=0.001
End of epoch 12 train loss 0.036769 val MSE=15892.3
Best validation loss so far, storing models/test_model.pt
[226.88 s] epoch 13 lr=0.001
End of epoch 13 train loss 0.0366591 val MSE=15725.9
Best validation loss so far, storing models/test_model.pt
[243.97 s] epoch 14 lr=0.001
End of epoch 14 train loss 0.0365515 val MSE=15519.1
Best validation loss so far, storing models/test_model.pt
[261.08 s] epoch 15 lr=0.001
End of epoch 15 train loss 0.0364743 val MSE=15442.4
Best validation loss so far, storing models/test_model.pt
[278.22 s] epoch 16 lr=0.001
End of epoch 16 train loss 0.0363938 val MSE=15313.2
Best validation loss so far, storing models/test_model.pt
[295.33 s] epoch 17 lr=0.001
End of epoch 17 train loss 0.0363131 val MSE=15185.6
Best validation loss so far, storing models/test_model.pt
[312.43 s] epoch 18 lr=0.001
End of epoch 18 train loss 0.0362477 val MSE=15110.6
Best validation loss so far, storing models/test_model.pt
[329.55 s] epoch 19 lr=0.001
End of epoch 19 train loss 0.0361872 val MSE=15137.5
[346.63 s] epoch 20 lr=0.001
End of epoch 20 train loss 0.0361288 val MSE=14985.2
Best validation loss so far, storing models/test_model.pt
[363.72 s] epoch 21 lr=0.001
End of epoch 21 train loss 0.0360852 val MSE=14950.1
Best validation loss so far, storing models/test_model.pt
[380.86 s] epoch 22 lr=0.001
End of epoch 22 train loss 0.036039 val MSE=14874.6
Best validation loss so far, storing models/test_model.pt
[398.00 s] epoch 23 lr=0.001
End of epoch 23 train loss 0.0359996 val MSE=14845.6
Best validation loss so far, storing models/test_model.pt
[415.11 s] epoch 24 lr=0.001
End of epoch 24 train loss 0.0359464 val MSE=14748.7
Best validation loss so far, storing models/test_model.pt
[432.20 s] epoch 25 lr=0.001
End of epoch 25 train loss 0.0359245 val MSE=14725.5
Best validation loss so far, storing models/test_model.pt
[449.32 s] epoch 26 lr=0.001
End of epoch 26 train loss 0.0358768 val MSE=14605.4
Best validation loss so far, storing models/test_model.pt
[466.45 s] epoch 27 lr=0.001
End of epoch 27 train loss 0.0358431 val MSE=14694.2
[483.48 s] epoch 28 lr=0.001
End of epoch 28 train loss 0.0358138 val MSE=14610.1
[500.53 s] epoch 29 lr=0.001
End of epoch 29 train loss 0.0357878 val MSE=14601
Best validation loss so far, storing models/test_model.pt
[517.61 s] epoch 30 lr=0.001
End of epoch 30 train loss 0.0357512 val MSE=14527.5
Best validation loss so far, storing models/test_model.pt
[534.74 s] epoch 31 lr=0.001
End of epoch 31 train loss 0.0357195 val MSE=14466.7
Best validation loss so far, storing models/test_model.pt
[551.82 s] epoch 32 lr=0.001
End of epoch 32 train loss 0.0356953 val MSE=14502.5
[568.88 s] epoch 33 lr=0.001
End of epoch 33 train loss 0.035675 val MSE=14487.2
[585.94 s] epoch 34 lr=0.001
End of epoch 34 train loss 0.035653 val MSE=14484.8
[603.02 s] epoch 35 lr=0.001
End of epoch 35 train loss 0.0356293 val MSE=14475
[620.05 s] epoch 36 lr=0.001
End of epoch 36 train loss 0.0356023 val MSE=14385.8
Best validation loss so far, storing models/test_model.pt
[637.15 s] epoch 37 lr=0.001
End of epoch 37 train loss 0.035583 val MSE=14360.3
Best validation loss so far, storing models/test_model.pt
[654.26 s] epoch 38 lr=0.001
End of epoch 38 train loss 0.0355544 val MSE=14337.7
Best validation loss so far, storing models/test_model.pt
[671.37 s] epoch 39 lr=0.001
End of epoch 39 train loss 0.0355401 val MSE=14360.6
[688.43 s] epoch 40 lr=0.001
End of epoch 40 train loss 0.0355217 val MSE=14288.9
Best validation loss so far, storing models/test_model.pt
[705.53 s] epoch 41 lr=0.001
End of epoch 41 train loss 0.0355137 val MSE=14323.7
[722.58 s] epoch 42 lr=0.001
End of epoch 42 train loss 0.0354883 val MSE=14270.8
Best validation loss so far, storing models/test_model.pt
[739.69 s] epoch 43 lr=0.001
End of epoch 43 train loss 0.0354726 val MSE=14311.4
[756.79 s] epoch 44 lr=0.001
End of epoch 44 train loss 0.0354563 val MSE=14178.6
Best validation loss so far, storing models/test_model.pt
[773.98 s] epoch 45 lr=0.001
End of epoch 45 train loss 0.035442 val MSE=14214.2
[791.10 s] epoch 46 lr=0.001
End of epoch 46 train loss 0.0354202 val MSE=14211
[808.15 s] epoch 47 lr=0.001
End of epoch 47 train loss 0.0354138 val MSE=14179.1
[825.22 s] epoch 48 lr=0.001
End of epoch 48 train loss 0.0353981 val MSE=14136.5
Best validation loss so far, storing models/test_model.pt
[842.29 s] epoch 49 lr=0.001
End of epoch 49 train loss 0.0353856 val MSE=14136.8
[859.41 s] epoch 50 lr=0.001
End of epoch 50 train loss 0.0353724 val MSE=14186.3
[876.48 s] epoch 51 lr=0.001
End of epoch 51 train loss 0.0353531 val MSE=14086.1
Best validation loss so far, storing models/test_model.pt
[893.58 s] epoch 52 lr=0.001
End of epoch 52 train loss 0.0353404 val MSE=14076.8
Best validation loss so far, storing models/test_model.pt
[910.65 s] epoch 53 lr=0.001
End of epoch 53 train loss 0.0353271 val MSE=14112.9
[927.76 s] epoch 54 lr=0.001
End of epoch 54 train loss 0.0353135 val MSE=14077.5
[944.81 s] epoch 55 lr=0.001
End of epoch 55 train loss 0.0353028 val MSE=14129.4
[961.92 s] epoch 56 lr=0.001
End of epoch 56 train loss 0.035289 val MSE=14139.3
[979.33 s] epoch 57 lr=0.001
End of epoch 57 train loss 0.0352794 val MSE=14057.3
Best validation loss so far, storing models/test_model.pt
[996.84 s] epoch 58 lr=0.001
End of epoch 58 train loss 0.035267 val MSE=14141.5
[1013.77 s] epoch 59 lr=0.001
End of epoch 59 train loss 0.0352646 val MSE=14043.6
Best validation loss so far, storing models/test_model.pt
[1030.71 s] epoch 60 lr=0.001
End of epoch 60 train loss 0.0352546 val MSE=14031.1
Best validation loss so far, storing models/test_model.pt
[1047.77 s] epoch 61 lr=0.001
End of epoch 61 train loss 0.0352413 val MSE=14035.5
[1064.96 s] epoch 62 lr=0.001
End of epoch 62 train loss 0.0352205 val MSE=14077.1
[1081.92 s] epoch 63 lr=0.001
End of epoch 63 train loss 0.0352137 val MSE=14115.9
[1098.95 s] epoch 64 lr=0.001
End of epoch 64 train loss 0.0352076 val MSE=14017.5
Best validation loss so far, storing models/test_model.pt
[1116.44 s] epoch 65 lr=0.001
End of epoch 65 train loss 0.0352072 val MSE=13964.2
Best validation loss so far, storing models/test_model.pt
[1133.42 s] epoch 66 lr=0.001
End of epoch 66 train loss 0.0351961 val MSE=14037.7
[1150.38 s] epoch 67 lr=0.001
End of epoch 67 train loss 0.035179 val MSE=13938
Best validation loss so far, storing models/test_model.pt
[1167.38 s] epoch 68 lr=0.001
End of epoch 68 train loss 0.0351727 val MSE=14010.6
[1184.37 s] epoch 69 lr=0.001
End of epoch 69 train loss 0.0351638 val MSE=13925.7
Best validation loss so far, storing models/test_model.pt
[1201.35 s] epoch 70 lr=0.001
End of epoch 70 train loss 0.0351474 val MSE=13972.8
[1218.37 s] epoch 71 lr=0.001
End of epoch 71 train loss 0.0351449 val MSE=13949.5
[1235.34 s] epoch 72 lr=0.001
End of epoch 72 train loss 0.0351377 val MSE=14018
[1252.32 s] epoch 73 lr=0.001
End of epoch 73 train loss 0.0351269 val MSE=13963.5
[1269.25 s] epoch 74 lr=0.001
End of epoch 74 train loss 0.0351198 val MSE=13973.7
[1286.21 s] epoch 75 lr=0.001
End of epoch 75 train loss 0.0351066 val MSE=13872.6
Best validation loss so far, storing models/test_model.pt
[1303.22 s] epoch 76 lr=0.001
End of epoch 76 train loss 0.0351025 val MSE=13919.4
[1320.24 s] epoch 77 lr=0.001
End of epoch 77 train loss 0.0350998 val MSE=13896.3
[1337.40 s] epoch 78 lr=0.001
End of epoch 78 train loss 0.0350904 val MSE=13911.9
[1354.56 s] epoch 79 lr=0.001
End of epoch 79 train loss 0.035081 val MSE=13940.7
[1371.65 s] epoch 80 lr=0.001
End of epoch 80 train loss 0.0350802 val MSE=13912.2
[1389.01 s] epoch 81 lr=0.001
End of epoch 81 train loss 0.03507 val MSE=13861.8
Best validation loss so far, storing models/test_model.pt
[1405.98 s] epoch 82 lr=0.001
End of epoch 82 train loss 0.0350648 val MSE=13946.6
[1423.00 s] epoch 83 lr=0.001
End of epoch 83 train loss 0.0350582 val MSE=13929.8
[1440.05 s] epoch 84 lr=0.001
End of epoch 84 train loss 0.0350461 val MSE=13899.6
[1457.10 s] epoch 85 lr=0.001
End of epoch 85 train loss 0.0350462 val MSE=13883.6
[1474.14 s] epoch 86 lr=0.001
End of epoch 86 train loss 0.0350334 val MSE=13848.7
Best validation loss so far, storing models/test_model.pt
[1491.24 s] epoch 87 lr=0.001
End of epoch 87 train loss 0.0350324 val MSE=13918.4
[1508.28 s] epoch 88 lr=0.001
End of epoch 88 train loss 0.0350158 val MSE=13880.4
[1525.29 s] epoch 89 lr=0.001
End of epoch 89 train loss 0.0350155 val MSE=13932.9
[1542.32 s] epoch 90 lr=0.001
End of epoch 90 train loss 0.0350152 val MSE=13881.5
[1559.41 s] epoch 91 lr=0.001
End of epoch 91 train loss 0.0350087 val MSE=13903.9
[1576.38 s] epoch 92 lr=0.001
End of epoch 92 train loss 0.0350062 val MSE=13822.7
Best validation loss so far, storing models/test_model.pt
[1593.41 s] epoch 93 lr=0.001
End of epoch 93 train loss 0.0350026 val MSE=13866.3
[1610.34 s] epoch 94 lr=0.001
End of epoch 94 train loss 0.034997 val MSE=13850.8
[1627.40 s] epoch 95 lr=0.001
End of epoch 95 train loss 0.03499 val MSE=13786.5
Best validation loss so far, storing models/test_model.pt
[1644.37 s] epoch 96 lr=0.001
End of epoch 96 train loss 0.0349817 val MSE=13861
[1661.36 s] epoch 97 lr=0.001
End of epoch 97 train loss 0.0349857 val MSE=13849.2
[1678.29 s] epoch 98 lr=0.001
End of epoch 98 train loss 0.0349745 val MSE=13825.3
[1695.30 s] epoch 99 lr=0.001
End of epoch 99 train loss 0.0349738 val MSE=13823.2
[1712.24 s] epoch 100 lr=0.001
End of epoch 100 train loss 0.0349657 val MSE=13851.2
[1729.18 s] epoch 101 lr=0.001
End of epoch 101 train loss 0.0349583 val MSE=13892
[1746.11 s] epoch 102 lr=0.001
End of epoch 102 train loss 0.0349473 val MSE=13819.2
[1763.09 s] epoch 103 lr=0.001
End of epoch 103 train loss 0.0349428 val MSE=13819.1
[1780.05 s] epoch 104 lr=0.001
End of epoch 104 train loss 0.0349477 val MSE=13811.5
[1796.98 s] epoch 105 lr=0.001
End of epoch 105 train loss 0.0349389 val MSE=13755.3
Best validation loss so far, storing models/test_model.pt
[1813.96 s] epoch 106 lr=0.001
End of epoch 106 train loss 0.0349356 val MSE=13820.1
[1830.95 s] epoch 107 lr=0.001
End of epoch 107 train loss 0.0349218 val MSE=13838.6
[1847.92 s] epoch 108 lr=0.001
End of epoch 108 train loss 0.0349209 val MSE=13812
[1865.34 s] epoch 109 lr=0.001
End of epoch 109 train loss 0.0349153 val MSE=13732.1
Best validation loss so far, storing models/test_model.pt
[1882.79 s] epoch 110 lr=0.001
End of epoch 110 train loss 0.0349254 val MSE=13757.6
[1900.19 s] epoch 111 lr=0.001
End of epoch 111 train loss 0.034914 val MSE=13773.8
[1917.32 s] epoch 112 lr=0.001
End of epoch 112 train loss 0.0349089 val MSE=13829.4
[1934.26 s] epoch 113 lr=0.001
End of epoch 113 train loss 0.0349052 val MSE=13823.3
[1951.24 s] epoch 114 lr=0.001
End of epoch 114 train loss 0.0348975 val MSE=13791
[1968.27 s] epoch 115 lr=0.001
End of epoch 115 train loss 0.0348917 val MSE=13761.4
[1985.36 s] epoch 116 lr=0.001
End of epoch 116 train loss 0.0348968 val MSE=13797.3
[2002.41 s] epoch 117 lr=0.001
End of epoch 117 train loss 0.0348868 val MSE=13734.2
[2019.48 s] epoch 118 lr=0.001
End of epoch 118 train loss 0.0348782 val MSE=13688
Best validation loss so far, storing models/test_model.pt
[2036.54 s] epoch 119 lr=0.001
End of epoch 119 train loss 0.0348779 val MSE=13720.6
[2053.64 s] epoch 120 lr=0.001
End of epoch 120 train loss 0.0348834 val MSE=13757
[2070.69 s] epoch 121 lr=0.001
End of epoch 121 train loss 0.0348723 val MSE=13733.7
[2087.76 s] epoch 122 lr=0.001
End of epoch 122 train loss 0.0348692 val MSE=13845.6
[2104.81 s] epoch 123 lr=0.001
End of epoch 123 train loss 0.0348614 val MSE=13707.3
[2121.89 s] epoch 124 lr=0.001
End of epoch 124 train loss 0.0348586 val MSE=13777.7
[2138.96 s] epoch 125 lr=0.001
End of epoch 125 train loss 0.0348534 val MSE=13762
[2155.99 s] epoch 126 lr=0.001
End of epoch 126 train loss 0.0348556 val MSE=13764
[2173.03 s] epoch 127 lr=0.001
End of epoch 127 train loss 0.0348563 val MSE=13759.1
[2190.12 s] epoch 128 lr=0.001
End of epoch 128 train loss 0.0348509 val MSE=13698.4
Val loss did not improve for 10 epochs, reduce LR
[2207.17 s] epoch 129 lr=0.0001
End of epoch 129 train loss 0.0339622 val MSE=12991.3
Best validation loss so far, storing models/test_model.pt
[2224.25 s] epoch 130 lr=0.0001
End of epoch 130 train loss 0.0337109 val MSE=12915.7
Best validation loss so far, storing models/test_model.pt
[2241.32 s] epoch 131 lr=0.0001
End of epoch 131 train loss 0.033613 val MSE=12893
Best validation loss so far, storing models/test_model.pt
[2258.43 s] epoch 132 lr=0.0001
End of epoch 132 train loss 0.0335497 val MSE=12912.7
[2275.48 s] epoch 133 lr=0.0001
End of epoch 133 train loss 0.0335045 val MSE=12925.9
[2292.52 s] epoch 134 lr=0.0001
End of epoch 134 train loss 0.0334663 val MSE=12902.4
[2309.56 s] epoch 135 lr=0.0001
End of epoch 135 train loss 0.0334346 val MSE=12931.5
[2326.65 s] epoch 136 lr=0.0001
End of epoch 136 train loss 0.0334061 val MSE=12904.1
[2343.70 s] epoch 137 lr=0.0001
End of epoch 137 train loss 0.0333828 val MSE=12890.6
Best validation loss so far, storing models/test_model.pt
[2360.77 s] epoch 138 lr=0.0001
End of epoch 138 train loss 0.0333611 val MSE=12885.1
Best validation loss so far, storing models/test_model.pt
[2377.86 s] epoch 139 lr=0.0001
End of epoch 139 train loss 0.0333408 val MSE=12878.6
Best validation loss so far, storing models/test_model.pt
[2394.97 s] epoch 140 lr=0.0001
End of epoch 140 train loss 0.0333246 val MSE=12870.6
Best validation loss so far, storing models/test_model.pt
[2412.07 s] epoch 141 lr=0.0001
End of epoch 141 train loss 0.0333066 val MSE=12877.2
[2429.12 s] epoch 142 lr=0.0001
End of epoch 142 train loss 0.0332918 val MSE=12857.1
Best validation loss so far, storing models/test_model.pt
[2446.22 s] epoch 143 lr=0.0001
End of epoch 143 train loss 0.0332767 val MSE=12872.6
[2463.31 s] epoch 144 lr=0.0001
End of epoch 144 train loss 0.0332651 val MSE=12884.6
[2480.35 s] epoch 145 lr=0.0001
End of epoch 145 train loss 0.0332513 val MSE=12864.2
[2497.42 s] epoch 146 lr=0.0001
End of epoch 146 train loss 0.0332435 val MSE=12886.3
[2514.57 s] epoch 147 lr=0.0001
End of epoch 147 train loss 0.0332279 val MSE=12898.8
[2531.67 s] epoch 148 lr=0.0001
End of epoch 148 train loss 0.0332206 val MSE=12892.7
[2548.68 s] epoch 149 lr=0.0001
End of epoch 149 train loss 0.0332094 val MSE=12880.3
[2565.73 s] epoch 150 lr=0.0001
End of epoch 150 train loss 0.0332005 val MSE=12871.1
[2582.76 s] epoch 151 lr=0.0001
End of epoch 151 train loss 0.0331914 val MSE=12861
[2599.81 s] epoch 152 lr=0.0001
End of epoch 152 train loss 0.0331836 val MSE=12865.8
Val loss did not improve for 10 epochs, reduce LR
[2616.82 s] epoch 153 lr=1e-05
End of epoch 153 train loss 0.0330098 val MSE=12807.1
Best validation loss so far, storing models/test_model.pt
[2633.89 s] epoch 154 lr=1e-05
End of epoch 154 train loss 0.0329883 val MSE=12809.3
[2650.98 s] epoch 155 lr=1e-05
End of epoch 155 train loss 0.0329796 val MSE=12800.4
Best validation loss so far, storing models/test_model.pt
[2668.06 s] epoch 156 lr=1e-05
End of epoch 156 train loss 0.0329735 val MSE=12811.2
[2685.12 s] epoch 157 lr=1e-05
End of epoch 157 train loss 0.0329685 val MSE=12806.8
[2702.14 s] epoch 158 lr=1e-05
End of epoch 158 train loss 0.0329637 val MSE=12809.2
[2719.89 s] epoch 159 lr=1e-05
End of epoch 159 train loss 0.0329608 val MSE=12807.8
[2736.97 s] epoch 160 lr=1e-05
End of epoch 160 train loss 0.0329578 val MSE=12806.5
[2754.06 s] epoch 161 lr=1e-05
End of epoch 161 train loss 0.0329552 val MSE=12809.8
[2771.14 s] epoch 162 lr=1e-05
End of epoch 162 train loss 0.0329529 val MSE=12804.8
[2788.27 s] epoch 163 lr=1e-05
End of epoch 163 train loss 0.0329498 val MSE=12810.1
[2805.36 s] epoch 164 lr=1e-05
End of epoch 164 train loss 0.0329475 val MSE=12823.1
[2822.45 s] epoch 165 lr=1e-05
End of epoch 165 train loss 0.0329453 val MSE=12811.2
Val loss did not improve for 10 epochs, reduce LR
[2839.50 s] epoch 166 lr=1e-06
End of epoch 166 train loss 0.0329214 val MSE=12814.9
[2856.64 s] epoch 167 lr=1e-06
End of epoch 167 train loss 0.0329197 val MSE=12811.3
[2873.73 s] epoch 168 lr=1e-06
End of epoch 168 train loss 0.0329192 val MSE=12812.5
[2890.79 s] epoch 169 lr=1e-06
End of epoch 169 train loss 0.0329189 val MSE=12811.2
[2907.87 s] epoch 170 lr=1e-06
End of epoch 170 train loss 0.0329187 val MSE=12814.8
[2924.98 s] epoch 171 lr=1e-06
End of epoch 171 train loss 0.0329184 val MSE=12816.7
[2942.07 s] epoch 172 lr=1e-06
End of epoch 172 train loss 0.0329184 val MSE=12812.9
[2959.57 s] epoch 173 lr=1e-06
End of epoch 173 train loss 0.0329184 val MSE=12812.9
[2977.04 s] epoch 174 lr=1e-06
End of epoch 174 train loss 0.0329173 val MSE=12814.7
[2994.30 s] epoch 175 lr=1e-06
End of epoch 175 train loss 0.0329176 val MSE=12814.6
[3011.38 s] epoch 176 lr=1e-06
End of epoch 176 train loss 0.0329165 val MSE=12812.6
Val loss did not improve for 10 epochs, reduce LR
[3028.87 s] epoch 177 lr=1e-07
End of epoch 177 train loss 0.0329145 val MSE=12815.6
[3046.31 s] epoch 178 lr=1e-07
End of epoch 178 train loss 0.032914 val MSE=12815.6
[3063.71 s] epoch 179 lr=1e-07
End of epoch 179 train loss 0.0329146 val MSE=12815.3
[3081.04 s] epoch 180 lr=1e-07
End of epoch 180 train loss 0.0329149 val MSE=12814.9
[3098.53 s] epoch 181 lr=1e-07
End of epoch 181 train loss 0.0329145 val MSE=12814.2
[3115.63 s] epoch 182 lr=1e-07
End of epoch 182 train loss 0.032914 val MSE=12814.1
[3132.66 s] epoch 183 lr=1e-07
End of epoch 183 train loss 0.0329136 val MSE=12814.5
[3149.66 s] epoch 184 lr=1e-07
End of epoch 184 train loss 0.0329143 val MSE=12815.4
[3166.70 s] epoch 185 lr=1e-07
End of epoch 185 train loss 0.0329141 val MSE=12813.4
[3183.75 s] epoch 186 lr=1e-07
End of epoch 186 train loss 0.032914 val MSE=12814.7
[3200.78 s] epoch 187 lr=1e-07
End of epoch 187 train loss 0.0329136 val MSE=12813.9
Val loss did not improve for 10 epochs, reduce LR
LR too small, stopping
Stop train_job rank=0
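The schedule visible in the log, dividing the learning rate by 10 whenever the validation MSE has not improved for 10 epochs and stopping once the LR falls below a small threshold, corresponds roughly to a plateau scheduler plus an explicit stop check. A sketch of equivalent loop logic; model, optimizer, train_one_epoch and evaluate are hypothetical placeholders:

    import torch

    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, factor=0.1, patience=10)       # "did not improve for 10 epochs, reduce LR"

    best = float("inf")
    for epoch in range(args.max_epochs):          # max_epochs=1000 above
        train_one_epoch(model, optimizer)         # hypothetical helper
        val_mse = evaluate(model)                 # hypothetical helper
        if val_mse < best:
            best = val_mse
            torch.save(model.state_dict(), args.model)  # "storing models/test_model.pt"
        scheduler.step(val_mse)                   # reduces LR on a validation plateau
        if optimizer.param_groups[0]["lr"] < 1e-7:      # threshold assumed from the log
            print("LR too small, stopping")
            break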