@ehartford
Last active December 21, 2023 22:33
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
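The warning above can be silenced by passing the defaulted flags explicitly. A hedged sketch of the launch command (the config file name is a placeholder; the flags are exactly the ones the warning names):

```shell
# Pin the values accelerate would otherwise default to.
# `your_config.yml` is illustrative; substitute your own axolotl config.
accelerate launch \
  --num_processes=4 \
  --num_machines=1 \
  --mixed_precision=no \
  --dynamo_backend=no \
  -m axolotl.cli.train your_config.yml
```

Alternatively, `accelerate config` writes these choices to a persistent config file once.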
/workspace/axolotl/transformers/src/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
Saving the dataset (0/16 shards):   0%|          | 0/1012643 [00:00<?, ? examples/s]
[per-shard progress output trimmed]
Saving the dataset (16/16 shards): 100%|██████████| 1012643/1012643 [00:53<00:00, 18839.42 examples/s]
Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]
[interleaved progress output from 4 ranks trimmed]
Loading checkpoint shards: 100%|██████████| 19/19 [00:43<00:00, 2.31s/it]
Using /workspace/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module cpu_adam...
Traceback (most recent call last):
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 129, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/workspace/axolotl/transformers/src/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/workspace/axolotl/transformers/src/transformers/trainer.py", line 1699, in _inner_training_loop
    deepspeed_load_checkpoint(self.model_wrapped, resume_from_checkpoint)
  File "/workspace/axolotl/transformers/src/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
    load_path, _ = deepspeed_engine.load_checkpoint(
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2720, in load_checkpoint
    load_path, client_states = self._load_checkpoint(load_dir,
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2790, in _load_checkpoint
    self.load_module_state_dict(checkpoint=checkpoint,
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2583, in load_module_state_dict
    self.module.load_state_dict(
  File "/home/azureuser/miniconda3/envs/axolotl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
	Missing key(s) in state_dict: "base_model.model.model.embed_tokens.original_module.weight", "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.0.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.0.block_sparse_moe.gate.weight", "base_model.model.model.layers.0.block_sparse_moe.experts.0.w1.base_layer.weight", "base_model.model.model.layers.0.block_sparse_moe.experts.0.w2.base_layer.weight", "base_model.model.model.layers.0.block_sparse_moe.experts.0.w3.base_layer.weight", ...
[key list trimmed: the same pattern repeats per layer — self_attn q/k/v/o_proj `.base_layer.weight`, block_sparse_moe `gate.weight`, experts 0-7 w1/w2/w3 `.base_layer.weight`, input_layernorm and post_attention_layernorm `.weight` — through layer 6, where the captured log cuts off mid-list]
"base_model.model.model.layers.6.block_sparse_moe.experts.2.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.2.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.2.w3.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.3.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.3.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.3.w3.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.4.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.4.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.4.w3.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.5.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.5.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.5.w3.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.6.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.6.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.6.w3.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.7.w1.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.7.w2.base_layer.weight", "base_model.model.model.layers.6.block_sparse_moe.experts.7.w3.base_layer.weight", "base_model.model.model.layers.6.input_layernorm.weight", "base_model.model.model.layers.6.post_attention_layernorm.weight", "base_model.model.model.layers.7.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.7.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.7.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.7.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.gate.weight", 
"base_model.model.model.layers.7.block_sparse_moe.experts.0.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.0.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.0.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.1.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.1.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.1.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.2.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.2.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.2.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.3.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.3.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.3.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.4.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.4.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.4.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.5.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.5.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.5.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.6.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.6.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.6.w3.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.7.w1.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.7.w2.base_layer.weight", "base_model.model.model.layers.7.block_sparse_moe.experts.7.w3.base_layer.weight", 
"base_model.model.model.layers.7.input_layernorm.weight", "base_model.model.model.layers.7.post_attention_layernorm.weight", "base_model.model.model.layers.8.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.8.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.8.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.8.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.gate.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.0.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.0.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.0.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.1.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.1.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.1.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.2.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.2.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.2.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.3.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.3.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.3.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.4.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.4.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.4.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.5.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.5.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.5.w3.base_layer.weight", 
"base_model.model.model.layers.8.block_sparse_moe.experts.6.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.6.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.6.w3.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.7.w1.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.7.w2.base_layer.weight", "base_model.model.model.layers.8.block_sparse_moe.experts.7.w3.base_layer.weight", "base_model.model.model.layers.8.input_layernorm.weight", "base_model.model.model.layers.8.post_attention_layernorm.weight", "base_model.model.model.layers.9.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.9.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.9.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.9.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.gate.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.0.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.0.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.0.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.1.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.1.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.1.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.2.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.2.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.2.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.3.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.3.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.3.w3.base_layer.weight", 
"base_model.model.model.layers.9.block_sparse_moe.experts.4.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.4.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.4.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.5.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.5.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.5.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.6.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.6.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.6.w3.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.7.w1.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.7.w2.base_layer.weight", "base_model.model.model.layers.9.block_sparse_moe.experts.7.w3.base_layer.weight", "base_model.model.model.layers.9.input_layernorm.weight", "base_model.model.model.layers.9.post_attention_layernorm.weight", "base_model.model.model.layers.10.self_attn.q_proj.base_layer.weight", "base_model.model.model.layers.10.self_attn.k_proj.base_layer.weight", "base_model.model.model.layers.10.self_attn.v_proj.base_layer.weight", "base_model.model.model.layers.10.self_attn.o_proj.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.gate.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.0.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.0.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.0.w3.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.1.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.1.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.1.w3.base_layer.weight", 
"base_model.model.model.layers.10.block_sparse_moe.experts.2.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.2.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.2.w3.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.3.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.3.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.3.w3.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.4.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.4.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.4.w3.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.5.w1.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.5.w2.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe.experts.5.w3.base_layer.weight", "base_model.model.model.layers.10.block_sparse_moe... (303 KB left)
[2023-12-21 21:24:13,589] [WARNING] [axolotl.validate_config:250] [PID:8259] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-21 21:24:13,591] [INFO] [axolotl.normalize_config:150] [PID:8259] [RANK:0] GPU memory usage baseline: 0.000GB (+0.858GB misc)
[2023-12-21 21:24:13,596] [WARNING] [axolotl.validate_config:250] [PID:8262] [RANK:3] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-21 21:24:13,598] [INFO] [axolotl.normalize_config:150] [PID:8262] [RANK:3] GPU memory usage baseline: 0.000GB (+0.858GB misc)
[2023-12-21 21:24:13,611] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-21 21:24:13,614] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-21 21:24:13,739] [WARNING] [axolotl.validate_config:250] [PID:8261] [RANK:2] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-21 21:24:13,741] [INFO] [axolotl.normalize_config:150] [PID:8261] [RANK:2] GPU memory usage baseline: 0.000GB (+0.858GB misc)
[2023-12-21 21:24:13,757] [WARNING] [axolotl.validate_config:250] [PID:8260] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2023-12-21 21:24:13,759] [INFO] [axolotl.normalize_config:150] [PID:8260] [RANK:1] GPU memory usage baseline: 0.000GB (+0.858GB misc)
[2023-12-21 21:24:13,759] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-21 21:24:13,777] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-21 21:24:15,971] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-21 21:24:16,197] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-21 21:24:16,289] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-12-21 21:24:16,289] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-12-21 21:24:16,351] [INFO] [comm.py:637:init_distributed] cdb=None
                                 dP            dP   dP
                                 88            88   88
      .d8888b. dP.  .dP .d8888b. 88 .d8888b. d8888P 88
      88'  `88  `8bd8'  88'  `88 88 88'  `88   88   88
      88.  .88  .d88b.  88.  .88 88 88.  .88   88   88
      `88888P8 dP'  `dP `88888P' dP `88888P'   dP   dP
[2023-12-21 21:24:16,530] [DEBUG] [axolotl.load_tokenizer:167] [PID:8261] [RANK:2] EOS: 32000 / <|im_end|>
[2023-12-21 21:24:16,530] [DEBUG] [axolotl.load_tokenizer:168] [PID:8261] [RANK:2] BOS: 1 / <s>
[2023-12-21 21:24:16,530] [DEBUG] [axolotl.load_tokenizer:169] [PID:8261] [RANK:2] PAD: 2 / </s>
[2023-12-21 21:24:16,530] [DEBUG] [axolotl.load_tokenizer:170] [PID:8261] [RANK:2] UNK: 0 / <unk>
[2023-12-21 21:24:16,532] [DEBUG] [axolotl.load_tokenizer:167] [PID:8259] [RANK:0] EOS: 32000 / <|im_end|>
[2023-12-21 21:24:16,532] [DEBUG] [axolotl.load_tokenizer:168] [PID:8259] [RANK:0] BOS: 1 / <s>
[2023-12-21 21:24:16,532] [DEBUG] [axolotl.load_tokenizer:169] [PID:8259] [RANK:0] PAD: 2 / </s>
[2023-12-21 21:24:16,532] [DEBUG] [axolotl.load_tokenizer:170] [PID:8259] [RANK:0] UNK: 0 / <unk>
[2023-12-21 21:24:16,533] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:8259] [RANK:0] Unable to find prepared dataset in last_run_prepared/a7d05eb3f13184aa9249688865626206
[2023-12-21 21:24:16,533] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:8259] [RANK:0] Loading raw datasets...
[2023-12-21 21:24:16,533] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:8259] [RANK:0] No seed provided, using default seed of 42
[2023-12-21 21:24:16,538] [DEBUG] [axolotl.load_tokenizer:167] [PID:8262] [RANK:3] EOS: 32000 / <|im_end|>
[2023-12-21 21:24:16,538] [DEBUG] [axolotl.load_tokenizer:168] [PID:8262] [RANK:3] BOS: 1 / <s>
[2023-12-21 21:24:16,538] [DEBUG] [axolotl.load_tokenizer:169] [PID:8262] [RANK:3] PAD: 2 / </s>
[2023-12-21 21:24:16,538] [DEBUG] [axolotl.load_tokenizer:170] [PID:8262] [RANK:3] UNK: 0 / <unk>
[2023-12-21 21:24:16,541] [DEBUG] [axolotl.load_tokenizer:167] [PID:8260] [RANK:1] EOS: 32000 / <|im_end|>
[2023-12-21 21:24:16,542] [DEBUG] [axolotl.load_tokenizer:168] [PID:8260] [RANK:1] BOS: 1 / <s>
[2023-12-21 21:24:16,542] [DEBUG] [axolotl.load_tokenizer:169] [PID:8260] [RANK:1] PAD: 2 / </s>
[2023-12-21 21:24:16,542] [DEBUG] [axolotl.load_tokenizer:170] [PID:8260] [RANK:1] UNK: 0 / <unk>
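Every rank reports EOS as token 32000 / `<|im_end|>` rather than Mistral's default `</s>`, which indicates ChatML special tokens were added to the tokenizer via the run's config. A typical axolotl config fragment producing this mapping (an assumption — the actual YAML for this run is not shown in the log):

```yaml
# Hypothetical fragment: maps EOS to the ChatML turn terminator and
# registers the ChatML markers as added tokens (hence id 32000, one past
# the base 32000-token Mistral vocabulary).
special_tokens:
  eos_token: "<|im_end|>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"
```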
[2023-12-21 21:24:18,957] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:8259] [RANK:0] merging datasets
[2023-12-21 21:24:19,025] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:8259] [RANK:0] shuffle merged datasets
[2023-12-21 21:24:19,046] [INFO] [axolotl.load_tokenized_prepared_datasets:369] [PID:8259] [RANK:0] Saving merged prepared dataset to disk... last_run_prepared/a7d05eb3f13184aa9249688865626206
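The "No seed provided, using default seed of 42" lines matter for reproducibility: the merged dataset is shuffled with a fixed seed, so every rank (and every rerun) sees the same example order and resolves to the same `last_run_prepared` cache hash. A minimal stdlib illustration of why a fixed seed makes the shuffle deterministic (axolotl itself shuffles via `datasets`, not this helper):

```python
import random

def seeded_shuffle(items, seed=42):
    """Shuffle a copy of `items` with a dedicated, seeded RNG."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

a = seeded_shuffle(range(10))
b = seeded_shuffle(range(10))
print(a == b)  # True: same seed, same order on every run
```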
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:8262] [RANK:3] Unable to find prepared dataset in last_run_prepared/a7d05eb3f13184aa9249688865626206
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:8260] [RANK:1] Unable to find prepared dataset in last_run_prepared/a7d05eb3f13184aa9249688865626206
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:8262] [RANK:3] Loading raw datasets...
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:147] [PID:8261] [RANK:2] Unable to find prepared dataset in last_run_prepared/a7d05eb3f13184aa9249688865626206
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:8260] [RANK:1] Loading raw datasets...
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:8262] [RANK:3] No seed provided, using default seed of 42
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:148] [PID:8261] [RANK:2] Loading raw datasets...
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:8260] [RANK:1] No seed provided, using default seed of 42
[2023-12-21 21:25:16,667] [INFO] [axolotl.load_tokenized_prepared_datasets:153] [PID:8261] [RANK:2] No seed provided, using default seed of 42
[2023-12-21 21:25:18,851] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:8260] [RANK:1] merging datasets
[2023-12-21 21:25:18,888] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:8261] [RANK:2] merging datasets
[2023-12-21 21:25:18,927] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:8260] [RANK:1] shuffle merged datasets
[2023-12-21 21:25:18,963] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:8261] [RANK:2] shuffle merged datasets
[2023-12-21 21:25:19,618] [INFO] [axolotl.load_tokenized_prepared_datasets:362] [PID:8262] [RANK:3] merging datasets
[2023-12-21 21:25:19,693] [INFO] [axolotl.load_tokenized_prepared_datasets:366] [PID:8262] [RANK:3] shuffle merged datasets
[2023-12-21 21:25:22,209] [DEBUG] [axolotl.log:60] [PID:8259] [RANK:0] total_num_tokens: 577123019
[2023-12-21 21:25:29,857] [DEBUG] [axolotl.log:60] [PID:8259] [RANK:0] `total_supervised_tokens: 336627609`
[2023-12-21 21:25:38,896] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8259] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:25:38,896] [DEBUG] [axolotl.log:60] [PID:8259] [RANK:0] data_loader_len: 11623
[2023-12-21 21:25:43,214] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8260] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:25:43,419] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8262] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:25:43,552] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8261] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:25:43,567] [INFO] [axolotl.log:60] [PID:8259] [RANK:0] sample_packing_eff_est across ranks: [0.9902671575546265, 0.9902671575546265, 0.9903507232666016, 0.9902671575546265]
[2023-12-21 21:25:43,567] [DEBUG] [axolotl.log:60] [PID:8259] [RANK:0] sample_packing_eff_est: 1.0
[2023-12-21 21:25:43,568] [DEBUG] [axolotl.log:60] [PID:8259] [RANK:0] total_num_steps: 8717
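The multipack numbers above are easy to sanity-check: the 577,123,019 total tokens are split evenly across the 4 GPUs, and the final `sample_packing_eff_est: 1.0` is consistent with taking the per-rank maximum and rounding it up to two decimal places (an assumption about axolotl's rounding, not something the log states):

```python
import math

# Figures taken from the log lines above.
total_num_tokens = 577_123_019
per_device = total_num_tokens // 4
print(per_device)  # 144280754, matching "total_num_tokens per device"

rank_estimates = [0.9902671575546265, 0.9902671575546265,
                  0.9903507232666016, 0.9902671575546265]
# Assumed rounding rule: ceil the max estimate to 2 decimals.
eff_est = math.ceil(max(rank_estimates) * 100) / 100
print(eff_est)  # 1.0, matching "sample_packing_eff_est: 1.0"
```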
[2023-12-21 21:25:43,576] [DEBUG] [axolotl.train.log:60] [PID:8259] [RANK:0] loading tokenizer... /workspace/models/Mixtral-8x7B-v0.1
[2023-12-21 21:25:43,676] [DEBUG] [axolotl.load_tokenizer:167] [PID:8259] [RANK:0] EOS: 32000 / <|im_end|>
[2023-12-21 21:25:43,676] [DEBUG] [axolotl.load_tokenizer:168] [PID:8259] [RANK:0] BOS: 1 / <s>
[2023-12-21 21:25:43,676] [DEBUG] [axolotl.load_tokenizer:169] [PID:8259] [RANK:0] PAD: 2 / </s>
[2023-12-21 21:25:43,676] [DEBUG] [axolotl.load_tokenizer:170] [PID:8259] [RANK:0] UNK: 0 / <unk>
[2023-12-21 21:25:43,676] [DEBUG] [axolotl.train.log:60] [PID:8259] [RANK:0] loading model and peft_config...
[2023-12-21 21:25:43,677] [INFO] [axolotl.load_model:262] [PID:8259] [RANK:0] patching with flash attention
[2023-12-21 21:25:43,678] [DEBUG] [axolotl.load_tokenizer:167] [PID:8260] [RANK:1] EOS: 32000 / <|im_end|>
[2023-12-21 21:25:43,678] [DEBUG] [axolotl.load_tokenizer:168] [PID:8260] [RANK:1] BOS: 1 / <s>
[2023-12-21 21:25:43,678] [DEBUG] [axolotl.load_tokenizer:169] [PID:8260] [RANK:1] PAD: 2 / </s>
[2023-12-21 21:25:43,678] [DEBUG] [axolotl.load_tokenizer:170] [PID:8260] [RANK:1] UNK: 0 / <unk>
[2023-12-21 21:25:43,679] [DEBUG] [axolotl.load_tokenizer:167] [PID:8261] [RANK:2] EOS: 32000 / <|im_end|>
[2023-12-21 21:25:43,679] [DEBUG] [axolotl.load_tokenizer:168] [PID:8261] [RANK:2] BOS: 1 / <s>
[2023-12-21 21:25:43,679] [DEBUG] [axolotl.load_tokenizer:169] [PID:8261] [RANK:2] PAD: 2 / </s>
[2023-12-21 21:25:43,679] [DEBUG] [axolotl.load_tokenizer:170] [PID:8261] [RANK:2] UNK: 0 / <unk>
[2023-12-21 21:25:43,680] [INFO] [axolotl.load_model:262] [PID:8260] [RANK:1] patching with flash attention
[2023-12-21 21:25:43,680] [INFO] [axolotl.load_model:262] [PID:8261] [RANK:2] patching with flash attention
[2023-12-21 21:25:43,680] [DEBUG] [axolotl.load_tokenizer:167] [PID:8262] [RANK:3] EOS: 32000 / <|im_end|>
[2023-12-21 21:25:43,680] [DEBUG] [axolotl.load_tokenizer:168] [PID:8262] [RANK:3] BOS: 1 / <s>
[2023-12-21 21:25:43,680] [DEBUG] [axolotl.load_tokenizer:169] [PID:8262] [RANK:3] PAD: 2 / </s>
[2023-12-21 21:25:43,680] [DEBUG] [axolotl.load_tokenizer:170] [PID:8262] [RANK:3] UNK: 0 / <unk>
[2023-12-21 21:25:43,681] [INFO] [axolotl.load_model:262] [PID:8262] [RANK:3] patching with flash attention
[2023-12-21 21:26:27,154] [INFO] [axolotl.load_model:505] [PID:8259] [RANK:0] GPU memory usage after model load: 23.792GB (+0.196GB cache, +1.500GB misc)
[2023-12-21 21:26:27,159] [INFO] [axolotl.load_model:528] [PID:8259] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-12-21 21:26:27,171] [INFO] [axolotl.load_model:540] [PID:8259] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2023-12-21 21:26:27,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.<module>:16] [PID:8259] CUDA extension not installed.
[2023-12-21 21:26:27,202] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.<module>:15] [PID:8259] CUDA extension not installed.
[2023-12-21 21:26:27,466] [INFO] [axolotl.load_model:505] [PID:8260] [RANK:1] GPU memory usage after model load: 23.792GB (+0.196GB cache, +1.500GB misc)
[2023-12-21 21:26:27,471] [INFO] [axolotl.load_model:528] [PID:8260] [RANK:1] converting PEFT model w/ prepare_model_for_kbit_training
[2023-12-21 21:26:27,483] [INFO] [axolotl.load_model:540] [PID:8260] [RANK:1] converting modules to torch.bfloat16 for flash attention
[2023-12-21 21:26:27,511] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.<module>:16] [PID:8260] CUDA extension not installed.
[2023-12-21 21:26:27,512] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.<module>:15] [PID:8260] CUDA extension not installed.
[2023-12-21 21:26:29,286] [INFO] [axolotl.load_model:505] [PID:8262] [RANK:3] GPU memory usage after model load: 23.792GB (+0.196GB cache, +1.500GB misc)
[2023-12-21 21:26:29,290] [INFO] [axolotl.load_model:528] [PID:8262] [RANK:3] converting PEFT model w/ prepare_model_for_kbit_training
[2023-12-21 21:26:29,303] [INFO] [axolotl.load_model:540] [PID:8262] [RANK:3] converting modules to torch.bfloat16 for flash attention
[2023-12-21 21:26:29,332] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.<module>:16] [PID:8262] CUDA extension not installed.
[2023-12-21 21:26:29,333] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.<module>:15] [PID:8262] CUDA extension not installed.
[2023-12-21 21:26:30,823] [INFO] [axolotl.load_model:505] [PID:8261] [RANK:2] GPU memory usage after model load: 23.792GB (+0.196GB cache, +1.500GB misc)
[2023-12-21 21:26:30,827] [INFO] [axolotl.load_model:528] [PID:8261] [RANK:2] converting PEFT model w/ prepare_model_for_kbit_training
[2023-12-21 21:26:30,840] [INFO] [axolotl.load_model:540] [PID:8261] [RANK:2] converting modules to torch.bfloat16 for flash attention
[2023-12-21 21:26:30,869] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.<module>:16] [PID:8261] CUDA extension not installed.
[2023-12-21 21:26:30,870] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.<module>:15] [PID:8261] CUDA extension not installed.
trainable params: 742,408,192 || all params: 47,445,217,280 || trainable%: 1.5647692951191392
trainable params: 742,408,192 || all params: 47,445,217,280 || trainable%: 1.5647692951191392
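The PEFT summary lines are self-consistent: the LoRA adapter trains roughly 1.56% of the ~47.4B total parameters (Mixtral-8x7B base plus adapter weights). A quick check of the printed percentage:

```python
# Verify trainable% from the PEFT summary printed in the log.
trainable, total = 742_408_192, 47_445_217_280
pct = 100 * trainable / total
print(f"{pct:.10f}")  # ~1.5647692951, matching "trainable%: 1.5647692951191392"
```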
[2023-12-21 21:26:31,049] [INFO] [axolotl.load_model:570] [PID:8259] [RANK:0] GPU memory usage after adapters: 26.093GB (+0.071GB cache, +1.500GB misc)
[2023-12-21 21:26:31,058] [INFO] [axolotl.train.log:60] [PID:8259] [RANK:0] Pre-saving adapter config to /workspace/dolphin-2.6-mixtral-8x7b
[2023-12-21 21:26:31,061] [INFO] [axolotl.train.log:60] [PID:8259] [RANK:0] Starting trainer...
[2023-12-21 21:26:31,092] [INFO] [axolotl.load_model:570] [PID:8260] [RANK:1] GPU memory usage after adapters: 26.093GB (+0.071GB cache, +1.500GB misc)
trainable params: 742,408,192 || all params: 47,445,217,280 || trainable%: 1.5647692951191392
[2023-12-21 21:26:32,980] [INFO] [axolotl.load_model:570] [PID:8262] [RANK:3] GPU memory usage after adapters: 26.093GB (+0.071GB cache, +1.500GB misc)
[2023-12-21 21:26:34,346] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8260] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:34,403] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8259] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
trainable params: 742,408,192 || all params: 47,445,217,280 || trainable%: 1.5647692951191392
[2023-12-21 21:26:34,533] [INFO] [axolotl.load_model:570] [PID:8261] [RANK:2] GPU memory usage after adapters: 26.093GB (+0.071GB cache, +1.500GB misc)
[2023-12-21 21:26:34,781] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8260] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:34,837] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8259] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:35,216] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8260] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:35,271] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8259] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:35,651] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8260] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:35,710] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8259] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:36,177] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8262] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:36,619] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8262] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:37,057] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8262] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:37,499] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8262] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:37,823] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8261] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:38,264] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8261] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:38,704] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8261] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
[2023-12-21 21:26:39,151] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:8261] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 144280754
Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.8 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
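The four "Installed CUDA version 11.8 does not match..." lines (one per rank) come from DeepSpeed's JIT op builder: it warns on a CUDA minor-version mismatch between the system toolkit (11.8) and the one torch was compiled with (11.7), but proceeds because minor releases within a major version are API-compatible. A simplified sketch of that acceptance rule (DeepSpeed's real check lives in its op_builder module and has additional special cases):

```python
def cuda_mismatch_acceptable(installed: str, compiled: str) -> bool:
    """Accept a CUDA toolkit/torch mismatch when major versions agree
    (simplified model of DeepSpeed's compatibility check)."""
    return installed.split(".")[0] == compiled.split(".")[0]

print(cuda_mismatch_acceptable("11.8", "11.7"))  # True: warn but proceed
print(cuda_mismatch_acceptable("12.1", "11.7"))  # False: would be rejected
```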
ninja: no work to do.
Time to load cpu_adam op: 2.415133237838745 seconds
Time to load cpu_adam op: 2.425196647644043 seconds
Time to load cpu_adam op: 2.431770086288452 seconds
Time to load cpu_adam op: 2.4151530265808105 seconds
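The `cpu_adam` load times (and ninja's "no work to do", meaning the extension was already compiled) indicate the optimizer state is offloaded to CPU through DeepSpeed ZeRO. A typical config fragment that triggers this path — an assumption about this run's setup, not the exact deepspeed JSON used here:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

With `offload_optimizer.device` set to `cpu`, DeepSpeed swaps in its C++ `cpu_adam` optimizer, which is JIT-built on first use and loaded (here in ~2.4 s per rank) on subsequent runs.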