******* loading model args.model='t5'
--> World Size = 1
--> Device_count = 2
--> running with these defaults train_config(seed=2023, verbose=True, total_steps_to_run=8, warmup_steps=5, use_orig_params=True, limit_all_gathers=True, use_ddp=False, ddp_bucket_size=25, ddp_use_gradient_view=False, hf_t5_checkpointing=False, print_memory_summary=False, print_training_loss_data=False, num_epochs=4, model_weights_bf16=False, use_mixed_precision=True, use_low_precision_gradient_policy=False, use_tf32=True, optimizer='AdamW', ap_use_kahan_summation=False, sharding_strategy=<ShardingStrategy.FULL_SHARD: 1>, print_sharding_plan=False, run_profiler=False, profile_folder='fsdp/profile_tracing', log_every=1, num_workers_dataloader=2, batch_size_training=16, fsdp_activation_checkpointing=True, use_fused_attention=False, use_parallel_attention=False, run_validation=True, memory_report=True, nccl_debug_handler=True, distributed_debug=True, use_non_recursive_wrapping=False, use_synthetic_data=False, use_deferred_init=False, use_torch_compile=True, save_model_checkpoint=False, load_model_checkpoint=False, checkpoint_max_save_count=2, save_optimizer=False, load_optimizer=False, optimizer_checkpoint_file='Adam-t5--1.pt', checkpoint_model_filename='t5--1.pt')
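A few of these config flags map directly onto PyTorch calls; a minimal sketch inferred from the flag names (assumed, not taken from the script itself):

    import torch

    torch.manual_seed(2023)                       # seed=2023
    torch.backends.cuda.matmul.allow_tf32 = True  # use_tf32=True
    torch.backends.cudnn.allow_tf32 = True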
clearing gpu cache for all ranks
--> running with torch dist debug set to detail
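DETAIL-level distributed debugging is controlled by an environment variable, set before torch.distributed is initialized; a sketch:

    import os
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"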
--> total memory per gpu (GB) = 39.564
wrapping policy is functools.partial(<function transformer_auto_wrap_policy at 0x7f862c480ca0>, transformer_layer_cls={<class 'transformers.models.t5.modeling_t5.T5Block'>})
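This policy tells FSDP to shard at T5Block granularity. A sketch of how the policy printed above is typically constructed:

    import functools
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.t5.modeling_t5 import T5Block

    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={T5Block},
    )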
neither pokemon nor beans dataset enabled
Found cached dataset csv (/data/home/anijain/.cache/huggingface/datasets/csv/default-6c28f355c35f3029/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|██████████| 1/1 [00:00<00:00, 306.56it/s]
Found cached dataset csv (/data/home/anijain/.cache/huggingface/datasets/csv/default-6c28f355c35f3029/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|██████████| 1/1 [00:00<00:00, 481.55it/s]
--> Prepping t5-small model ...
stats is ready....? _stats=defaultdict(<class 'list'>, {'best_accuracy': 0.0}), local_rank=0, rank=0
***** building the model ******
using deferred? False
t5, GPU peak memory allocation: 0.0GB, GPU peak memory reserved: 0.0GB, GPU peak memory active: 0.0GB
--> t5-small built.
built model with 60.506624M params
bf16 check passed
--> Running with mixed precision MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16, keep_low_precision_grads=False, cast_forward_inputs=False, cast_root_forward_inputs=True, _module_classes_to_ignore=(<class 'torch.nn.modules.batchnorm._BatchNorm'>,)) policy
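The bf16 policy above casts parameters, gradient reduction, and buffers to bfloat16; a sketch of the equivalent construction (the remaining fields printed above are defaults):

    import torch
    from torch.distributed.fsdp import MixedPrecision

    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )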
backward prefetch set to BackwardPrefetch.BACKWARD_PRE
sharding set to ShardingStrategy.FULL_SHARD
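Putting the pieces together, the FSDP wrapper was presumably built along these lines (a sketch reusing auto_wrap_policy and bf16_policy from above; use_orig_params=True and limit_all_gathers=True come from the config listing):

    import torch
    from torch.distributed.fsdp import (
        BackwardPrefetch,
        FullyShardedDataParallel as FSDP,
        ShardingStrategy,
    )

    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=bf16_policy,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        use_orig_params=True,   # needed for torch.compile over FSDP at this time
        limit_all_gathers=True,
        device_id=torch.cuda.current_device(),
    )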
--> Batch Size = 16
t5, GPU peak memory allocation: 0.0GB, GPU peak memory reserved: 0.0GB, GPU peak memory active: 0.0GB
--> FSDP activation checkpointing in use
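The script's exact checkpointing call isn't shown in the log; a common way to apply FSDP activation checkpointing to the same T5Block units is:

    import functools
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        CheckpointImpl,
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )

    non_reentrant_wrapper = functools.partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda m: isinstance(m, T5Block),
    )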
--> Torch.compile in use
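Compilation itself is a one-liner; the default inductor backend is what produces the torch._inductor warnings further down:

    model = torch.compile(model)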
local rank 0 init time = 2.1330501430202276
memory stats reset, ready to track
Running with AdamW optimizer, with fusion set to True
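The fused CUDA implementation of AdamW is selected with a constructor flag; a sketch (the learning rate is a hypothetical value, as it isn't shown in the log):

    import torch

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,     # hypothetical value, not from the log
        fused=True,  # "fusion set to True"
    )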
Epoch: 0 starting...
r0 Training Epoch: 0%| | 0/814 [00:00<?, ?it/s][rank0]:[2023-07-12 16:06:38,552] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:45,098] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:50,665] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:52,371] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:53,787] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:55,185] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:03,010] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:06,595] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:09,035] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:11,628] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:13,970] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:16,581] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
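These repeated warnings come from TorchInductor lowering dropout's RNG to Triton's generator, so compiled numerics can differ bitwise from eager mode; one warning is emitted per compiled graph. For debugging, inductor can be asked to fall back to eager-compatible randomness via a config knob (slower; set before compilation):

    import torch._inductor.config as inductor_config
    inductor_config.fallback_random = True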
r0 Training Epoch: 0%|▎ | 1/814 [01:00<13:33:31, 60.04s/it]step: 1: time taken for the last 1 steps is 0.08296773280017078, loss is 10.0
r0 Training Epoch: 0%|█ | 3/814 [01:00<3:31:17, 15.63s/it]step: 2: time taken for the last 1 steps is 0.09435613313689828, loss is 12.375
step: 3: time taken for the last 1 steps is 0.09597105905413628, loss is 12.4375
r0 Training Epoch: 1%|█▊ | 5/814 [01:00<1:43:04, 7.64s/it]step: 4: time taken for the last 1 steps is 0.09714369312860072, loss is 11.0625
step: 5: time taken for the last 1 steps is 0.09657704387791455, loss is 11.125
r0 Training Epoch: 1%|██▌ | 7/814 [01:00<59:49, 4.45s/it]step: 6: time taken for the last 1 steps is 0.0961470000911504, loss is 9.875
step: 7: time taken for the last 1 steps is 0.096938586095348, loss is 9.0625
r0 Training Epoch: 1%|██▊ | 8/814 [01:00<1:42:02, 7.60s/it]
tracking_duration [60.03929250803776, 0.08296773280017078, 0.09435613313689828, 0.09597105905413628, 0.09714369312860072, 0.09657704387791455, 0.0961470000911504, 0.096938586095348]
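The ~60 s first entry is dominated by torch.compile's one-time compilation, which is why the summary below excludes 5 warmup steps. A sketch of that average (values copied from the list above):

    durations = [60.03929250803776, 0.08296773280017078, 0.09435613313689828,
                 0.09597105905413628, 0.09714369312860072, 0.09657704387791455,
                 0.0961470000911504, 0.096938586095348]
    steady = durations[5:]  # drop warmup_steps=5
    print(f"avg over {len(steady)} steps: {sum(steady) / len(steady):.4f} s")  # 0.0966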
** exit loop - rank 0 reporting....
--> cuda max reserved memory = 4.7949 GB
--> max reserved percentage = 12.12 %
--> cuda max memory allocated = 3.7062 GB
--> max allocated percentage = 9.37 %
--> peak active memory = 3.7375 GB
--> peak active percentage = 9.45 %
cudaMalloc retries = 0
cuda OOM = 0
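The memory report above reads straight out of CUDA's caching-allocator statistics; a sketch of how those numbers can be queried:

    import torch

    gb = 1024 ** 3
    stats = torch.cuda.memory_stats()
    print(f"cuda max reserved  = {torch.cuda.max_memory_reserved() / gb:.4f} GB")
    print(f"cuda max allocated = {torch.cuda.max_memory_allocated() / gb:.4f} GB")
    print(f"peak active        = {stats['active_bytes.all.peak'] / gb:.4f} GB")
    print(f"cudaMalloc retries = {stats['num_alloc_retries']}")
    print(f"cuda OOM           = {stats['num_ooms']}")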
Validation loss data
Accuracy validation
--> Highest Val Accuracy = 0
--> Step avg speed (in seconds) based on -5 steps: -0.0
excluding 5 steps as warmup
--> Step avg speed based on 3 steps: 0.0966 seconds
Dist Training Framework used = FSDP
This was run with TensorParallel? = False
Run with Parallel Attention? False
Batch size used = 16
FSDP Activation Checkpointing? = True
--> Model Size = 60.506624 M Params