******* loading model args.model='t5'
--> World Size = 1
--> Device_count = 2
--> running with these defaults train_config(seed=2023, verbose=True, total_steps_to_run=8, warmup_steps=5, use_orig_params=True, limit_all_gathers=True, use_ddp=False, ddp_bucket_size=25, ddp_use_gradient_view=False, hf_t5_checkpointing=False, print_memory_summary=False, print_training_loss_data=False, num_epochs=4, model_weights_bf16=False, use_mixed_precision=True, use_low_precision_gradient_policy=False, use_tf32=True, optimizer='AdamW', ap_use_kahan_summation=False, sharding_strategy=<ShardingStrategy.FULL_SHARD: 1>, print_sharding_plan=False, run_profiler=False, profile_folder='fsdp/profile_tracing', log_every=1, num_workers_dataloader=2, batch_size_training=16, fsdp_activation_checkpointing=True, use_fused_attention=False, use_parallel_attention=False, run_validation=True, memory_report=True, nccl_debug_handler=True, distributed_debug=True, use_non_recursive_wrapping=False, use_synthetic_data=False, use_deferred_init=False, use_torch_compile=True, save_model_checkpoint=False, load_model_checkpoint=False, checkpoint_max_save_count=2, save_optimizer=False, load_optimizer=False, optimizer_checkpoint_file='Adam-t5--1.pt', checkpoint_model_filename='t5--1.pt')
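A few of these config flags map directly onto PyTorch calls; a minimal sketch inferred from the flag names (assumed, not taken from the script itself):

    import torch

    torch.manual_seed(2023)                       # seed=2023
    torch.backends.cuda.matmul.allow_tf32 = True  # use_tf32=True
    torch.backends.cudnn.allow_tf32 = True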
clearing gpu cache for all ranks
--> running with torch dist debug set to detail
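DETAIL-level distributed debugging is controlled by an environment variable, set before torch.distributed is initialized; a sketch:

    import os
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"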
--> total memory per gpu (GB) = 39.564
wrapping policy is functools.partial(<function transformer_auto_wrap_policy at 0x7f862c480ca0>, transformer_layer_cls={<class 'transformers.models.t5.modeling_t5.T5Block'>})
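This policy tells FSDP to shard at T5Block granularity. A sketch of how the policy printed above is typically constructed:

    import functools
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.t5.modeling_t5 import T5Block

    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={T5Block},
    )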
neither pokemon nor beans dataset enabled
Found cached dataset csv (/data/home/anijain/.cache/huggingface/datasets/csv/default-6c28f355c35f3029/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|██████████| 1/1 [00:00<00:00, 306.56it/s]
Found cached dataset csv (/data/home/anijain/.cache/huggingface/datasets/csv/default-6c28f355c35f3029/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d)
100%|██████████| 1/1 [00:00<00:00, 481.55it/s]
--> Prepping t5-small model ...
stats is ready....? _stats=defaultdict(<class 'list'>, {'best_accuracy': 0.0}), local_rank=0, rank=0
***** building the model ******
using deferred? False
t5, GPU peak memory allocation: 0.0GB, GPU peak memory reserved: 0.0GB, GPU peak memory active: 0.0GB
--> t5-small built.
built model with 60.506624M params
bf16 check passed
--> Running with mixed precision MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16, keep_low_precision_grads=False, cast_forward_inputs=False, cast_root_forward_inputs=True, _module_classes_to_ignore=(<class 'torch.nn.modules.batchnorm._BatchNorm'>,)) policy
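The bf16 policy above casts parameters, gradient reduction, and buffers to bfloat16; a sketch of the equivalent construction (the remaining fields printed above are defaults):

    import torch
    from torch.distributed.fsdp import MixedPrecision

    bf16_policy = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )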
backward prefetch set to BackwardPrefetch.BACKWARD_PRE
sharding set to ShardingStrategy.FULL_SHARD
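Putting the pieces together, the FSDP wrapper was presumably built along these lines (a sketch reusing auto_wrap_policy and bf16_policy from above; use_orig_params=True and limit_all_gathers=True come from the config listing):

    import torch
    from torch.distributed.fsdp import (
        BackwardPrefetch,
        FullyShardedDataParallel as FSDP,
        ShardingStrategy,
    )

    model = FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        mixed_precision=bf16_policy,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        use_orig_params=True,   # needed for torch.compile over FSDP at this time
        limit_all_gathers=True,
        device_id=torch.cuda.current_device(),
    )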
--> Batch Size = 16
t5, GPU peak memory allocation: 0.0GB, GPU peak memory reserved: 0.0GB, GPU peak memory active: 0.0GB
--> FSDP activation checkpointing in use
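The script's exact checkpointing call isn't shown in the log; a common way to apply FSDP activation checkpointing to the same T5Block units is:

    import functools
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        CheckpointImpl,
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )

    non_reentrant_wrapper = functools.partial(
        checkpoint_wrapper,
        checkpoint_impl=CheckpointImpl.NO_REENTRANT,
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=non_reentrant_wrapper,
        check_fn=lambda m: isinstance(m, T5Block),
    )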
--> Torch.compile in use
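Compilation itself is a one-liner; the default inductor backend is what produces the torch._inductor warnings further down:

    model = torch.compile(model)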
local rank 0 init time = 2.1330501430202276
memory stats reset, ready to track
Running with AdamW optimizer, with fusion set to True
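The fused CUDA implementation of AdamW is selected with a constructor flag; a sketch (the learning rate is a hypothetical value, as it isn't shown in the log):

    import torch

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=3e-4,     # hypothetical value, not from the log
        fused=True,  # "fusion set to True"
    )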
Epoch: 0 starting...
r0 Training Epoch: 0%| | 0/814 [00:00<?, ?it/s][rank0]:[2023-07-12 16:06:38,552] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:45,098] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:50,665] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:52,371] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:53,787] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:06:55,185] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:03,010] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:06,595] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:09,035] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:11,628] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:13,970] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
[rank0]:[2023-07-12 16:07:16,581] torch._inductor.utils: [WARNING] using triton random, expect difference from eager
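These repeated warnings come from TorchInductor lowering dropout's RNG to Triton's generator, so compiled numerics can differ bitwise from eager mode; one warning is emitted per compiled graph. For debugging, inductor can be asked to fall back to eager-compatible randomness via a config knob (slower; set before compilation):

    import torch._inductor.config as inductor_config
    inductor_config.fallback_random = True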
r0 Training Epoch: 0%|▎ | 1/814 [01:00<13:33:31, 60.04s/it]step: 1: time taken for the last 1 steps is 0.08296773280017078, loss is 10.0
r0 Training Epoch: 0%|█ | 3/814 [01:00<3:31:17, 15.63s/it]step: 2: time taken for the last 1 steps is 0.09435613313689828, loss is 12.375
step: 3: time taken for the last 1 steps is 0.09597105905413628, loss is 12.4375
r0 Training Epoch: 1%|█▊ | 5/814 [01:00<1:43:04, 7.64s/it]step: 4: time taken for the last 1 steps is 0.09714369312860072, loss is 11.0625
step: 5: time taken for the last 1 steps is 0.09657704387791455, loss is 11.125
r0 Training Epoch: 1%|██▌ | 7/814 [01:00<59:49, 4.45s/it]step: 6: time taken for the last 1 steps is 0.0961470000911504, loss is 9.875
step: 7: time taken for the last 1 steps is 0.096938586095348, loss is 9.0625
r0 Training Epoch: 1%|██▊ | 8/814 [01:00<1:42:02, 7.60s/it]
tracking_duration [60.03929250803776, 0.08296773280017078, 0.09435613313689828, 0.09597105905413628, 0.09714369312860072, 0.09657704387791455, 0.0961470000911504, 0.096938586095348]
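The ~60 s first entry is dominated by torch.compile's one-time compilation, which is why the summary below excludes 5 warmup steps. A sketch of that average (values copied from the list above):

    durations = [60.03929250803776, 0.08296773280017078, 0.09435613313689828,
                 0.09597105905413628, 0.09714369312860072, 0.09657704387791455,
                 0.0961470000911504, 0.096938586095348]
    steady = durations[5:]  # drop warmup_steps=5
    print(f"avg over {len(steady)} steps: {sum(steady) / len(steady):.4f} s")  # 0.0966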
** exit loop - rank 0 reporting....
--> cuda max reserved memory = 4.7949 GB
--> max reserved percentage = 12.12 %
--> cuda max memory allocated = 3.7062 GB
--> max allocated percentage = 9.37 %
--> peak active memory = 3.7375 GB
--> peak active percentage = 9.45 %
cudaMalloc retries = 0
cuda OOM = 0
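The memory report above reads straight out of CUDA's caching-allocator statistics; a sketch of how those numbers can be queried:

    import torch

    gb = 1024 ** 3
    stats = torch.cuda.memory_stats()
    print(f"cuda max reserved  = {torch.cuda.max_memory_reserved() / gb:.4f} GB")
    print(f"cuda max allocated = {torch.cuda.max_memory_allocated() / gb:.4f} GB")
    print(f"peak active        = {stats['active_bytes.all.peak'] / gb:.4f} GB")
    print(f"cudaMalloc retries = {stats['num_alloc_retries']}")
    print(f"cuda OOM           = {stats['num_ooms']}")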
Validation loss data
Accuracy validation
--> Highest Val Accuracy = 0
--> Step avg speed (in seconds) based on -5 steps: -0.0
excluding 5 steps as warmup
--> Step avg speed based on 3 steps: 0.0966 seconds
Dist Training Framework used = FSDP
This was run with TensorParallel? = False
Run with Parallel Attention? False
Batch size used = 16
FSDP Activation Checkpointing? = True
--> Model Size = 60.506624 M Params