ananthsub / gist:a3805c4584b11c8f92e5941559f7adf6
Created September 20, 2025 01:23
nemotron_h_bf16_with_fp8_current_scaling_mixed + tp comm overlap stacktrace
[rank7]: File "/opt/Megatron-Bridge/src/megatron/bridge/training/pretrain.py", line 63, in pretrain
[rank7]: train(
[rank7]: File "/opt/Megatron-Bridge/src/megatron/bridge/training/train.py", line 277, in train
[rank7]: loss_dict, skipped_iter, should_checkpoint, should_exit, exit_code, grad_norm, num_zeros_in_grad = train_step(
[rank7]: ^^^^^^^^^^^
[rank7]: File "/opt/Megatron-Bridge/src/megatron/bridge/training/train.py", line 520, in train_step
[rank7]: losses_reduced = forward_backward_func(
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/megatron-lm/megatron/core/pipeline_parallel/schedules.py", line 600, in forward_backward_no_pipelining
[rank7]: output_tensor, num_tokens = forward_step(
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/mnt/xarfuse/uid-30415/c18376c8-seed-f10abbe6-4964-43d9-a723-5ecbc73379c7-ns-4026534030/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/mnt/xarfuse/uid-30415/c18376c8-seed-f10abbe6-4964-43d9-a723-5ecbc73379c7-ns-4026534030/torchelastic/agent/server/local_elastic_agent.py", line 101, in _wrap
ret = fn(*args)
File "/mnt/xarfuse/uid-30415/c18376c8-seed-f10abbe6-4964-43d9-a723-5ecbc73379c7-ns-4026534030/f6/stdlib/sample_projects/classy_hydra_project/main.py", line 32, in train
trainer.fit(task)