Last updated: November 3, 2025
This guide covers training nanochat on 2 DGX Sparks linked via QSFP/CX7. I estimate training will take about 5 days (roughly half the training time on a single DGX Spark).
Follow these NVIDIA tutorials to link and test your Spark cluster:
- Stack 2 Sparks: https://build.nvidia.com/spark/connect-two-sparks/stacked-sparks
- Test NCCL: https://build.nvidia.com/spark/nccl
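The NVIDIA tutorial above covers NCCL testing in detail, but as a rough sketch of what a 2-node check looks like (assuming nccl-tests is built with MPI=1 on both Sparks, passwordless SSH is set up between them, and you substitute your own IPs and interface name for the placeholders):
# Hypothetical 2-node bandwidth check with nccl-tests (adjust paths/IPs for your setup)
mpirun -np 2 -H <MAIN_SPARK_IP>,<WORKER_SPARK_IP> \
  -x NCCL_SOCKET_IFNAME=<YOUR_IB_IFACE> \
  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1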
Then make sure that you can train nanochat on each Spark separately (i.e., single-node training). There are a few different ways to set up single-node training; here is how I do it. The rest of this tutorial assumes you are training the way I do, inside the NVIDIA PyTorch Docker container.
The steps below assume that you are in the nanochat/ folder cloned from GitHub and have tested single-node training, so the training data is already present in ~/.cache/nanochat. Designate one of the Sparks as the main Spark and run the following steps on each Spark.
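As a quick optional sanity check, confirm the nanochat cache is actually populated on both Sparks before going further:
# Optional: verify the training data cache exists on each Spark
ls ~/.cache/nanochat
du -sh ~/.cache/nanochat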
1. Set up the networking environment
On the main Spark, run the following and note the MASTER_ADDR (you will need it on the worker Spark):
NODE_RANK=0
IB_IF=$(/usr/sbin/ibdev2netdev | awk '/(Up|ACTIVE)/{print $5; exit}')
MASTER_ADDR=$(ip -o -4 addr show dev "$IB_IF" | awk '{print $4}' | cut -d/ -f1)
cat > ib_env.export <<EOF
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=29500
export NODE_RANK=$NODE_RANK
export NCCL_SOCKET_IFNAME=$IB_IF
EOF
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=29500 NODE_RANK=$NODE_RANK IFACE=$IB_IF"
On my main Spark, this prints: MASTER_ADDR=169.254.69.64 MASTER_PORT=29500 NODE_RANK=0 IFACE=enp1s0f0np0
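If IFACE or MASTER_ADDR come out empty, inspect the IB-backed interface and its address assignment directly before proceeding:
# Troubleshooting: check that the IB-backed interface is Up and has an IPv4 address
/usr/sbin/ibdev2netdev
ip -o -4 addr show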
On the worker Spark, run the following after copying in the MASTER_ADDR from above (I have copied in mine below):
NODE_RANK=1
IB_IF=$(/usr/sbin/ibdev2netdev | awk '/(Up|ACTIVE)/{print $5; exit}')
MASTER_ADDR="169.254.69.64" # FIXME based on your main Spark's MASTER_ADDR
cat > ib_env.export <<EOF
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=29500
export NODE_RANK=$NODE_RANK
export NCCL_SOCKET_IFNAME=$IB_IF
EOF
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=29500 NODE_RANK=$NODE_RANK IFACE=$IB_IF"
On my other Spark, this prints: MASTER_ADDR=169.254.69.64 MASTER_PORT=29500 NODE_RANK=1 IFACE=enp1s0f0np0
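Before moving on, it is worth confirming the worker can actually reach the main Spark over the IB-backed interface (a minimal check using the variables you just exported):
# On the worker Spark: confirm connectivity to the main Spark over the IB link
source ib_env.export
ping -c 3 -I "$NCCL_SOCKET_IFNAME" "$MASTER_ADDR"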
2. Pull and start the Docker container
# run from within the nanochat directory on each node
docker pull nvcr.io/nvidia/pytorch:25.09-py3
docker run --gpus all -it --rm \
--ipc=host --network=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
--device=/dev/infiniband \
-v $HOME/.cache/nanochat:/root/.cache/nanochat \
-v ${PWD}:/workspace -w /workspace \
nvcr.io/nvidia/pytorch:25.09-py3
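Once the container is up, you can optionally confirm it sees both the GPU and the InfiniBand devices (ibv_devinfo is an assumption on my part; it may or may not be present in the image):
# Inside the container: the GPU and RDMA devices should both be visible
nvidia-smi
ls /dev/infiniband
ibv_devinfo | head   # skip this if the tool is not in the image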
3. Set up the Docker container and start training
Run the following inside the Docker container you just started on each Spark.
pip install -U pandas pyarrow wandb tokenizers tiktoken
source /workspace/ib_env.export
torchrun \
--nproc_per_node=1 \
--nnodes=2 \
--node_rank=${NODE_RANK} \
--master_addr="${MASTER_ADDR}" \
--master_port="${MASTER_PORT}" \
-m scripts.base_train -- \
--depth=20 \
--device_batch_size=32 \
--run=nanochat-2spark
Training will not begin until you have run this on both Sparks. On the worker Spark, you may temporarily see a warning that looks like [W1104 02:48:45.504862288 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3. It goes away as soon as the worker can reach the main Spark.
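If you want to confirm NCCL is actually going over the CX7 link rather than falling back to another interface, you can enable NCCL's logging before launching torchrun (purely diagnostic; it does not change training behavior):
# Optional: make NCCL print which transport/interface it selects at startup
export NCCL_DEBUG=INFO
# then launch the torchrun command above as usual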
On the main Spark, if everything has been set up correctly, you will see the following:
Vocab size: 65,536
num_layers: 20
model_dim: 1280
num_heads: 10
num_kv_heads: 10
Tokens / micro-batch / rank: 32 x 2048 = 65,536
Tokens / micro-batch: 131,072
Total batch size 524,288 => gradient accumulation steps: 4
Number of parameters: 560,988,160
Estimated FLOPs per token: 3.491758e+09
Calculated number of iterations from target data:param ratio: 21,400
Total number of training tokens: 11,219,763,200
Tokens : Params ratio: 20.00
Total training FLOPs estimate: 3.917670e+19
Scaling the LR for the AdamW parameters ∝1/√(1280/768) = 0.774597
Muon: Grouping 80 params of shape torch.Size([1280, 1280]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([1280, 5120]), device cuda:0, dtype torch.float32
Muon: Grouping 20 params of shape torch.Size([5120, 1280]), device cuda:0, dtype torch.float32
step 00000/21400 (0.00%) | loss: 11.090355 | lrm: 1.00 | dt: 28961.72ms | tok/sec: 18,102 | mfu: 3.20 | total time: 0.00m
step 00001/21400 (0.00%) | loss: 10.845189 | lrm: 1.00 | dt: 19925.67ms | tok/sec: 26,312 | mfu: 4.64 | total time: 0.00m
step 00002/21400 (0.01%) | loss: 10.234355 | lrm: 1.00 | dt: 19956.55ms | tok/sec: 26,271 | mfu: 4.64 | total time: 0.00m
...
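As a rough cross-check of the 5-day estimate: at roughly 20 seconds per step, the 21,400 steps work out to about 428,000 seconds, i.e. just under 5 days of training.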
You may notice that it takes a long time to see the first iteration. That's because of the call to evaluate_bpb at step 0. I fixed this by changing line 186 in scripts/base_train.py from if last_step or step % eval_every == 0: to if last_step or (step % eval_every == 0 and step > 0):, which prevents evaluate_bpb from being called at step 0.