mcarilli / gradient_accumulation.py
Last active June 30, 2023 12:21
Minimal example of gradient accumulation, allreducing only on step() iterations and interacting properly with torch.cuda.amp
# For single-node, run this script via
# python -m torch.distributed.launch --nproc_per_node=<ngpus this node> example.py
#
# For multinode, see https://pytorch.org/docs/stable/distributed.html#launch-utility
#
# Example showing native mixed precision tools
# (torch.cuda.amp.GradScaler and torch.cuda.amp.autocast)
# used along with native DistributedDataParallel to perform
# gradient accumulation with allreduces only when stepping.
#
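A minimal sketch of the pattern that gist describes (not mcarilli's full script), assuming a torchrun launch and using DDP's no_sync() to suppress the gradient allreduce on non-step iterations; the model, data, and batch sizes below are placeholders:

import contextlib
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main(num_iters=16, iters_to_accumulate=4):
    # Assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = torch.cuda.amp.GradScaler()
    loss_fn = torch.nn.CrossEntropyLoss()

    for i in range(num_iters):
        inputs = torch.randn(32, 128, device="cuda")          # stand-in data
        targets = torch.randint(0, 10, (32,), device="cuda")  # stand-in labels
        is_step_iter = (i + 1) % iters_to_accumulate == 0

        # On non-step iterations, model.no_sync() suppresses DDP's gradient
        # allreduce, so grads simply accumulate locally in param.grad.
        sync_ctx = contextlib.nullcontext() if is_step_iter else model.no_sync()
        with sync_ctx:
            with torch.cuda.amp.autocast():
                loss = loss_fn(model(inputs), targets) / iters_to_accumulate
            scaler.scale(loss).backward()

        if is_step_iter:
            # The backward above ran the single allreduce over the
            # accumulated grads; now unscale, step, and reset.
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Dividing each loss by iters_to_accumulate keeps the accumulated gradient equal to that of one large batch, and the lone allreduce on the step iteration then reduces the already-summed grads.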
mcarilli / nsight.sh
Last active April 9, 2024 08:28
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't meant to be run as a bash script; I named it ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
# torch.cuda.nvtx.range_pop()
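A minimal sketch of how such NVTX ranges are typically paired with an explicit capture range (placeholder model and region names; the nsys invocation in the top comment is one common form, not necessarily the gist's exact command):

# Run under, e.g.:
#   nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi -o timeline python train.py
# With --capture-range=cudaProfilerApi, nsys records only the span between
# torch.cuda.profiler.start() and torch.cuda.profiler.stop().
import torch

model = torch.nn.Linear(128, 10).cuda()      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for i in range(20):
    if i == 10:  # skip warmup iterations, then start capturing
        torch.cuda.profiler.start()

    torch.cuda.nvtx.range_push(f"iteration {i}")

    torch.cuda.nvtx.range_push("forward")
    out = model(torch.randn(32, 128, device="cuda"))
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    out.sum().backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close "iteration {i}"

torch.cuda.profiler.stop()

The named ranges then show up as labeled bands in the Nsight Systems timeline, making it easy to attribute CUDA kernels to forward, backward, or optimizer work.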