
Michael Carilli (mcarilli)

@mcarilli
mcarilli / gradient_accumulation.py
Last active June 30, 2023 12:21
Minimal example of gradient accumulation, allreducing only on step() iterations and interacting properly with torch.cuda.amp
# For single-node, run this script via
# python -m torch.distributed.launch --nproc_per_node=<ngpus this node> example.py
#
# For multinode, see https://pytorch.org/docs/stable/distributed.html#launch-utility
#
# Example showing native mixed precision tools
# (torch.cuda.amp.GradScaler and torch.cuda.amp.autocast)
# used along with native DistributedDataParallel to perform
# gradient accumulation with allreduces only when stepping.
#
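A minimal sketch of the pattern this gist describes, assuming a single node launched via torch.distributed.launch; the toy model, random data, and accum_steps below are illustrative stand-ins, not the gist's actual script:

import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for us.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda()
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # illustrative accumulation window

for i in range(32):  # random data stands in for a real dataloader
    input = torch.randn(16, 128, device="cuda")
    target = torch.randint(0, 10, (16,), device="cuda")
    step_iter = (i + 1) % accum_steps == 0

    # On accumulation-only iterations, run forward and backward under no_sync()
    # so DDP skips the gradient allreduce; grads accumulate locally in .grad.
    sync_ctx = contextlib.nullcontext() if step_iter else ddp_model.no_sync()
    with sync_ctx:
        with torch.cuda.amp.autocast():
            loss = loss_fn(ddp_model(input), target) / accum_steps
        scaler.scale(loss).backward()

    if step_iter:
        # This iteration's backward (above) allreduced the accumulated grads.
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()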
@mcarilli
mcarilli / nsight.sh
Last active June 25, 2024 12:46
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
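A short sketch of how the nvtx annotations mentioned above can be paired with nsys. The region names, toy model, and the nsys flags in the trailing comment are illustrative; the gist itself lists the author's actual preferred commands:

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for i in range(20):
    input = torch.randn(16, 128, device="cuda")
    target = torch.randint(0, 10, (16,), device="cuda")

    if i == 10:
        torch.cuda.profiler.start()  # only capture warmed-up iterations

    torch.cuda.nvtx.range_push("iteration {}".format(i))

    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(input), target)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer.step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close "iteration" range

torch.cuda.profiler.stop()

# The named ranges then show up on the Nsight Systems timeline when profiling
# with something like (flags illustrative, not necessarily the gist's exact set):
#   nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi -o my_profile python script.py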
@mcarilli
mcarilli / Closure_Handling.md
Last active January 18, 2023 03:21
Automatic mixed precision for Pytorch: supplementary information

Typical closure invocation (without gradient scaling) looks like

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    loss = optimizer.step(closure)
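One way the same closure might be adapted when torch.cuda.amp gradient scaling is in play, shown as a sketch rather than necessarily what the full gist recommends: run the closure once yourself to build scaled gradients, then let the scaler perform the step. This works for optimizers that do not need to re-evaluate the closure inside step(); the setup below is illustrative.

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
dataset = [(torch.randn(16, 128, device="cuda"),
            torch.randint(0, 10, (16,), device="cuda")) for _ in range(8)]

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()  # backward on the scaled loss
        return loss
    loss = closure()         # run the closure once ourselves...
    scaler.step(optimizer)   # ...then step via the scaler (skipped on overflow)
    scaler.update()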
@mcarilli
mcarilli / test cpp extensions build
Created January 29, 2019 18:25
Build output for python setup.py install in pytorch/test/cpp_extensions
(pytorch_master) cpp_extensions$ python setup.py install
running install
running bdist_egg
running egg_info
creating torch_test_cpp_extension.egg-info
writing torch_test_cpp_extension.egg-info/PKG-INFO
writing dependency_links to torch_test_cpp_extension.egg-info/dependency_links.txt
writing top-level names to torch_test_cpp_extension.egg-info/top_level.txt
writing manifest file 'torch_test_cpp_extension.egg-info/SOURCES.txt'
reading manifest file 'torch_test_cpp_extension.egg-info/SOURCES.txt'
@mcarilli
mcarilli / output.txt
Created January 18, 2019 17:12
python setup.py install --cuda_ext --cpp_ext
(pytorch_master) apex$ python setup.py install --cpp_ext --cuda_ext
torch.__version__ = 1.0.0a0+096ee84
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
@mcarilli
mcarilli / gist_flattening.md
Last active March 13, 2019 15:48
Example of flattening parameter groups in conjunction with Amp

This script uses the deprecated Amp API. If you made it here, don't use it as an example; I'm only keeping it around for my own reference.

This example is based on main_amp.py from the Apex imagenet amp examples. It demonstrates parameter flattening in conjunction with Amp, which can substantially improve performance for some networks.

Ctrl+f "For param flattening" in main_amp_replay.py below to see what was changed.

Vimdiffing main_amp_replay.py and the original main_amp.py from the Apex examples is also instructive.

@mcarilli
mcarilli / gist_replay.md
Last active July 31, 2020 00:00
Example of batch replay with Amp opt_level=O1 + dynamic gradient scaling

This example is based on main_amp.py from the Apex imagenet amp examples and can be used with the same example commands. It demonstrates batch replay (instead of batch skipping) with the dynamic gradient scaling used by Amp.

Batch replay requires a bit of user-side control flow, but is fairly straightforward.

Ctrl+f "added for batch replay" in main_amp_replay.py below to see what was changed. There should only be 5 instances, found entirely in this section.

Vimdiffing main_amp_replay.py and main_amp.py from the Apex example directory is also instructive. Again, there should be few differences.

See the "Batch replay" example in the Automatic Mixed Precision RFC for a preview of how I plan this will wor

@mcarilli
mcarilli / commands.md
Last active June 11, 2024 20:13
Single- and multiprocess profiling workflow with nvprof and NVVP (Nsight Systems coming soon...)

Ordinary launch commands (no profiling):

Single-process:

python main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/

Multi-process:

python -m torch.distributed.launch  --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/
python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 -b 32 --epochs=1 --workers 4 -p 10 --fp16 --prof 100 --deterministic ..

Profiling the same multi-process run with nvprof (--profile-from-start off defers capture until the script calls cudaProfilerStart, and --profile-child-processes -fo %p.nvprof writes one output file per worker process):

nvprof --profile-from-start off --profile-child-processes -fo %p.nvprof python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 -b 32 --epochs=1 --workers 4 -p 10 --fp16 --prof 100 --deterministic ..
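Because --profile-from-start off is used, nvprof only records between cudaProfilerStart and cudaProfilerStop calls made by the script itself. A minimal sketch of that bracketing; the real main.py drives it from its --prof argument, and the iteration numbers here are made up:

import torch

def run(model, optimizer, loss_fn, loader, prof_start=100, prof_stop=110):
    for i, (input, target) in enumerate(loader):
        if i == prof_start:
            torch.cuda.profiler.start()   # nvprof begins recording here
        torch.cuda.nvtx.range_push("iteration {}".format(i))
        loss = loss_fn(model(input.cuda()), target.cuda())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.nvtx.range_pop()
        if i == prof_stop:
            torch.cuda.profiler.stop()    # nvprof stops recording here
            break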