
Michael Carilli (mcarilli)

@mcarilli
mcarilli / gradient_accumulation.py
Last active June 30, 2023 12:21
Minimal example of gradient accumulation, allreducing only on step() iterations and interacting properly with torch.cuda.amp
# For single-node, run this script via
# python -m torch.distributed.launch --nproc_per_node=<ngpus this node> example.py
#
# For multinode, see https://pytorch.org/docs/stable/distributed.html#launch-utility
#
# Example showing native mixed precision tools
# (torch.cuda.amp.GradScaler and torch.cuda.amp.autocast)
# used along with native DistributedDataParallel to perform
# gradient accumulation with allreduces only when stepping.
#
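A minimal sketch of the pattern this gist describes, assuming a single node launched via torch.distributed.launch; the toy model, random data, and accum_steps below are illustrative stand-ins, not the gist's actual script:

import contextlib
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for us.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda()
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
accum_steps = 4  # illustrative accumulation window

for i in range(32):  # random data stands in for a real dataloader
    input = torch.randn(16, 128, device="cuda")
    target = torch.randint(0, 10, (16,), device="cuda")
    step_iter = (i + 1) % accum_steps == 0

    # On accumulation-only iterations, run forward and backward under no_sync()
    # so DDP skips the gradient allreduce; grads accumulate locally in .grad.
    sync_ctx = contextlib.nullcontext() if step_iter else ddp_model.no_sync()
    with sync_ctx:
        with torch.cuda.amp.autocast():
            loss = loss_fn(ddp_model(input), target) / accum_steps
        scaler.scale(loss).backward()

    if step_iter:
        # This iteration's backward (above) allreduced the accumulated grads.
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()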
@mcarilli
mcarilli / nsight.sh
Last active June 25, 2024 12:46
Favorite nsight systems profiling commands for Pytorch scripts
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.
# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html
# My preferred nsys (command line executable used to create profiles) commands
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
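A short sketch of how the nvtx annotations mentioned above can be paired with nsys. The region names, toy model, and the nsys flags in the trailing comment are illustrative; the gist itself lists the author's actual preferred commands:

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for i in range(20):
    input = torch.randn(16, 128, device="cuda")
    target = torch.randint(0, 10, (16,), device="cuda")

    if i == 10:
        torch.cuda.profiler.start()  # only capture warmed-up iterations

    torch.cuda.nvtx.range_push("iteration {}".format(i))

    torch.cuda.nvtx.range_push("forward")
    loss = loss_fn(model(input), target)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer.step")
    optimizer.step()
    optimizer.zero_grad()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # close "iteration" range

torch.cuda.profiler.stop()

# The named ranges then show up on the Nsight Systems timeline when profiling
# with something like (flags illustrative, not necessarily the gist's exact set):
#   nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi -o my_profile python script.py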
@mcarilli
mcarilli / Closure_Handling.md
Last active January 18, 2023 03:21
Automatic mixed precision for Pytorch: supplementary information

Typical closure invocation (without gradient scaling) looks like

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    loss = optimizer.step(closure)
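One way the same closure might be adapted when torch.cuda.amp gradient scaling is in play, shown as a sketch rather than necessarily what the full gist recommends: run the closure once yourself to build scaled gradients, then let the scaler perform the step. This works for optimizers that do not need to re-evaluate the closure inside step(); the setup below is illustrative.

import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()
dataset = [(torch.randn(16, 128, device="cuda"),
            torch.randint(0, 10, (16,), device="cuda")) for _ in range(8)]

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            output = model(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()  # backward on the scaled loss
        return loss
    loss = closure()         # run the closure once ourselves...
    scaler.step(optimizer)   # ...then step via the scaler (skipped on overflow)
    scaler.update()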
@mcarilli
mcarilli / test cpp extensions build
Created January 29, 2019 18:25
Build output for python setup.py install in pytorch/test/cpp_extensions
(pytorch_master) cpp_extensions$ python setup.py install
running install
running bdist_egg
running egg_info
creating torch_test_cpp_extension.egg-info
writing torch_test_cpp_extension.egg-info/PKG-INFO
writing dependency_links to torch_test_cpp_extension.egg-info/dependency_links.txt
writing top-level names to torch_test_cpp_extension.egg-info/top_level.txt
writing manifest file 'torch_test_cpp_extension.egg-info/SOURCES.txt'
reading manifest file 'torch_test_cpp_extension.egg-info/SOURCES.txt'
@mcarilli
mcarilli / output.txt
Created January 18, 2019 17:12
python setup.py install --cuda_ext --cpp_ext
(pytorch_master) apex$ python setup.py install --cpp_ext --cuda_ext
torch.__version__ = 1.0.0a0+096ee84
running install
running bdist_egg
running egg_info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
@mcarilli
mcarilli / gist_flattening.md
Last active March 13, 2019 15:48
Example of flattening parameter groups in conjunction with Amp

This script uses the deprecated Amp API. If you made it here, don't use it as an example; I'm only keeping it around for my own reference.

This example is based on main_amp.py from the Apex imagenet amp examples. It demonstrates parameter flattening in conjunction with Amp, which can substantially improve performance for some networks.

Ctrl+f "For param flattening" in main_amp_replay.py below to see what was changed.

Vimdiffing main_amp_replay.py and the original main_amp.py from the Apex examples is also instructive.

@mcarilli
mcarilli / gist_replay.md
Last active July 31, 2020 00:00
Example of batch replay with Amp opt_level=O1 + dynamic gradient scaling

This example is based on main_amp.py from the Apex imagenet amp examples and can be used with the same example commands. It demonstrates batch replay (instead of batch skipping) with the dynamic gradient scaling used by Amp.

Batch replay requires a bit of user-side control flow, but is fairly straightforward.

Ctrl+f "added for batch replay" in main_amp_replay.py below to see what was changed. There should only be 5 instances, found entirely in this section.

Vimdiffing main_amp_replay.py and main_amp.py from the Apex example directory is also instructive. Again, there should be few differences.

See the "Batch replay" example in the Automatic Mixed Precision RFC for a preview of how I plan this will wor

@mcarilli
mcarilli / commands.md
Last active June 11, 2024 20:13
Single- and multiprocess profiling workflow with nvprof and NVVP (Nsight Systems coming soon...)

Ordinary launch commands (no profiling):

Single-process:

python main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/

Multi-process:

python -m torch.distributed.launch  --nproc_per_node=2 main_amp.py -a resnet50 --b 224 --deterministic --workers 4 --opt-level O1 ./bare_metal_train_val/
python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 -b 32 --epochs=1 --workers 4 -p 10 --fp16 --prof 100 --deterministic ..

Profiling the same multi-process run with nvprof (--profile-from-start off defers capture until the script calls cudaProfilerStart, and --profile-child-processes -fo %p.nvprof writes one output file per worker process):

nvprof --profile-from-start off --profile-child-processes -fo %p.nvprof python -m torch.distributed.launch --nproc_per_node=2 main.py -a resnet50 -b 32 --epochs=1 --workers 4 -p 10 --fp16 --prof 100 --deterministic ..
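Because --profile-from-start off is used, nvprof only records between cudaProfilerStart and cudaProfilerStop calls made by the script itself. A minimal sketch of that bracketing; the real main.py drives it from its --prof argument, and the iteration numbers here are made up:

import torch

def run(model, optimizer, loss_fn, loader, prof_start=100, prof_stop=110):
    for i, (input, target) in enumerate(loader):
        if i == prof_start:
            torch.cuda.profiler.start()   # nvprof begins recording here
        torch.cuda.nvtx.range_push("iteration {}".format(i))
        loss = loss_fn(model(input.cuda()), target.cuda())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.nvtx.range_pop()
        if i == prof_stop:
            torch.cuda.profiler.stop()    # nvprof stops recording here
            break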