Syed Tousif Ahmed syed-ahmed

## gradient_accumulation.py
# For single-node, run this script via
# python -m torch.distributed.launch --nproc_per_node=<ngpus this node> example.py
#
# For multinode, see https://pytorch.org/docs/stable/distributed.html#launch-utility
#
# Example showing native mixed precision tools
# (torch.cuda.amp.GradScaler and torch.cuda.amp.autocast)
# used along with native DistributedDataParallel to perform
# gradient accumulation with allreduces only when stepping.
#
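The pattern described above can be sketched roughly as follows. This is a minimal illustration and not the gist's actual script: the function name `accumulate_and_step` and its parameters are hypothetical. It assumes `model` is wrapped in `DistributedDataParallel` (which provides `no_sync()`), `scaler` is a `torch.cuda.amp.GradScaler`, and `autocast` is passed as `torch.cuda.amp.autocast`. The objects are duck-typed so the control flow can be read without a GPU.

```python
import contextlib


def accumulate_and_step(model, optimizer, scaler, loss_fn, batches,
                        accum_steps, autocast=contextlib.nullcontext):
    """Accumulate gradients over `accum_steps` micro-batches per optimizer step.

    Hypothetical helper. Intended for a DistributedDataParallel `model`,
    a torch.cuda.amp.GradScaler `scaler`, and autocast=torch.cuda.amp.autocast.
    Returns the number of optimizer steps taken.
    """
    steps_taken = 0
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches, start=1):
        stepping = (i % accum_steps == 0)
        # DDP's no_sync() suppresses the gradient allreduce during backward().
        # On the stepping micro-batch a no-op context is used instead, so the
        # allreduce runs once on the locally accumulated gradients.
        sync_ctx = contextlib.nullcontext() if stepping else model.no_sync()
        with sync_ctx:
            # The forward pass stays inside the context as well.
            with autocast():
                # Divide so the accumulated gradient averages the micro-batches.
                loss = loss_fn(model(inputs), targets) / accum_steps
            # Scale the loss before backward to avoid fp16 gradient underflow.
            scaler.scale(loss).backward()
        if stepping:
            scaler.step(optimizer)   # unscales grads, skips the step on inf/nan
            scaler.update()          # adjusts the scale factor for the next step
            optimizer.zero_grad()
            steps_taken += 1
    return steps_taken
```

With `accum_steps=4` and 8 micro-batches, this takes 2 optimizer steps and enters `no_sync()` on the other 6 iterations. Any leftover micro-batches that do not fill a complete group are accumulated but never stepped in this sketch.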

## nsight.sh
# This isn't supposed to run as a bash script; I named it with ".sh" for syntax highlighting.

# https://developer.nvidia.com/nsight-systems
# https://docs.nvidia.com/nsight-systems/profiling/index.html

# My preferred nsys commands (nsys is the command-line executable used to create profiles)
#
# In your script, write
# torch.cuda.nvtx.range_push("region name")
# ...
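One way to keep each `range_push` paired with a matching `torch.cuda.nvtx.range_pop()` is a small context manager. The sketch below is an illustration, not part of the original file: the helper name `nvtx_range` and its `nvtx` parameter are hypothetical, and by default it assumes PyTorch is installed and uses `torch.cuda.nvtx`.

```python
import contextlib


@contextlib.contextmanager
def nvtx_range(name, nvtx=None):
    """Mark a named NVTX region that shows up on the Nsight Systems timeline.

    Hypothetical helper. `nvtx` defaults to torch.cuda.nvtx; any object with
    range_push(name) and range_pop() methods can be substituted.
    """
    if nvtx is None:
        import torch  # imported lazily so the helper itself has no hard dependency
        nvtx = torch.cuda.nvtx
    nvtx.range_push(name)
    try:
        yield
    finally:
        # Always pop, even if the wrapped code raises, so ranges stay balanced.
        nvtx.range_pop()
```

In a training script this would be used as `with nvtx_range("forward"): output = model(inputs)`, and regions can be nested.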