HudsonHuang/Pytorch performance guide.md

## Pytorch performance guide.md

      
Using CUDA the right way:


Deterministic convolution (fix the seed of every operation, e.g. seed=0, so runs can be reproduced; this makes training slower):
torch.backends.cudnn.deterministic
https://oldpan.me/archives/pytorch-conmon-problem-in-training
torch.cuda.get_device_name and torch.cuda.get_device_capability were added to report device information. Example:


torch.cuda.get_device_name(0)
'Quadro GP100'
torch.cuda.get_device_capability(0)
(6, 0)
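If torch.backends.cudnn.deterministic = True is set, cuDNN convolutions use deterministic algorithms.
torch.cuda.get_rng_state_all and torch.cuda.set_rng_state_all were introduced to let you save/load the state of the random number generator on all GPUs at once.
torch.cuda.empty_cache() releases the cached memory blocks held by PyTorch's caching allocator. This is useful for long-running IPython notebooks when the GPU is shared with other processes.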
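
A minimal sketch of the calls listed above, assuming a recent PyTorch in which the RNG helpers are spelled torch.cuda.get_rng_state_all / torch.cuda.set_rng_state_all and the cache call is torch.cuda.empty_cache(). The helper names describe_devices, snapshot_rng and restore_rng are made up for illustration, and every CUDA call is guarded so the snippet also runs on a CPU-only machine.

```python
import torch


def describe_devices():
    """Return (name, capability) per visible CUDA device; empty list without CUDA."""
    if not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


def snapshot_rng():
    """Capture the CPU RNG state and, if present, the RNG state of every GPU."""
    cpu_state = torch.get_rng_state()
    cuda_states = torch.cuda.get_rng_state_all() if torch.cuda.is_available() else []
    return cpu_state, cuda_states


def restore_rng(snapshot):
    """Restore a snapshot produced by snapshot_rng()."""
    cpu_state, cuda_states = snapshot
    torch.set_rng_state(cpu_state)
    if cuda_states:
        torch.cuda.set_rng_state_all(cuda_states)


print(describe_devices())

snap = snapshot_rng()
first = torch.rand(4)
restore_rng(snap)
second = torch.rand(4)  # same values as `first`, because the state was restored

# Hand cached but unused blocks back to the driver (useful on a shared GPU).
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```

Note that empty_cache() only returns memory that PyTorch has cached but is not currently using; tensors that are still referenced stay allocated.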


https://www.zhihu.com/question/67209417/answer/303290223
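My basic requirement when training a model is that it is deterministic/reproducible: with a fixed random seed, every training run should produce the same model. I have run into non-reproducible results twice. The first time was while training a CNN, when the results differed in the last few decimal places from run to run. The more epochs, the larger the discrepancy. The results were roughly the same, but my perfectionism could not tolerate it. I later found that this was fixed in 0.3.0: you can set torch.backends.cudnn.deterministic = True
so that the cuDNN convolution operations give the same result on every call.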
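
The reproducibility setup described above can be sketched as follows. The helper name seed_everything is hypothetical, and seeding Python's random module is an extra assumption beyond what the note mentions; the snippet only demonstrates that two identically seeded runs draw the same numbers.

```python
import random

import torch


def seed_everything(seed: int = 0) -> None:
    """Fix the seeds used by Python and PyTorch and ask cuDNN for deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Deterministic cuDNN convolutions, as described in the note (may be slower).
    torch.backends.cudnn.deterministic = True


seed_everything(0)
a = torch.rand(3)

seed_everything(0)
b = torch.rand(3)

print(torch.equal(a, b))  # True: both draws start from the same seed
```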


torch.backends.cudnn.benchmark = True
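Enabling benchmark turns on CUDNN_FIND, which automatically searches for the fastest implementation of each operation. It improves performance when the computation graph does not change (the input shape is the same on every iteration and the model is not modified); otherwise it reduces performance.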
pytorch/pytorch#3265 (comment)
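
A small sketch of the intended usage, assuming a model whose input shape never changes. The layer and tensor sizes are arbitrary placeholders; on a CPU-only machine the flag is simply ignored.

```python
import torch

# Let cuDNN time its convolution algorithms on first use and reuse the fastest one.
# Only worthwhile when every batch has the same shape.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)

shapes = []
with torch.no_grad():
    for _ in range(3):
        batch = torch.randn(4, 3, 32, 32, device=device)  # fixed input shape
        shapes.append(tuple(conv(batch).shape))

print(shapes)
```

If batch shapes vary from step to step, the search is repeated for each new shape, which is where the slowdown mentioned above comes from.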


More
https://www.ptorch.com/news/94.html


Optimization tips
https://www.sagivtech.com/2017/09/19/optimizing-pytorch-training-code/


pin_memory + non_blocking async GPU training


https://github.com/pytorch/examples/blob/master/imagenet/main.py#L95


https://pytorch.org/docs/stable/notes/cuda.html?highlight=non_blocking
non_blocking has to be set when moving the training data to the GPU; DataParallel in version 0.4.0 automatically tries to use async GPU training.
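
A minimal sketch of the pin_memory + non_blocking pattern, using a synthetic TensorDataset as a stand-in for real training data. Pinned memory is only requested when CUDA is available, so the snippet also runs on CPU (where non_blocking has no effect).

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Synthetic stand-in for a real dataset: 64 small "images" with integer labels.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# pin_memory=True makes the loader return batches in page-locked host memory,
# which is what allows the host-to-device copy below to be asynchronous.
loader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=use_cuda)

seen = 0
for images, targets in loader:
    images = images.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    seen += images.size(0)

print(seen)  # 64
```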


Using Variable:


Variable() volatile=True
Does the current version already use Variable by default? Print the data yielded by enumerate to check.
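
One way to check this, assuming PyTorch 0.4.0 or later: in those versions Tensor and Variable are merged, so plain tensors track gradients themselves, and the old volatile=True flag is replaced by the torch.no_grad() context manager. The tiny model below is a placeholder.

```python
import torch

model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# Replacement for Variable(x, volatile=True): no graph is recorded inside the block.
with torch.no_grad():
    y = model(x)

print(type(x).__name__, y.requires_grad)  # Tensor False
```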


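Use DistributedDataParallel instead of DataParallel
A new function, model = torch.nn.parallel.DistributedDataParallel(model), is introduced here to support distributed mode.
It differs from the model = torch.nn.DataParallel(model, device_ids=[0,1,2,3]).cuda() call used previously with multiprocessing, which only implements multi-GPU training on a single machine. According to the official documentation, the new function outperforms the old one even in single-machine multi-GPU mode.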
import torch.distributed as dist
import torch.utils.data.distributed
parser.add_argument('--dist-url', default='tcp://172.27.149.6:7777', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='gloo', type=str,
                    help='distributed backend')
dist.init_process_group(backend='gloo', init_method='tcp://172.27.149.6:7777',
                        world_size=args.world_size)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
train_loader = torch.utils.data.DataLoader(train_dataset, sampler=train_sampler)
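
The pieces above can be sketched as two helpers. This is an illustration under assumptions, not the code of the linked example: build_loader and wrap_model are hypothetical names, and rank is passed explicitly. wrap_model is only defined here, because it must be called once in each participating process; the runnable part just shows how DistributedSampler splits a dataset between two ranks.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def build_loader(dataset, rank, world_size, batch_size=4):
    """Give each process its own disjoint shard of the dataset."""
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


def wrap_model(model, rank, world_size, init_method):
    """Join the process group, then wrap the model. Call once in every process."""
    dist.init_process_group(backend="gloo", init_method=init_method,
                            rank=rank, world_size=world_size)
    return torch.nn.parallel.DistributedDataParallel(model)


# No process group is needed when num_replicas and rank are given explicitly.
dataset = TensorDataset(torch.arange(8))
shards = [
    list(DistributedSampler(dataset, num_replicas=2, rank=r, shuffle=False))
    for r in range(2)
]
print(shards)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

When shuffling is enabled, calling sampler.set_epoch(epoch) at the start of each epoch makes every rank use the same, epoch-dependent permutation.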


Use an optimizer that is faster than Adam
SGD with Momentum
AdamW or Adam with correct weight decay (not yet released)
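
A small sketch of how the two options could be constructed, assuming a PyTorch version recent enough to ship torch.optim.AdamW. The helper make_optimizer, the learning rates and the toy regression problem are illustrative choices only and say nothing about which optimizer is faster on a real model.

```python
import torch


def make_optimizer(name, params):
    """Build one of the optimizers named in the note (hyperparameters are placeholders)."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.05, momentum=0.9)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=1e-3, weight_decay=1e-2)
    raise ValueError(f"unknown optimizer: {name}")


torch.manual_seed(0)
x = torch.randn(64, 3)
y = x @ torch.tensor([[1.0], [-2.0], [0.5]])  # noiseless linear target

model = torch.nn.Linear(3, 1)
optimizer = make_optimizer("sgd", model.parameters())

losses = []
for _ in range(200):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```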


Nvidia Apex: 16-bit floating point extension
https://nvidia.github.io/apex/index.html
Requires:
CUDA 9
Python 3
PyTorch 0.4.0


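Set num_workers=0 in the DataLoader
In my own experience with single-machine multi-GPU mode, the larger this parameter is, the less fully the GPUs are utilized and the slower training gets.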
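
Since the best value depends on the machine and the dataset, it may be worth measuring rather than guessing. Below is a rough sketch with a hypothetical helper time_epoch and a synthetic dataset; it only times data loading, not training, and only the num_workers=0 case is run here.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_epoch(dataset, num_workers, batch_size=32):
    """Iterate over the dataset once and return (number of batches, elapsed seconds)."""
    loader = DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
    start = time.perf_counter()
    batches = sum(1 for _ in loader)
    return batches, time.perf_counter() - start


dataset = TensorDataset(torch.randn(256, 3, 8, 8))
batches, seconds = time_epoch(dataset, num_workers=0)
print(f"num_workers=0: {batches} batches in {seconds:.4f}s")
```

To compare settings, call time_epoch with other values (for example 2 or 4) on the real dataset. On platforms that start workers with spawn, do this from inside an `if __name__ == "__main__":` guard.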