@jeongukjae
Created July 15, 2020 07:50
Check gradient clipping using model.parameters() vs apex.amp.master_params(optimizer)
import torch
import apex

print("initialize model")
# Three identical linear layers; model_1 and model_2 start from model_3's weights.
# (Biases are not copied, so the three losses below differ slightly.)
model_1 = torch.nn.Linear(1000, 2000).cuda()
model_2 = torch.nn.Linear(1000, 2000).cuda()
model_3 = torch.nn.Linear(1000, 2000).cuda()
model_1.weight = torch.nn.Parameter(model_3.weight.clone())
model_2.weight = torch.nn.Parameter(model_3.weight.clone())

optimizer_1 = torch.optim.SGD(model_1.parameters(), lr=1e2)
optimizer_2 = torch.optim.SGD(model_2.parameters(), lr=1e2)
optimizer_3 = torch.optim.SGD(model_3.parameters(), lr=1e2)
optimizer_1.load_state_dict(optimizer_3.state_dict())
optimizer_2.load_state_dict(optimizer_3.state_dict())

criterion = torch.nn.CrossEntropyLoss()
random_input = torch.rand(500, 1000).cuda()
target_input = torch.empty(500, dtype=torch.long).random_(2000).cuda()

# Wrap each model/optimizer pair with apex amp at opt_level O1.
amp_model1, amp_optimizer1 = apex.amp.initialize(model_1, optimizer_1, opt_level="O1")
amp_model2, amp_optimizer2 = apex.amp.initialize(model_2, optimizer_2, opt_level="O1")
amp_model3, amp_optimizer3 = apex.amp.initialize(model_3, optimizer_3, opt_level="O1")

# Weight diffs before training: both should be all zeros.
print(amp_model1.weight - amp_model2.weight, torch.mean(torch.abs(amp_model1.weight - amp_model2.weight)))
print(amp_model3.weight - amp_model2.weight, torch.mean(torch.abs(amp_model3.weight - amp_model2.weight)))

# One training step per model:
#   model 1: clip gradients via model.parameters()
#   model 2: clip gradients via apex.amp.master_params(optimizer)
#   model 3: no clipping
for model, optimizer in [(amp_model1, amp_optimizer1), (amp_model2, amp_optimizer2), (amp_model3, amp_optimizer3)]:
    optimizer.zero_grad()
    output = model(random_input)
    loss = criterion(output, target_input)
    with apex.amp.scale_loss(loss, optimizer) as scaled_loss:
        print("Scaled_Loss:", scaled_loss)
        scaled_loss.backward()
    print("Loss: ", loss)
    if model is amp_model1:
        print("Clip using torch")
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.02)
    elif model is amp_model2:
        print("Clip using apex master params")
        torch.nn.utils.clip_grad_norm_(apex.amp.master_params(optimizer), 0.02)
    else:
        print("Dont clip")
    optimizer.step()
    print("Step Optimizer")

# Weight and gradient diffs after one step:
# model 1 vs model 2 should be (nearly) identical, model 3 should diverge.
print(amp_model1.weight - amp_model2.weight, torch.mean(torch.abs(amp_model1.weight - amp_model2.weight)))
print(amp_model3.weight - amp_model2.weight, torch.mean(torch.abs(amp_model3.weight - amp_model2.weight)))
print(amp_model1.weight.grad - amp_model2.weight.grad, torch.mean(torch.abs(amp_model1.weight.grad - amp_model2.weight.grad)))
print(amp_model3.weight.grad - amp_model2.weight.grad, torch.mean(torch.abs(amp_model3.weight.grad - amp_model2.weight.grad)))
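
Both clipping calls should touch the same tensors at opt_level O1, since O1 keeps the model parameters in FP32 and the optimizer holds them directly. A minimal check along those lines, assuming apex.amp.master_params simply walks the optimizer's param groups at O1 (this check is an addition, not part of the script above):

# Sanity check (assumption: at O1 there are no separate FP32 master copies, so
# master_params(optimizer) yields the model's own parameters).
model_param_ids = {id(p) for p in amp_model1.parameters()}
same_tensors = all(id(p) in model_param_ids for p in apex.amp.master_params(amp_optimizer1))
print("master_params are the model's own parameters:", same_tensors)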
@jeongukjae (Author):
output:

$ python amp-clipping.py
initialize model
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0',
       grad_fn=<SubBackward0>) tensor(0., device='cuda:0', grad_fn=<MeanBackward0>)
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0',
       grad_fn=<SubBackward0>) tensor(0., device='cuda:0', grad_fn=<MeanBackward0>)
Scaled_Loss: tensor(501821.8125, device='cuda:0', grad_fn=<MulBackward0>)
Loss:  tensor(7.6572, device='cuda:0', grad_fn=<NllLossBackward>)
Clip using torch
Step Optimizer
Scaled_Loss: tensor(501945.5625, device='cuda:0', grad_fn=<MulBackward0>)
Loss:  tensor(7.6591, device='cuda:0', grad_fn=<NllLossBackward>)
Clip using apex master params
Step Optimizer
Scaled_Loss: tensor(501877.0938, device='cuda:0', grad_fn=<MulBackward0>)
Loss:  tensor(7.6580, device='cuda:0', grad_fn=<NllLossBackward>)
Dont clip
Step Optimizer
tensor([[ 9.0338e-07,  1.4622e-06,  1.4771e-06,  ...,  8.8476e-07,
          1.4789e-06,  1.4603e-06],
        [ 1.2871e-06,  1.2722e-06,  1.2815e-06,  ...,  1.2750e-06,
          1.2815e-06,  1.2703e-06],
        [ 1.0019e-05,  1.0877e-05,  1.0302e-05,  ...,  1.0299e-05,
          1.0887e-05,  1.0875e-05],
        ...,
        [-5.1446e-06, -5.4576e-06, -5.1558e-06,  ..., -5.4520e-06,
         -5.4445e-06, -5.4594e-06],
        [-5.4268e-06, -5.2461e-06, -5.9819e-06,  ..., -3.6880e-06,
         -4.9612e-06, -5.4148e-06],
        [-3.5234e-05, -3.8170e-05, -3.6408e-05,  ..., -3.7581e-05,
         -3.6405e-05, -3.8756e-05]], device='cuda:0', grad_fn=<SubBackward0>) tensor(1.2495e-05, device='cuda:0', grad_fn=<MeanBackward0>)
tensor([[-0.0316, -0.0347, -0.0327,  ..., -0.0339, -0.0325, -0.0347],
        [-0.0207, -0.0225, -0.0213,  ..., -0.0220, -0.0213, -0.0227],
        [-0.0240, -0.0263, -0.0250,  ..., -0.0255, -0.0249, -0.0265],
        ...,
        [-0.0219, -0.0243, -0.0231,  ..., -0.0236, -0.0227, -0.0245],
        [ 0.0511,  0.0731,  0.1258,  ...,  0.1205,  0.0367,  0.0170],
        [-0.0257, -0.0278, -0.0264,  ..., -0.0271, -0.0261, -0.0279]],
       device='cuda:0', grad_fn=<SubBackward0>) tensor(0.0385, device='cuda:0', grad_fn=<MeanBackward0>)
tensor([[-9.0349e-09, -1.4618e-08, -1.4780e-08,  ..., -8.8376e-09,
         -1.4799e-08, -1.4612e-08],
        [-1.2863e-08, -1.2712e-08, -1.2819e-08,  ..., -1.2755e-08,
         -1.2813e-08, -1.2698e-08],
        [-1.0019e-07, -1.0877e-07, -1.0303e-07,  ..., -1.0299e-07,
         -1.0888e-07, -1.0874e-07],
        ...,
        [ 5.1450e-08,  5.4573e-08,  5.1553e-08,  ...,  5.4520e-08,
          5.4443e-08,  5.4590e-08],
        [ 5.4262e-08,  5.2465e-08,  5.9816e-08,  ...,  3.6902e-08,
          4.9613e-08,  5.4148e-08],
        [ 3.5234e-07,  3.8170e-07,  3.6407e-07,  ...,  3.7581e-07,
          3.6406e-07,  3.8755e-07]], device='cuda:0') tensor(1.2495e-07, device='cuda:0')
tensor([[ 0.0003,  0.0003,  0.0003,  ...,  0.0003,  0.0003,  0.0003],
        [ 0.0002,  0.0002,  0.0002,  ...,  0.0002,  0.0002,  0.0002],
        [ 0.0002,  0.0003,  0.0002,  ...,  0.0003,  0.0002,  0.0003],
        ...,
        [ 0.0002,  0.0002,  0.0002,  ...,  0.0002,  0.0002,  0.0002],
        [-0.0005, -0.0007, -0.0013,  ..., -0.0012, -0.0004, -0.0002],
        [ 0.0003,  0.0003,  0.0003,  ...,  0.0003,  0.0003,  0.0003]],
       device='cuda:0') tensor(0.0004, device='cuda:0')
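
After one step, the two clipped models agree to within about 1e-05 mean absolute weight difference, while the unclipped model drifts by about 0.04, so at opt_level O1 clipping model.parameters() and clipping apex.amp.master_params(optimizer) behave essentially the same. For reference, the gradient-clipping pattern shown in the apex documentation clips the master params right after the scaled backward pass; a minimal sketch (max_norm is a placeholder for whatever threshold you use):

with apex.amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Clip the gradients the optimizer will actually step on (the master params).
torch.nn.utils.clip_grad_norm_(apex.amp.master_params(optimizer), max_norm)
optimizer.step()

The distinction should matter more at O2, where the master params are separate FP32 copies of the FP16 model weights; there, clipping model.parameters() would act on the FP16 gradients rather than the FP32 gradients used for the update, so clipping master_params(optimizer) is the safer habit at any opt_level.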
