Last active Sep 11, 2021
```python
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()                        # ...have no gradients accumulated
```

### muaz-git commented Mar 27, 2019

Thanks for the code. Can you please explain what is meant by "Normalize our loss (if averaged)"?

### thomwolf commented May 4, 2019

If you are using a loss that is averaged over the training samples (which is the case most of the time), you have to divide the loss by the number of gradient accumulation steps.
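A minimal sketch of why the division matters, assuming a toy `torch.nn.Linear` model, random data, and a mean-reduced `MSELoss` (none of these come from the gist; they are only for illustration). With equally sized micro-batches, dividing each averaged loss by `accumulation_steps` should reproduce the full-batch gradient, while skipping the division should give a gradient `accumulation_steps` times larger.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup (not from the gist): linear model, mean-reduced loss.
model = torch.nn.Linear(4, 1).double()
loss_function = torch.nn.MSELoss()          # reduction="mean" is the default
inputs = torch.randn(32, 4, dtype=torch.float64)
labels = torch.randn(32, 1, dtype=torch.float64)
accumulation_steps = 4                      # 4 micro-batches of 8 samples


def micro_batches():
    """Yield the 32 samples as equally sized (inputs, labels) chunks."""
    return zip(inputs.chunk(accumulation_steps), labels.chunk(accumulation_steps))


# Reference: gradient of the averaged loss over the full batch of 32.
model.zero_grad()
loss_function(model(inputs), labels).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation with normalization, as in the gist.
model.zero_grad()
for x, y in micro_batches():
    loss = loss_function(model(x), y) / accumulation_steps
    loss.backward()                         # adds into model.weight.grad
accumulated_grad = model.weight.grad.clone()

# Accumulation without normalization, for comparison.
model.zero_grad()
for x, y in micro_batches():
    loss_function(model(x), y).backward()
unnormalized_grad = model.weight.grad.clone()

print(torch.allclose(accumulated_grad, full_batch_grad))
print(torch.allclose(unnormalized_grad, accumulation_steps * full_batch_grad))
```

If the loss were summed instead of averaged (for example `reduction="sum"`), the micro-batch gradients would already add up to the full-batch gradient, which is what the "(if averaged)" remark in the snippet refers to.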

### Auth0rM0rgan commented May 19, 2019 • edited

Hey @thomwolf, thanks for the tips and tricks. Just one question: shouldn't we call `optimizer.zero_grad()` before `loss.backward()`?

### thomwolf commented May 19, 2019

@Auth0rM0rgan No, otherwise you erase all the gradients accumulated in the leaves of the computation graph.
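A small sketch of this behavior, assuming a single hypothetical leaf tensor `w` and an SGD optimizer chosen only for illustration. Each `backward()` call adds into `w.grad`, and `optimizer.zero_grad()` discards whatever has been accumulated so far.

```python
import torch

# Hypothetical leaf tensor standing in for a model parameter.
w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass: d/dw of sum(3 * w) is 3 for every element.
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: gradients are added, not replaced.
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Resetting before the next backward pass throws the accumulated values away.
optimizer.zero_grad()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first.tolist(), accumulated.tolist(), after_reset.tolist())
```

Calling `zero_grad()` before every `loss.backward()` would therefore leave only the last micro-batch's gradient at the time of `optimizer.step()`.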

### jaideep11061982 commented Jul 14, 2019

Hi Thom. Suppose we want to simulate a larger batch: because of GPU limitations I can only keep the batch size at 64, but I want to process 128 images per update. In that case, shouldn't we do it this way?

```
total_loss = 0
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        total_loss = (total_loss + loss) / accumulation_steps   # Normalize our loss (if averaged)
        loss.backward()                             # Backward pass
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
        total_loss = 0
        if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
            evaluate_model()
    else:
        total_loss = loss + total_loss
```

### jshin49 commented Nov 27, 2019

 Would gradient accumulation work for MAML training?

### meet-minimalist commented Jul 17, 2020

Hey, thanks for the code. I want to know how you handled batch normalization with gradient accumulation.

For example, if we use a sub-batch size of 8 and run 4 forward passes, accumulating gradients before each optimizer step, the effective batch size is 32. The problem with this setting is that the batch normalization layer computes the batch mean and batch variance over 8 samples at each of the 4 forward passes, which is not the same as computing them over a batch of 32. So this no longer simulates training with a batch size of 32. Please elaborate on how you handled this, or how it should be handled. :(

To verify this, train one model with a batch size of 32 and another with the 8-batch x 4-iteration accumulation strategy, using identical weights and data feeding in both runs. You will definitely see a difference in the loss.
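The mismatch described above can be illustrated with a short sketch. It assumes a standalone `torch.nn.BatchNorm1d` layer in training mode and random data (both hypothetical, chosen only to isolate the effect): normalizing one batch of 32 is compared with normalizing the same samples as 4 sub-batches of 8.

```python
import torch

torch.manual_seed(0)

# Hypothetical standalone layer; track_running_stats=False so that only the
# statistics of the current batch are used and no buffers are updated.
bn = torch.nn.BatchNorm1d(3, track_running_stats=False).double()
bn.train()

batch = torch.randn(32, 3, dtype=torch.float64)

# One forward pass: mean and variance are computed over all 32 samples.
full = bn(batch)

# Four forward passes: mean and variance are computed over 8 samples each.
chunked = torch.cat([bn(chunk) for chunk in batch.chunk(4)])

# The per-chunk statistics differ from the full-batch statistics,
# so the normalized activations differ as well.
max_diff = (full - chunked).abs().max().item()
print(max_diff)
```

Since the activations differ, the gradients flowing through the layer differ too, so gradient accumulation matches large-batch training exactly only for models without batch-dependent layers.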

### aGIToz commented Sep 4, 2020

This still seems to be an open issue; I have not found any definitive answer to it. @meet-minimalist did you find a solution for handling BN with gradient accumulation?

### meet-minimalist commented Sep 5, 2020

@aGIToz Sorry, I haven't found an exact solution so far. But I know TensorFlow uses something like SyncBatchNorm across the 8 individual TPUs when training on a TPU cluster. I don't know how they do it, but they must have developed some workaround for this problem.

### CrystalWong2 commented Oct 29, 2020

I rewrote the code in my project, but I found that the loss increases! If I train in the normal way, the loss decreases. I have no idea why.

### VpkPrasanna commented Mar 9, 2021

How do I find the gradient accumulation value? Based on what should we fix the value?

### vaneshieh commented Jul 17, 2021

> Hi Thom. Suppose we want to simulate a larger batch: because of GPU limitations I can only keep the batch size at 64, but I want to process 128 images per update. In that case, shouldn't we do it this way?
>
> ```
> total_loss = 0
> for i, (inputs, labels) in enumerate(training_set):
>     predictions = model(inputs)                     # Forward pass
>     loss = loss_function(predictions, labels)       # Compute loss function
>     if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
>         total_loss = (total_loss + loss) / accumulation_steps   # Normalize our loss (if averaged)
>         loss.backward()                             # Backward pass
>         optimizer.step()                            # Now we can do an optimizer step
>         model.zero_grad()                           # Reset gradients tensors
>         total_loss = 0
>         if (i+1) % evaluation_steps == 0:           # Evaluate the model when we...
>             evaluate_model()
>     else:
>         total_loss = loss + total_loss
> ```

Hi, I'm using a similar idea. Do you have any further suggestions? Thanks