@gautierdag
Last active September 25, 2023 02:27
pytorch-lightning Transformer (Noam) LR policy
import torch
import pytorch_lightning as pl


class MyTransformer(pl.LightningModule):
    def __init__(
        self,
        learning_rate=0.001,
        warmup=4000,
    ):
        super().__init__()
        self.learning_rate = learning_rate
        self.warmup = warmup

    # This is an example config showing how to do the Noam
    # ("Attention Is All You Need") lr policy with pytorch lightning.
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)

        def warm_decay(step):
            # Linear warmup for `warmup` steps, then inverse square-root decay.
            if step < self.warmup:
                return step / self.warmup
            return self.warmup ** 0.5 * step ** -0.5

        scheduler = {
            "scheduler": torch.optim.lr_scheduler.LambdaLR(optimizer, warm_decay),
            "interval": "step",  # runs per batch rather than per epoch
            "frequency": 1,
            # "name": "learning_rate",  # uncomment if using LearningRateMonitor
        }
        return [optimizer], [scheduler]
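
For reference, here is a minimal usage sketch (not part of the gist) showing how the optional "name" key pairs with Lightning's LearningRateMonitor callback so the per-step learning rate shows up in your logger. train_loader and the rest of the LightningModule (forward, training_step, ...) are assumed to exist elsewhere:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor

model = MyTransformer(learning_rate=0.001, warmup=4000)
# Logs the scheduler value every step; uses the scheduler's "name" if it is set.
lr_monitor = LearningRateMonitor(logging_interval="step")
trainer = pl.Trainer(max_steps=100_000, callbacks=[lr_monitor])
trainer.fit(model, train_loader)  # train_loader: your own DataLoader (assumed)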
@toto6038

Appreciated. But the snippet seems to ignore the hidden size of the Transformer.

@gautierdag (Author)

gautierdag commented Sep 22, 2023

Appreciated. But the snippet seems to ignore the hidden size of the Transformer.

This is a minimal code snippet written specifically to show the configure_optimizers function for the Noam policy; it's not a runnable solution.

If you want to pass a specific transformer, I'm sure you can figure out how to modify the class init and set self.model to the transformer of your choice.

Any PyTorch Lightning module would work; all of the module's parameters are passed to the optimiser in:

optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
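
For illustration, a minimal sketch of what wrapping a specific transformer could look like, assuming a plain torch.nn.Transformer (the d_model, nhead, and layer counts are arbitrary placeholders, not anything prescribed by the gist):

import torch.nn as nn
import pytorch_lightning as pl


class MyTransformer(pl.LightningModule):
    def __init__(self, learning_rate=0.001, warmup=4000, d_model=512):
        super().__init__()
        self.learning_rate = learning_rate
        self.warmup = warmup
        # Registering the transformer as a submodule means self.parameters()
        # (used in configure_optimizers above) includes its weights automatically.
        self.model = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=6, num_decoder_layers=6
        )

    def forward(self, src, tgt):
        return self.model(src, tgt)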

@toto6038

toto6038 commented Sep 24, 2023

Please refer to espnet's implementation. It mentions that the Noam scheduler used in Attention Is All You Need takes a parameter model_size to calculate the learning rate, which is the hidden size I mentioned in my previous comment.
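
For completeness, a sketch of the full Noam factor from "Attention Is All You Need", lrate = model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5), where model_size is the hidden size being discussed (512 and 4000 below are just the paper's defaults). Note that the gist's warm_decay traces the same curve; choosing the base learning_rate as (model_size * warmup) ** -0.5 reproduces the exact formula:

def noam_lr(step, model_size=512, warmup=4000, factor=1.0):
    # lrate = factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    step = max(step, 1)  # guard against step == 0
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Equivalence with the gist's schedule once the base lr is folded in:
#   step >= warmup: (model_size * warmup) ** -0.5 * warmup ** 0.5 * step ** -0.5
#                     == model_size ** -0.5 * step ** -0.5
#   step <  warmup: (model_size * warmup) ** -0.5 * (step / warmup)
#                     == model_size ** -0.5 * step * warmup ** -1.5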
