To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property
alignment_heads must be added to the
WhisperConfig object. This is a list of
[layer, head] pairs that select the cross-attention heads that are highly correlated to word-level timing.
If your Whisper checkpoint does not have the
alignment_heads property yet, it can be added in two possible ways.
Method 1. Change the
# load the model model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")