@hollance
Last active December 5, 2023 13:25
Alignment heads for Whisper word-level timestamps with Hugging Face Transformers

To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property alignment_heads must be added to the model's GenerationConfig object. It is a list of [layer, head] pairs (both indices are zero-based) that select the cross-attention heads whose weights correlate strongly with word-level timing.

If your Whisper checkpoint does not have the alignment_heads property yet, you can add it in one of two ways.

Method 1. Change the model.generation_config property:

from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")

# set the new property
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
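Before assigning the list, it may be worth checking that every [layer, head] pair actually exists in the model. The sketch below is only an illustration: check_alignment_heads is a hypothetical helper, not part of Transformers. For a loaded model the two limits would presumably come from model.config.decoder_layers and model.config.decoder_attention_heads; here they are hard-coded for whisper-tiny (4 decoder layers, 6 heads per layer).

```python
def check_alignment_heads(heads, num_layers, num_heads):
    """Hypothetical helper: raise ValueError if a [layer, head] pair is
    malformed, out of range, or listed twice. Returns True otherwise."""
    seen = set()
    for pair in heads:
        if len(pair) != 2:
            raise ValueError(f"expected [layer, head], got {pair!r}")
        layer, head = pair
        if not 0 <= layer < num_layers:
            raise ValueError(f"layer {layer} out of range 0..{num_layers - 1}")
        if not 0 <= head < num_heads:
            raise ValueError(f"head {head} out of range 0..{num_heads - 1}")
        if (layer, head) in seen:
            raise ValueError(f"duplicate pair {pair!r}")
        seen.add((layer, head))
    return True


# whisper-tiny: 4 decoder layers, 6 attention heads per layer
tiny_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]
check_alignment_heads(tiny_heads, num_layers=4, num_heads=6)
```

A list copied from a larger model size would fail this check, which catches the most likely copy-and-paste mistake early.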

Method 2. Add a new line to the generation_config.json file:

"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],
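If you would rather not edit the JSON by hand, the same change can be scripted with the standard library. This is a minimal sketch: add_alignment_heads is a hypothetical helper name, and the demo runs against a throwaway file in a temporary directory rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path

TINY_HEADS = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]


def add_alignment_heads(config_path, heads):
    """Hypothetical helper: add or overwrite "alignment_heads" in a
    generation_config.json file and return the updated dictionary."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["alignment_heads"] = heads
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a temporary file standing in for a checkpoint's generation_config.json
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "generation_config.json"
    config_file.write_text(json.dumps({"max_length": 448}), encoding="utf-8")
    updated = add_alignment_heads(config_file, TINY_HEADS)
    print(updated["alignment_heads"])
```

Going through json.loads and json.dumps also sidesteps the question of whether the new line needs a trailing comma.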

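After you're done, use push_to_hub to make the change permanent. With create_pr=True, the update is opened as a pull request on the Hub repository rather than committed directly:

# recent transformers releases call this argument `token`; `use_auth_token` is the older, deprecated name
model.push_to_hub("your_pretrained_checkpoint", use_auth_token="your_token_if_not_logged_in", create_pr=True)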

The correct values for alignment_heads depend on the size of the model and on whether it is the multilingual or the English-only (.en) variant. The values below are taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention weights to find the appropriate layers and attention heads.

whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]

whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]

whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]

whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]

whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]

whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]

whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]

whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]

whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]

whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]

whisper-large: same as large-v2
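For a fine-tuned checkpoint, one possible way to do the inspection mentioned above is to score each cross-attention head by how diagonal it looks and keep the best ones. The sketch below is a heuristic of my own, not the method OpenAI used: diagonal_score and rank_heads are hypothetical helpers, and the two matrices are synthetic. In practice the weights would have to be collected from the model (for example from the cross_attentions returned when running with output_attentions=True) and averaged over a test set.

```python
def diagonal_score(attn):
    """attn: list of rows, one per decoder token; each row holds the
    attention weights over audio frames. Returns the Pearson correlation
    between token index and attention-weighted mean frame index, so a
    clean diagonal scores close to 1.0."""
    centers = []
    for row in attn:
        total = sum(row)
        if total <= 0:
            return 0.0  # degenerate row, treat the head as uninformative
        centers.append(sum(j * w for j, w in enumerate(row)) / total)
    n = len(centers)
    mean_x = (n - 1) / 2
    mean_y = sum(centers) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(centers))
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    var_y = sum((y - mean_y) ** 2 for y in centers)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_heads(attn_by_head, top_k):
    """attn_by_head maps (layer, head) to an attention matrix.
    Returns the top_k pairs as [layer, head] lists, best first."""
    scored = sorted(attn_by_head.items(),
                    key=lambda item: diagonal_score(item[1]),
                    reverse=True)
    return [[layer, head] for (layer, head), _ in scored[:top_k]]


# Synthetic example: 4 tokens x 8 frames
diagonal_head = [[0.5 if j in (2 * i, 2 * i + 1) else 0.0 for j in range(8)]
                 for i in range(4)]
flat_head = [[0.125] * 8 for _ in range(4)]
synthetic = {(3, 0): diagonal_head, (1, 2): flat_head}
print(rank_heads(synthetic, top_k=1))
```

Plotting the top-ranked heads is still a good idea, since a high correlation alone does not guarantee that the attention is sharp enough to yield useful timestamps.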

@xenova

xenova commented Jul 8, 2023

Which alignment heads should we use for https://huggingface.co/openai/whisper-large? You have listed:

  • whisper-large: same as large-v2
  • whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]

However, I thought whisper-large and whisper-large-v1 were the same. For example, https://huggingface.co/openai/whisper-large-v1 doesn't exist, while https://huggingface.co/openai/whisper-large does.

@hollance
Author

hollance commented Jul 9, 2023

Try them both, see which one works best. ;-) @xenova

@Ar770

Ar770 commented Jul 11, 2023

@hollance thank you for the great work!
Do you know what the right alignment_heads are for a PEFT fine-tuned model (based on large-v2)?
How can I check it?

@hollance
Author

@Ar770 I haven't done it myself, but you can look at the (average) cross-attention weights for a test set, and then use the attention heads that give the cleanest-looking cross-attentions. In other words, when plotted, the cross-attention weights should form a diagonal.

@Ar770

Ar770 commented Jul 12, 2023

@hollance Can you point me to something similar? I have no clue how it is done.

@hollance
Author

@Ar770 I haven't done this myself. Perhaps the OpenAI folks have something they can share, as this is using their method.
