@hollance
Last active December 5, 2023 13:25
Alignment heads for Whisper word-level timestamps with Hugging Face Transformers

To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property, alignment_heads, must be added to the model's GenerationConfig object. This is a list of [layer, head] pairs selecting the cross-attention heads that are most strongly correlated with word-level timing.

If your Whisper checkpoint does not have the alignment_heads property yet, you can add it in either of two ways.

Method 1. Change the model.generation_config property:

from transformers import WhisperForConditionalGeneration

# load the model
model = WhisperForConditionalGeneration.from_pretrained("your_checkpoint")

# set the new property (these values are for whisper-tiny; see the table below)
model.generation_config.alignment_heads = [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]

Method 2. Add a new line to the generation_config.json file (again, the values shown are for whisper-tiny):

"alignment_heads": [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],

After you're done, use push_to_hub to make these changes permanent (with create_pr=True, this opens a pull request on the Hub rather than committing directly):

model.push_to_hub("your_pretrained_checkpoint", use_auth_token="your_token_if_not_logged_in", create_pr=True)

The correct values for alignment_heads depend on the size of the model. Here are the appropriate values for the different Whisper model sizes, taken from the OpenAI checkpoints. If you fine-tuned your own checkpoint, you may need to inspect the cross-attention weights to find the appropriate layers and attention heads.

whisper-tiny: [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]]

whisper-tiny.en: [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]]

whisper-base: [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]]

whisper-base.en: [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]]

whisper-small: [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]

whisper-small.en: [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]]

whisper-medium: [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]]

whisper-medium.en: [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]]

whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]

whisper-large-v2: [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]]

whisper-large: same as large-v2
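For programmatic lookup, the table above can be kept as a Python mapping from checkpoint name to head list (ALIGNMENT_HEADS is just an illustrative name; the values are copied verbatim from the lists above):

```python
# Alignment heads for the OpenAI Whisper checkpoints, as listed above.
ALIGNMENT_HEADS = {
    "whisper-tiny":      [[2, 2], [3, 0], [3, 2], [3, 3], [3, 4], [3, 5]],
    "whisper-tiny.en":   [[1, 0], [2, 0], [2, 5], [3, 0], [3, 1], [3, 2], [3, 3], [3, 4]],
    "whisper-base":      [[3, 1], [4, 2], [4, 3], [4, 7], [5, 1], [5, 2], [5, 4], [5, 6]],
    "whisper-base.en":   [[3, 3], [4, 7], [5, 1], [5, 5], [5, 7]],
    "whisper-small":     [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]],
    "whisper-small.en":  [[6, 6], [7, 0], [7, 3], [7, 8], [8, 2], [8, 5], [8, 7], [9, 0], [9, 4], [9, 8], [9, 10], [10, 0], [10, 1], [10, 2], [10, 3], [10, 6], [10, 11], [11, 2], [11, 4]],
    "whisper-medium":    [[13, 15], [15, 4], [15, 15], [16, 1], [20, 0], [23, 4]],
    "whisper-medium.en": [[11, 4], [14, 1], [14, 12], [14, 14], [15, 4], [16, 0], [16, 4], [16, 9], [17, 12], [17, 14], [18, 7], [18, 10], [18, 15], [20, 0], [20, 3], [20, 9], [20, 14], [21, 12]],
    "whisper-large-v1":  [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]],
    "whisper-large-v2":  [[10, 12], [13, 17], [16, 11], [16, 12], [16, 13], [17, 15], [17, 16], [18, 4], [18, 11], [18, 19], [19, 11], [21, 2], [21, 3], [22, 3], [22, 9], [22, 12], [23, 5], [23, 7], [23, 13], [25, 5], [26, 1], [26, 12], [27, 15]],
}
# whisper-large shares its weights with large-v2
ALIGNMENT_HEADS["whisper-large"] = ALIGNMENT_HEADS["whisper-large-v2"]
```

You can then set model.generation_config.alignment_heads = ALIGNMENT_HEADS["whisper-small"] (or whichever checkpoint you loaded).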

@desimmons

I've taken the recommended approach

from transformers import (
    AutomaticSpeechRecognitionPipeline,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", cache_dir='')
alignment_heads = [[5, 3], [5, 9], [8, 0], [8, 4], [8, 7], [8, 8], [9, 0], [9, 7], [9, 9], [10, 5]]
model.generation_config.alignment_heads = alignment_heads
processor = WhisperProcessor.from_pretrained("openai/whisper-small", cache_dir='')
pipe = AutomaticSpeechRecognitionPipeline(
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor
)

# ...

output = pipe(numpy_input, chunk_length_s=30, stride_length_s=[4, 2], return_timestamps="word")

Using a short audio file, I receive the following output:
{'text': ' Hello my name is David Simmons. Je mange du poisson.', 'chunks': [{'timestamp': (0.0, 6.56), 'text': ' Hello my name is David Simmons. Je mange du poisson.'}]}

Is it possible for my timestamps to be per word, or must I accept a single timestamp for the collection of words?


hollance commented Jul 3, 2023

@desimmons You will need to use the main branch of Transformers to get this feature; it's not part of any official release yet.

@desimmons

Thanks @hollance. Much appreciated! I was able to make substantial progress.

I'm now seeing very unstable timestamps for certain inputs. From what I can tell, if an input is too short (e.g., a couple of seconds) or the audio clip chops off an initial word, the timestamps get mangled. They tend to end up close to 30s.

Do you have any insight or guidance on this?


hollance commented Jul 4, 2023

I have also seen that it doesn't always work very well with short inputs. This isn't something we can easily fix. This method of getting the timestamps uses the cross-attention weights, and if the model doesn't output very good cross-attentions (which seems to happen with short inputs) then the timestamps won't make much sense.


xenova commented Jul 8, 2023

What alignment heads must we use for https://huggingface.co/openai/whisper-large ? You have listed:

  • whisper-large: same as large-v2
  • whisper-large-v1: [[9, 19], [11, 2], [11, 4], [11, 17], [22, 7], [22, 11], [22, 17], [23, 2], [23, 15]]

However, I thought whisper-large and whisper-large-v1 were the same? As an example, https://huggingface.co/openai/whisper-large-v1 doesn't exist, while https://huggingface.co/openai/whisper-large does.


hollance commented Jul 9, 2023

Try them both, see which one works best. ;-) @xenova


Ar770 commented Jul 11, 2023

@hollance thank you for the great work!
Do you know what the right alignment_heads are for a PEFT fine-tuned model (based on large-v2)?
How can I check?

@hollance

@Ar770 I haven't done it myself, but you can look at the (average) cross-attention weights for a test set and then use the attention heads that give the nicest-looking cross-attentions. In other words, when plotted, the cross-attention weights should form a diagonal.
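The "nicest-looking" criterion can be turned into a rough score. This is not OpenAI's exact selection procedure, just a sketch under my own assumptions: given a (text tokens x audio frames) cross-attention matrix for one head (e.g. averaged over a test set, obtained from Transformers with output_attentions=True), score how closely the per-token attention peaks track a monotonic diagonal, then keep the best-scoring [layer, head] pairs. The function name diagonality and the toy matrices below are made up for illustration.

```python
import numpy as np

def diagonality(attn):
    """Score how closely a (text_tokens x audio_frames) cross-attention
    matrix tracks a monotonic diagonal: for each text token, take the
    audio frame with the highest weight and measure how far those peaks
    deviate from a straight line through the matrix. 1.0 = perfectly
    diagonal; lower = scattered attention."""
    n_tok, n_frames = attn.shape
    peaks = attn.argmax(axis=1) / max(n_frames - 1, 1)   # normalized peak positions
    ideal = np.arange(n_tok) / max(n_tok - 1, 1)         # the ideal diagonal
    return 1.0 - np.abs(peaks - ideal).mean()

# toy example: a head with near-diagonal attention vs. a uniform head
rng = np.random.default_rng(0)
diag_head = np.eye(8) + 0.01 * rng.random((8, 8))
flat_head = np.full((8, 8), 1.0 / 8)
print(diagonality(diag_head) > diagonality(flat_head))  # True
```

In practice you would compute this per head over many utterances and pick the consistently highest-scoring heads as your alignment_heads.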


Ar770 commented Jul 12, 2023

@hollance Can you refer me to something similar? I have no clue how it is done.

@hollance

@Ar770 I haven't done this myself. Perhaps the OpenAI folks have something they can share, as this is using their method.
