@ljnmedium
ljnmedium / ex.md
Created July 12, 2023 12:35
exemple.md
| Error | Definition |
| --- | --- |
| False Alarm | Speech segment predicted where there is no speaker (false positive from the VAD model) |
| Missed Detection | No speech detected where there is a speaker (false negative from the VAD model) |
| Confusion | Speech is assigned to the wrong cluster (error from the clustering model) |
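
These three error types are the components usually summed to obtain the Diarization Error Rate (DER). A minimal sketch, assuming the standard definition (summed error durations divided by the total reference speech duration); the function name and the example durations are hypothetical and not taken from the source:

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """Return DER given error durations and total reference speech, all in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# Hypothetical durations in seconds, for illustration only.
der = diarization_error_rate(
    false_alarm=6.0, missed_detection=1.0, confusion=3.0, total_speech=100.0
)
print(f"DER = {der:.2f}")  # prints "DER = 0.10"
```

Because false alarms are counted in the numerator but not in the denominator, a DER above 1.0 is possible when a system predicts a lot of speech that is not there.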
ljnmedium / metric.md
Created July 12, 2023 12:37
metric.md
| Error | Definition |
| --- | --- |
| False Alarm | Speech segment predicted where there is no speaker (false positive from the VAD model) |
| Missed Detection | No speech detected where there is a speaker (false negative from the VAD model) |
| Confusion | Speech is assigned to the wrong cluster (error from the clustering model) |
ljnmedium / performe2.md
Created July 12, 2023 13:01
performe2.md
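| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote - 7 clusters specified | 0.49 | 0.80 | 0.74 | 0.07 | 0.23 | 0.19 |
| NeMo - no cluster number specified | 0.17 | 0.84 | 0.24 | 0.08 | 0.07 | 0.02 |

NeMo - manual parameter tuning: 0.12, 0.15 (only two values are given for this row; the columns they belong to are not indicated)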
ljnmedium / perform1.md
Created July 12, 2023 13:02
perform1.md
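| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote | 0.10 | 0.14 | 0.18 | 0.01 | 0.06 | 0.03 |
| NeMo - default parameters | 0.37 | 0.32 | 0.44 | 0.36 | 0.01 | 0.01 |
| NeMo - optimized VAD parameters | 0.11 | 0.16 | 0.15 | 0.04 | 0.06 | 0.01 |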
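
The DER column can be cross-checked against the three error components. The sketch below assumes that DER is the sum of missed detection (MD), false alarm (FA) and speaker confusion (SC), which is the usual definition but is not stated explicitly in the source; the helper name `component_sum` is hypothetical. The values are copied from the table above.

```python
# Rows copied from the table above: (DER, MD, FA, SC).
rows = {
    "pyannote": (0.10, 0.01, 0.06, 0.03),
    "NeMo - default parameters": (0.37, 0.36, 0.01, 0.01),
    "NeMo - optimized VAD parameters": (0.11, 0.04, 0.06, 0.01),
}


def component_sum(md, fa, sc):
    """Sum the three error components, rounded to the table's two decimals."""
    return round(md + fa + sc, 2)


for model, (der, md, fa, sc) in rows.items():
    total = component_sum(md, fa, sc)
    # Allow a little more than 0.01 of slack because the published values are rounded.
    status = "consistent" if abs(total - der) <= 0.015 else "inconsistent"
    print(f"{model}: DER={der:.2f}, MD+FA+SC={total:.2f} ({status})")
```

For the default NeMo configuration almost all of the error comes from missed detection, which fits the improvement reported after tuning only the VAD parameters.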
ljnmedium / compare.md
Last active July 17, 2023 08:57
compare.md
pyannote NeMo
Pre-trained models available
Good detection of overlapping speakers (multilabel segmentation)
Easy integration with ASR and downstream NLP tasks
Possibility to specify the number of speakers as an inference parameter
Automatic detection of the number of speakers
Models available for specific use cases (phone call, outdoor conversation, high quality, …)
Highly customizable pipeline
ljnmedium / step.md
Last active July 17, 2023 07:56
step.md
| | pyannote | NeMo |
| --- | --- | --- |
| Voice Activity Detection (VAD) | PyanNet, derived from SincNet | MarbleNet |
| Audio embedding | ECAPA-TDNN | TitaNet |
| Clustering | Hidden Markov Model clustering | Multi-scale clustering (MSDD) |
ljnmedium / conclu.md
Created July 12, 2023 13:12
conclu.md
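| Model | Parameter Name | Value |
| --- | --- | --- |
| General | Input sample rate | 16 000 Hz |
| | Batch size | 16 |
| VAD | Window length (s) | 0.8 |
| | Shift length (s) | 0.04 |
| | Pad onset (s) | 0.1 |
| | Pad offset (s) | -0.05 |
| Speaker embedding | Window length (s) | [1.5, 1.25, 1.0, 0.75, 0.5] |
| | Shift length (s) | [0.75, 0.625, 0.5, 0.375, 0.25] |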
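
The speaker-embedding rows define five scales, each with a shift equal to half its window (50% overlap). The sketch below illustrates what such multi-scale sliding windows look like on a stretch of speech. It is not NeMo's implementation: the function `sliding_windows`, the clipping of the last window, and the 3-second duration are assumptions made for illustration.

```python
WINDOW_LENGTHS = [1.5, 1.25, 1.0, 0.75, 0.5]     # seconds, from the table above
SHIFT_LENGTHS = [0.75, 0.625, 0.5, 0.375, 0.25]  # seconds, from the table above


def sliding_windows(duration, window, shift):
    """Return (start, end) pairs covering [0, duration]; the last window is clipped."""
    segments = []
    start = 0.0
    while start < duration:
        segments.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # this window already reaches the end of the speech region
        start += shift
    return segments


speech_duration = 3.0  # hypothetical length of a speech region, in seconds
for window, shift in zip(WINDOW_LENGTHS, SHIFT_LENGTHS):
    segments = sliding_windows(speech_duration, window, shift)
    print(f"window={window}s shift={shift}s -> {len(segments)} segments")
```

Longer windows yield fewer but more stable embeddings, while shorter windows give finer temporal resolution; combining several scales is a way to trade off between the two.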
ljnmedium / pipeline.md
Created July 12, 2023 13:15
pipeline.md
| Task | Model version | Comments |
| --- | --- | --- |
| Voice Activity Detection | Multilingual MarbleNet | Other versions exist, trained on telephone conversations or on English data only. |
| Speaker Embeddings | TitaNet Large | A smaller version of the model exists. |
| Multiscale Clustering Diarization | MSDD Telephonic | Specifically trained on telephone conversations, which makes it suitable for similar use cases. |
ljnmedium / tab.md
Created July 13, 2023 15:56
tab.md

| | start | length | label | text