ljnmedium

## nhutljn-temporal-expression-demo-1.gif

      
Last active February 28, 2022 17:31
            
          
## ex.md

      
Created July 12, 2023 12:35
              
          
| Error | Definition |
| --- | --- |
| False Alarm | Speech segment predicted where there is no speaker (False positive from VAD model) |
| Missed Detection | No speech detected where there is a speaker (False negative from VAD model) |
| Confusion | Speech is in the wrong cluster (error from the clustering model) |
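
The three error types above are conventionally combined into a single diarization error rate (DER): their summed duration divided by the total duration of reference speech. A minimal sketch, assuming the durations have already been measured in seconds. The function and argument names are illustrative and do not come from pyannote or NeMo.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """Return DER as a fraction of the reference speech time.

    All arguments are durations in seconds. The names are illustrative
    assumptions, not part of any library API.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# Hypothetical durations: 6 s false alarm, 1 s missed, 3 s confused, 100 s of speech.
print(diarization_error_rate(6.0, 1.0, 3.0, 100.0))
```

Because false alarms are added on top of the reference speech, the ratio can exceed 1 for a very poor system.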


## metric.md

      
Created July 12, 2023 12:37
              
          
| Error | Definition |
| --- | --- |
| False Alarm | Speech segment predicted where there is no speaker (False positive from VAD model) |
| Missed Detection | No speech detected where there is a speaker (False negative from VAD model) |
| Confusion | Speech is in the wrong cluster (error from the clustering model) |


## performe2.md

      
Created July 12, 2023 13:01
              
          
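| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote - 7 clusters specified | 0.49 | 0.80 | 0.74 | 0.07 | 0.23 | 0.19 |
| NeMo - no cluster number specified | 0.17 | 0.84 | 0.24 | 0.08 | 0.07 | 0.02 |
| NeMo - manual parameter tuning | 0.12 | | 0.15 | | | |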


## perform1.md

      
Created July 12, 2023 13:02
              
          
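| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote | 0.10 | 0.14 | 0.18 | 0.01 | 0.06 | 0.03 |
| NeMo - default parameters | 0.37 | 0.32 | 0.44 | 0.36 | 0.01 | 0.01 |
| NeMo - optimized VAD parameters | 0.11 | 0.16 | 0.15 | 0.04 | 0.06 | 0.01 |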
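
The table can be sanity-checked by summing the error components. The sketch below assumes that MD, FA and SC stand for missed detection, false alarm and speaker confusion, and that each is reported as a fraction of reference speech. The table does not state either point explicitly. Under those assumptions the three columns should add up to DER, up to rounding.

```python
# Rows copied from the table above: (model, DER, MD, FA, SC).
ROWS = [
    ("pyannote", 0.10, 0.01, 0.06, 0.03),
    ("NeMo - default parameters", 0.37, 0.36, 0.01, 0.01),
    ("NeMo - optimized VAD parameters", 0.11, 0.04, 0.06, 0.01),
]

# Four values rounded to two decimals can drift by up to 0.005 each.
TOLERANCE = 0.02


def component_gap(der, md, fa, sc):
    """Absolute difference between the reported DER and MD + FA + SC."""
    return abs(der - (md + fa + sc))


for model, der, md, fa, sc in ROWS:
    gap = component_gap(der, md, fa, sc)
    status = "consistent" if gap <= TOLERANCE else "check"
    print(f"{model}: DER={der:.2f}, MD+FA+SC={md + fa + sc:.2f} ({status})")
```

Only the default-parameter NeMo row shows a gap (0.37 reported against a sum of 0.38), which is within what rounding alone can produce.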


## compare.md

      
Last active July 17, 2023 08:57
              
          
| | `pyannote` | `NeMo` |
| --- | --- | --- |
| Pre-trained models available | ✅ | ✅ |
| Good detection of overlapping speakers (multilabel segmentation) | ✅ | ➖ |
| Easy integration with ASR and downstream NLP tasks | ➖ | ✅ |
| Number of speakers can be specified as an inference parameter | ✅ | ✅ |
| Automatic detection of the number of speakers | ✅ | ✅ |
| Models available for specific use cases (phone call, outdoor conversation, high quality, …) | ❌ | ✅ |
| Highly customizable pipeline | ➖ | ✅ |


## step.md

      
Last active July 17, 2023 07:56
              
          
| | `pyannote` | `NeMo` |
| --- | --- | --- |
| Voice Activity Detection (VAD) | PyanNet (derived from SincNet) | MarbleNet |
| Audio embedding | ECAPA-TDNN | TitaNet |
| Clustering | Hidden Markov Model clustering | Multi-scale clustering (MSDD) |


## conclu.md

      
Created July 12, 2023 13:12
              
          
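| Model | Parameter name | Value |
| --- | --- | --- |
| General | Input sample rate (Hz) | 16000 |
| | Batch size | 16 |
| VAD | Window length (s) | 0.8 |
| | Shift length (s) | 0.04 |
| | Pad onset (s) | 0.1 |
| | Pad offset (s) | -0.05 |
| Speaker embedding | Window length (s) | [1.5, 1.25, 1.0, 0.75, 0.5] |
| | Shift length (s) | [0.75, 0.625, 0.5, 0.375, 0.25] |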
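
These parameters could be written as a nested configuration, as sketched below. The values come from the table. The key names follow the layout of NeMo's diarization inference YAML files as an assumption and may differ between NeMo versions. The sketch also computes the overlap between consecutive windows at each speaker embedding scale.

```python
# Values from the table above. Key names are an assumption modeled on
# NeMo's diarization inference config and may vary across versions.
config = {
    "sample_rate": 16000,
    "batch_size": 16,
    "diarizer": {
        "vad": {
            "parameters": {
                "window_length_in_sec": 0.8,
                "shift_length_in_sec": 0.04,
                "pad_onset": 0.1,
                "pad_offset": -0.05,
            }
        },
        "speaker_embeddings": {
            "parameters": {
                "window_length_in_sec": [1.5, 1.25, 1.0, 0.75, 0.5],
                "shift_length_in_sec": [0.75, 0.625, 0.5, 0.375, 0.25],
            }
        },
    },
}

emb = config["diarizer"]["speaker_embeddings"]["parameters"]
windows = emb["window_length_in_sec"]
shifts = emb["shift_length_in_sec"]

# Fraction of each window shared with the next one, per scale.
overlaps = [1 - shift / window for window, shift in zip(windows, shifts)]
print(overlaps)  # [0.5, 0.5, 0.5, 0.5, 0.5]
```

Every scale uses a shift of exactly half its window, so consecutive embedding windows overlap by 50% at all five scales.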


## pipeline.md

      
Created July 12, 2023 13:15
              
          
| Task | Model version | Comments |
| --- | --- | --- |
| Voice Activity Detection | Multilingual MarbleNet | Other versions exist, trained on telephonic conversations or on English data only. |
| Speaker Embeddings | TitaNet Large | A smaller version of the model exists. |
| Multiscale Clustering | Diarization MSDD Telephonic | Specifically trained on telephonic conversations, which makes it suitable for similar use cases. |
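
As a rough illustration, the three model choices could be expressed as configuration overrides. The identifiers below are believed to match the pretrained model names in NVIDIA's catalog, and the dotted keys follow NeMo's diarization config layout. Both are assumptions to verify against the installed NeMo version, and `as_overrides` is a hypothetical helper.

```python
# Assumed pretrained model identifiers and config keys; verify both
# against the NeMo version in use before relying on them.
MODEL_PATHS = {
    "diarizer.vad.model_path": "vad_multilingual_marblenet",
    "diarizer.speaker_embeddings.model_path": "titanet_large",
    "diarizer.msdd_model.model_path": "diar_msdd_telephonic",
}


def as_overrides(paths):
    """Render the mapping as key=value strings, one per model."""
    return [f"{key}={value}" for key, value in paths.items()]


for line in as_overrides(MODEL_PATHS):
    print(line)
```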


## tab.md

      
Created July 13, 2023 15:56
              
          
    | | start | length | label | text
| Error | Definition |
| --- | --- |
| False Alarm | Speech segment predicted where there is no speaker (False positive from VAD model) |
| Missed Detection | No speech detected where there is a speaker (False negative from VAD model) |
| Confusion | Speech is in the wrong cluster (error from the clustering model) |

| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote - 7 clusters specified | 0.49 | 0.80 | 0.74 | 0.07 | 0.23 | 0.19 |
| NeMo - no cluster number specified | 0.17 | 0.84 | 0.24 | 0.08 | 0.07 | 0.02 |
| NeMo - manual parameter tuning | 0.12 | | 0.15 | | | |

| Model | DER | CDER | BER | MD | FA | SC |
| --- | --- | --- | --- | --- | --- | --- |
| pyannote | 0.10 | 0.14 | 0.18 | 0.01 | 0.06 | 0.03 |
| NeMo - default parameters | 0.37 | 0.32 | 0.44 | 0.36 | 0.01 | 0.01 |
| NeMo - optimized VAD parameters | 0.11 | 0.16 | 0.15 | 0.04 | 0.06 | 0.01 |

| | `pyannote` | `NeMo` |
| --- | --- | --- |
| Pre-trained models available | ✅ | ✅ |
| Good detection of overlapping speakers (multilabel segmentation) | ✅ | ➖ |
| Easy integration with ASR and downstream NLP tasks | ➖ | ✅ |
| Number of speakers can be specified as an inference parameter | ✅ | ✅ |
| Automatic detection of the number of speakers | ✅ | ✅ |
| Models available for specific use cases (phone call, outdoor conversation, high quality, …) | ❌ | ✅ |
| Highly customizable pipeline | ➖ | ✅ |

| | `pyannote` | `NeMo` |
| --- | --- | --- |
| Voice Activity Detection (VAD) | PyanNet (derived from SincNet) | MarbleNet |
| Audio embedding | ECAPA-TDNN | TitaNet |
| Clustering | Hidden Markov Model clustering | Multi-scale clustering (MSDD) |

| Model | Parameter name | Value |
| --- | --- | --- |
| General | Input sample rate (Hz) | 16000 |
| | Batch size | 16 |
| VAD | Window length (s) | 0.8 |
| | Shift length (s) | 0.04 |
| | Pad onset (s) | 0.1 |
| | Pad offset (s) | -0.05 |
| Speaker embedding | Window length (s) | [1.5, 1.25, 1.0, 0.75, 0.5] |
| | Shift length (s) | [0.75, 0.625, 0.5, 0.375, 0.25] |

| Task | Model version | Comments |
| --- | --- | --- |
| Voice Activity Detection | Multilingual MarbleNet | Other versions exist, trained on telephonic conversations or on English data only. |
| Speaker Embeddings | TitaNet Large | A smaller version of the model exists. |
| Multiscale Clustering | Diarization MSDD Telephonic | Specifically trained on telephonic conversations, which makes it suitable for similar use cases. |