-
-
Save johnmeade/d8d2c67b87cda95cd253f55c21387e75 to your computer and use it in GitHub Desktop.
import numpy as np | |
def wada_snr(wav): | |
# Direct blind estimation of the SNR of a speech signal. | |
# | |
# Paper on WADA SNR: | |
# http://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf | |
# | |
# This function was adapted from this matlab code: | |
# https://labrosa.ee.columbia.edu/projects/snreval/#9 | |
# init | |
eps = 1e-10 | |
# next 2 lines define a fancy curve derived from a gamma distribution -- see paper | |
db_vals = np.arange(-20, 101) | |
g_vals = np.array([0.40974774, 0.40986926, 0.40998566, 0.40969089, 0.40986186, 0.40999006, 0.41027138, 0.41052627, 0.41101024, 0.41143264, 0.41231718, 0.41337272, 0.41526426, 0.4178192 , 0.42077252, 0.42452799, 0.42918886, 0.43510373, 0.44234195, 0.45161485, 0.46221153, 0.47491647, 0.48883809, 0.50509236, 0.52353709, 0.54372088, 0.56532427, 0.58847532, 0.61346212, 0.63954496, 0.66750818, 0.69583724, 0.72454762, 0.75414799, 0.78323148, 0.81240985, 0.84219775, 0.87166406, 0.90030504, 0.92880418, 0.95655449, 0.9835349 , 1.01047155, 1.0362095 , 1.06136425, 1.08579312, 1.1094819 , 1.13277995, 1.15472826, 1.17627308, 1.19703503, 1.21671694, 1.23535898, 1.25364313, 1.27103891, 1.28718029, 1.30302865, 1.31839527, 1.33294817, 1.34700935, 1.3605727 , 1.37345513, 1.38577122, 1.39733504, 1.40856397, 1.41959619, 1.42983624, 1.43958467, 1.44902176, 1.45804831, 1.46669568, 1.47486938, 1.48269965, 1.49034339, 1.49748214, 1.50435106, 1.51076426, 1.51698915, 1.5229097 , 1.528578 , 1.53389835, 1.5391211 , 1.5439065 , 1.54858517, 1.55310776, 1.55744391, 1.56164927, 1.56566348, 1.56938671, 1.57307767, 1.57654764, 1.57980083, 1.58304129, 1.58602496, 1.58880681, 1.59162477, 1.5941969 , 1.59693155, 1.599446 , 1.60185011, 1.60408668, 1.60627134, 1.60826199, 1.61004547, 1.61192472, 1.61369656, 1.61534074, 1.61688905, 1.61838916, 1.61985374, 1.62135878, 1.62268119, 1.62390423, 1.62513143, 1.62632463, 1.6274027 , 1.62842767, 1.62945532, 1.6303307 , 1.63128026, 1.63204102]) | |
# peak normalize, get magnitude, clip lower bound | |
wav = np.array(wav) | |
wav = wav / abs(wav).max() | |
abs_wav = abs(wav) | |
abs_wav[abs_wav < eps] = eps | |
# calcuate statistics | |
# E[|z|] | |
v1 = max(eps, abs_wav.mean()) | |
# E[log|z|] | |
v2 = np.log(abs_wav).mean() | |
# log(E[|z|]) - E[log(|z|)] | |
v3 = np.log(v1) - v2 | |
# table interpolation | |
wav_snr_idx = None | |
if any(g_vals < v3): | |
wav_snr_idx = np.where(g_vals < v3)[0].max() | |
# handle edge cases or interpolate | |
if wav_snr_idx is None: | |
wav_snr = db_vals[0] | |
elif wav_snr_idx == len(db_vals) - 1: | |
wav_snr = db_vals[-1] | |
else: | |
wav_snr = db_vals[wav_snr_idx] + \ | |
(v3-g_vals[wav_snr_idx]) / (g_vals[wav_snr_idx+1] - \ | |
g_vals[wav_snr_idx]) * (db_vals[wav_snr_idx+1] - db_vals[wav_snr_idx]) | |
# Calculate SNR | |
dEng = sum(wav**2) | |
dFactor = 10**(wav_snr / 10) | |
dNoiseEng = dEng / (1 + dFactor) # Noise energy | |
dSigEng = dEng * dFactor / (1 + dFactor) # Signal energy | |
snr = 10 * np.log10(dSigEng / dNoiseEng) | |
return snr |
I believe the speech statistics don't change much with sample rate, so this should work on a decent range of sample rates. The tools I adapted this from downsample the audio to 8kHz before computing the SNR, but I've used this with up to 22kHz audio. Probably worth trying out a few different sample rates.
Thanks for your reply. Yes, I also find that the estimate doesn't change much when 16KHz signals are downsampled to 8KHz. Tip: line 22 can be changed from wav /= abs(wav).max()
to wav = wav/abs(wav).max()
so that there is no error for integer-format audio data.
Thanks
Hello, John! Have you had problems with this algorithm? I try to use this implementation of WADA SNR to solve SNR estimation problem. I have result 100.0dB for very noisy files! Is it work correctly? I have results in range 12-22dB for pretty good speech. In my opinion there is something wrong.
Yes you are correct, here are some SNR scores for some random speech data:
The issue is coming from the "edge case" handling here: wav_snr = db_vals[-1]
, and a similar situation will happen with SNR=-20
. You can try to tune the gamma "shape" parameter to get a better fit for your data (explained in paper), it is assumed to be 0.4 here, but a value of 0.5 could be used just as well. That would involve re-computing the values for the curve that are hardcoded in here as g_vals
. Perhaps dropping samples that hit these edge cases is another option for you?
I think this is just a limitation of the method at the end of the day, but if anyone has more insight I'd be happy to hear!
Cheers
Hi, thanks for this implementation. Have you tested in practice how long needs to be the wav file so that the estimated statistics match the theoretical ones? (ie that the amplitudes of clean speech samples match the expected gamma distribution). I was wondering which of the 2 approaches (this one vs the snr estimated after VAD) is the most accurate?
Shorter clips can indeed bias your results with this method, but I haven't tested what the limits are. I would probably recommend the VAD method in the other gist, due to the somewhat frequent edge cases and higher complexity of this method. The VAD method requires some silence regions in the audio however, which is an edge-case of it's own. A combination of the two could work, although I suspect the SNR from each method won't match exactly which could cause issues. I'm afraid I don't have a satisfying answer 😅
Hello, thanks a lot for sharing this piece of code.
To be sure to understand, the 100dB would correspond to very noisy samples because of edge cases, but apart from these edge cases, samples that are e.g. >90dB and <99dB are very clean samples (high SNR)?
Could you confirm we can use this as MIT Licensed, like your VAD gist you're mentioning here?
I was able to get good results by dropping samples at -20dB and 100dB, but as this is a statistical method you may still find exceptions.
As for licensing, this is a re-implementation of the original matlab code, but the source code doesn't have a license that I could find. I reached out to the authors months ago and haven't heard back, so I'm not sure it can be used in a commercial setting 🤷. I would license it as MIT (or the least restrictive derivative license) if I could though.
Your last block of code is not needed, e.g.
snr = 10 * np.log10(dSigEng / dNoiseEng)
= 10 * np.log10((dEng * dFactor / (1 + dFactor)) / (dEng / (1 + dFactor)))
= 10 * np.log10(dFactor)
= 10 * np.log10(10**(wav_snr / 10))
= wav_snr
Hello, I have a short question :) I want to use this function to calculate the SNR for audio files generated with a TTS system. So far, I get a SNR value of 100 for every audio. I also tried it with a different audio file (one that definitely contains noise) and the SNR value was 21. So I don't think I did anything incorrectly with the function.
Would you still disregard the SNR values for the TTS sentences? Or do you think this could a possible result?
I'm very new to this field, so I just would like to be sure. Thank you! :)
Hello, I am trying to re-computing the value of g_vals
above with 0.5 shape parameters. Are you able to share the code of computing g_vals
? It is quite hard to implement the integral equation in the paper. Thank you so much!
Thanks for the work. Does this expect data from 16KHz wavfiles?