Format:
- Format: FLAC
- Sample Rate: 16000Hz
- Channels: 1 (mono)
- Depth: 16bit
- Bitrate: Limits unknown; 160Kbps worked fine
Converting to FLAC:
$ ffmpeg -i input_file.m4a input_file.mp3 # If needed (I recorded on an iPhone)
$ sox input_file.mp3 -r 16000 -b 16 -c 1 input_file.flac
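Before uploading, it can help to confirm that the converted file really matches the format listed above. A minimal sketch, assuming `soxi` (installed together with SoX) is available; the function name `check_flac` is made up for this example:

```bash
# Return 0 only if the file is FLAC, 16000 Hz, mono, 16-bit.
# soxi flags: -t file type, -r sample rate, -c channels, -b bits per sample.
check_flac() {
  local file="$1" type rate channels bits
  type=$(soxi -t "$file")     || return 1
  rate=$(soxi -r "$file")     || return 1
  channels=$(soxi -c "$file") || return 1
  bits=$(soxi -b "$file")     || return 1
  if [ "$type" = flac ] && [ "$rate" = 16000 ] &&
     [ "$channels" = 1 ] && [ "$bits" = 16 ]; then
    return 0
  fi
  echo "unexpected format: $type, ${rate}Hz, ${channels}ch, ${bits}bit" >&2
  return 1
}

# Usage:
#   check_flac input_file.flac && echo "ready to upload"
```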
- Endpoint: https://www.google.com/speech-api/v1/recognize
- Method: POST
- Headers:
- Content-Type: audio/x-flac; rate=16000;
- User-Agent: Copy from a recent Chrome browser
- Query:
- lang: pt-BR, en-US, …
- client: chrome
- maxresults: integer
- Payload: FLAC binary file.
Call:
$ curl -X POST --data-binary @audio_file.flac \
--header "Content-Type: audio/x-flac; rate=16000;" \
--user-agent "speech2text" \
"https://www.google.com/speech-api/v1/recognize?client=chromium&lang=pt-BR&maxresults=10"
{
"status": 0,
"id": "ecdd8041e3eaee77c6897d0939389e40-1",
"hypotheses": [
{
"utterance": "são paulo sim não celular wesley",
"confidence": 0.77939147
},
{
"utterance": "são paulo sim ou não celular wesley"
}
]
}
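To use the result in a script, the first utterance can be pulled out of a response like the one above. This is only a sketch: it assumes that the first hypothesis is the preferred one (in the sample it is the only one carrying a confidence score) and that utterances contain no escaped double quotes. The function name `top_utterance` is made up for this example.

```bash
# Print the first "utterance" value from a recognize response read on stdin.
# Works on pretty-printed and single-line JSON alike.
# Assumption: no escaped double quotes inside the utterance text.
top_utterance() {
  grep -a -o '"utterance": *"[^"]*"' \
    | head -n 1 \
    | sed -e 's/^"utterance": *"//' -e 's/"$//'
}

# Usage (same request as above, piped into the helper):
#   curl -X POST --data-binary @audio_file.flac ... | top_utterance
```

Where `jq` is installed, `jq -r '.hypotheses[0].utterance'` does the same job with a real JSON parser and is the sturdier choice.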
I asked some native Brazilian Portuguese speakers to say "São Paulo, sim, não, celular, [your name]", and got:
- "são paulo sim não celular wesley", confidence: 0.77939147
- "são paulo sim não celular rafael", confidence: 0.7373202
- "são paulo sim não celular roberta", confidence: 0.6814435
- "são paulo sim não celular pedro", confidence: 0.70905286
- "são paulo sim não celular diogo", confidence: 0.71847606
All testers were recorded using the same device (iPhone5) in similar noise conditions.
Same test as above, but localized: "Santiago, si, no, celular, [your name]":
- "santiago si no celular fernanda", confidence: 0.6885493
- Non-native speaker
- Tested with Chile locale
- "Santiago si no celular Eduardo", confidence: 0.7460225
- Non-native speaker
- Tested as plain Spanish without locale
- Only case where proper names were capitalized
Same test as above, but localized: "Tel Aviv, yes, no, cellphone, [your name]":
- "תל אביב כן לא סלולרי שלומי אסף", confidence: 0.82569724
- 100% accurate
Same test as above, but localized: "Tokyo, yes, no, cellphone, [your name]":
- "東京 はい いいえ 携帯 ほな男", confidence: 0.4775479
- Non-native speaker
- Bad parse of non-Japanese name
- It didn't understand it was foreign
- Should be parsed as "ロナウド" instead of "ほな男"
- Probably fine for Japanese names
- All other words were fine
- Confidence of 0.71819603 for the same audio without the name
- Endpoint: https://translate.google.com/translate_tts
- Method: GET
- Headers:
- User-Agent: Copy from a recent Chrome browser
- Query:
- tl: pt-BR, en-US, …
- q: Text to be spoken
Call:
$ curl -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36' \
'http://translate.google.com/translate_tts?tl=pt-BR&q=Rio%20de%20Janeiro' > output.mp3
An MP3 file with the following specification:
Format:
- Format: MP3
- Sample Rate: 22050Hz
- Channels: 1 (mono)
- Depth: 16bit
- Bitrate: 32Kbps
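The call above percent-encodes the text by hand (`Rio%20de%20Janeiro`). For arbitrary text, curl can do the encoding itself: `--get` moves `--data-urlencode` values into the query string. A sketch under assumptions: the names `tts_fetch` and `TTS_USER_AGENT` are made up for this example, and nothing here checks that the endpoint still answers as described in these notes.

```bash
# User-Agent copied from the call above; any recent Chrome string should do.
TTS_USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36'

# tts_fetch LANG TEXT OUTPUT_FILE
# Sends a GET request with tl and q percent-encoded by curl and
# writes the response body to OUTPUT_FILE.
tts_fetch() {
  local lang="$1" text="$2" out="$3"
  curl --silent --show-error --fail --get \
    --user-agent "$TTS_USER_AGENT" \
    --data-urlencode "tl=${lang}" \
    --data-urlencode "q=${text}" \
    --output "$out" \
    'https://translate.google.com/translate_tts'
}

# Usage:
#   tts_fetch pt-BR 'Rio de Janeiro' output.mp3
```

Letting curl encode the value also covers non-ASCII text such as "São Paulo", which is easy to get wrong when encoding by hand.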
Gender/Notes:
- Portuguese
- pt-BR: female
- pt-PT: female
- English
- en-US: female
- en-GB: male
- Spanish
- es-MX: female
- es-ES: female
- es-CL: female
- Japanese
- ja-JP: female (doesn't understand romaji, tries to spell every letter)
- Hebrew
- he: no output