Racum/google_speech_apis.md

## google_speech_apis.md

      
Google Speech to Text API

Audio

Format:

Format: FLAC
Sample Rate: 16000 Hz
Channels: 1 (mono)
Depth: 16 bit
Bitrate: Unknown limits; 160 kbps worked fine

Converting to FLAC:
$ ffmpeg -i input_file.m4a input_file.mp3  # Only if needed (I recorded on an iPhone)
$ sox input_file.mp3 -r 16000 -b 16 -c 1 input_file.flac
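
For more than a handful of recordings, the sox step above could be scripted. The following is a minimal sketch rather than part of the original notes: the helper name `flac_command` is hypothetical, and it only builds the argument list with the same flags as the command above (`-r 16000 -b 16 -c 1`). Actually running the result assumes sox is installed and can read the input format.

```python
from pathlib import Path


def flac_command(source):
    """Build the sox argument list from the notes for one input file."""
    src = Path(source)
    target = src.with_suffix(".flac")
    # Same flags as the sox call above: 16000 Hz, 16 bit, 1 channel (mono).
    return ["sox", str(src), "-r", "16000", "-b", "16", "-c", "1", str(target)]


# Show the command that would be run for one file.
print(" ".join(flac_command("input_file.mp3")))
# prints: sox input_file.mp3 -r 16000 -b 16 -c 1 input_file.flac
```

The list can be handed to `subprocess.run(..., check=True)` to perform the conversion.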

API Call

Request


Endpoint: https://www.google.com/speech-api/v1/recognize
Method: POST
Headers:

Content-Type: audio/x-flac; rate=16000;
User-Agent: Copy from a recent Chrome browser


Query:

lang: pt-BR, en-US…
client: chromium
maxresults: integer (maximum number of hypotheses to return)


Payload: the FLAC file, sent as raw binary.

Call:
$ curl -X POST --data-binary @audio_file.flac \
--header "Content-Type: audio/x-flac; rate=16000;" \
--user-agent "speech2text" \
"https://www.google.com/speech-api/v1/recognize?client=chromium&lang=pt-BR&maxresults=10"
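
The same request can be assembled from Python's standard library. This is a sketch under assumptions, not the author's method: the function name `build_recognize_request` is hypothetical, the endpoint, headers and query values are copied from the curl call above, and the request is only prepared, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://www.google.com/speech-api/v1/recognize"


def build_recognize_request(flac_bytes, lang="pt-BR", maxresults=10):
    """Prepare (but do not send) the POST request described above."""
    query = urlencode({"client": "chromium", "lang": lang, "maxresults": maxresults})
    return Request(
        ENDPOINT + "?" + query,
        data=flac_bytes,  # raw FLAC bytes, like curl's --data-binary
        headers={
            "Content-Type": "audio/x-flac; rate=16000;",
            "User-Agent": "speech2text",
        },
        method="POST",
    )


# Placeholder bytes stand in for a real FLAC file read with open(..., "rb").
request = build_recognize_request(b"placeholder, not real audio")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it and return the response body.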

Response

{
    "status": 0,
    "id": "ecdd8041e3eaee77c6897d0939389e40-1",
    "hypotheses": [
        {
            "utterance": "são paulo sim não celular wesley",
            "confidence": 0.77939147
        },
        {
            "utterance": "são paulo sim ou não celular wesley"
        }
    ]
}
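
A response shaped like the one above can be reduced to its most likely transcription. The sketch below is illustrative only: `best_hypothesis` is a hypothetical helper, and treating a hypothesis without a `confidence` field as 0.0 is an assumption of this example, not documented API behavior.

```python
import json

# The sample response from the notes, verbatim.
SAMPLE = """
{
    "status": 0,
    "id": "ecdd8041e3eaee77c6897d0939389e40-1",
    "hypotheses": [
        {
            "utterance": "são paulo sim não celular wesley",
            "confidence": 0.77939147
        },
        {
            "utterance": "são paulo sim ou não celular wesley"
        }
    ]
}
"""


def best_hypothesis(body):
    """Return (utterance, confidence) for the top hypothesis, or None."""
    hypotheses = json.loads(body).get("hypotheses", [])
    if not hypotheses:
        return None
    # Assumption: a hypothesis with no confidence ranks below any scored one.
    top = max(hypotheses, key=lambda h: h.get("confidence", 0.0))
    return top["utterance"], top.get("confidence")


print(best_hypothesis(SAMPLE))
```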

Samples

Brazilian Portuguese

I asked some native Brazilian Portuguese speakers to say "São Paulo, sim, não, celular, [your name]", and got:

"são paulo sim não celular wesley", confidence: 0.77939147
"são paulo sim não celular rafael", confidence: 0.7373202
"são paulo sim não celular roberta", confidence: 0.6814435
"são paulo sim não celular pedro", confidence: 0.70905286
"são paulo sim não celular diogo", confidence: 0.71847606

All testers were recorded using the same device (iPhone 5) in similar noise conditions.
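
To summarize the five Brazilian Portuguese results, the confidences listed above can be averaged. This is simple arithmetic on the reported values; the dictionary keys are just the speakers' names from the transcriptions.

```python
from statistics import mean

# Confidence values copied from the five results above.
confidences = {
    "wesley": 0.77939147,
    "rafael": 0.7373202,
    "roberta": 0.6814435,
    "pedro": 0.70905286,
    "diogo": 0.71847606,
}

average = mean(confidences.values())
spread = max(confidences.values()) - min(confidences.values())
print(f"mean={average:.4f} spread={spread:.4f}")
# prints: mean=0.7251 spread=0.0979
```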

Spanish

Same test as above, but localized: "Santiago, sí, no, celular, [your name]":

"santiago si no celular fernanda", confidence: 0.6885493

Non-native speaker
Tested with Chile locale


"Santiago si no celular Eduardo", confidence: 0.7460225

Non-native speaker
Tested as plain Spanish, without a locale
The only case in which proper names were capitalized


Hebrew

Same test as above, but localized: "Tel Aviv, yes, no, cellphone, [your name]":

"תל אביב כן לא סלולרי שלומי אסף", confidence: 0.82569724

100% accurate


Japanese

Same test as above, but localized: "Tokyo, yes, no, cellphone, [your name]":

"東京 はい いいえ 携帯 ほな男", confidence: 0.4775479

Non-native speaker
Bad parse of the non-Japanese name

It didn't recognize the name as foreign
Should have been parsed as "ロナウド" instead of "ほな男"
Probably fine for Japanese names


All other words were fine
Confidence of 0.71819603 for the same audio without the name


Google Text to Speech API

API Call

Request


Endpoint: https://translate.google.com/translate_tts
Method: GET
Headers:

User-Agent: Copy from a recent Chrome browser


Query:

tl: pt-BR, en-US…
q: Text to be spoken


Call:
$ curl -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36' \
'http://translate.google.com/translate_tts?tl=pt-BR&q=Rio%20de%20Janeiro' > output.mp3
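
The text in `q` has to be URL-encoded, as the `%20` in the call above shows. A small sketch of building such a URL with Python's standard library follows; `tts_url` is a hypothetical helper name, and the endpoint and the `tl` and `q` parameters are taken from the notes above.

```python
from urllib.parse import quote, urlencode

TTS_ENDPOINT = "http://translate.google.com/translate_tts"


def tts_url(text, lang="pt-BR"):
    """Build the GET URL for the text-to-speech call described above."""
    # quote_via=quote encodes spaces as %20 (the default would use "+").
    return TTS_ENDPOINT + "?" + urlencode({"tl": lang, "q": text}, quote_via=quote)


print(tts_url("Rio de Janeiro"))
# prints: http://translate.google.com/translate_tts?tl=pt-BR&q=Rio%20de%20Janeiro
```

Non-ASCII text is percent-encoded as UTF-8, so `tts_url("São Paulo")` ends in `q=S%C3%A3o%20Paulo`.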

Response

An MP3 file, with the following specification:
Format:

Format: MP3
Sample Rate: 22050 Hz
Channels: 1 (mono)
Depth: 16 bit
Bitrate: 32 kbps

Gender/Notes:

Portuguese

pt-BR: female
pt-PT: female


English

en-US: female
en-GB: male


Spanish

es-MX: female
es-ES: female
es-CL: female


Japanese

ja-JP: female (doesn't understand romaji; it spells out every letter)


Hebrew

he: no output