@Racum
Last active December 19, 2015 16:58

Google Speech to Text API

Audio

Format:

  • Format: FLAC
  • Sample Rate: 16000 Hz
  • Channels: 1 (mono)
  • Depth: 16-bit
  • Bitrate: Unknown limits; worked fine at 160 kbps

Converting to FLAC:

$ ffmpeg -i input_file.m4a input_file.mp3  # If needed (I recorded on an iPhone)
$ sox input_file.mp3 -r 16000 -b 16 -c 1 input_file.flac
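To confirm that the converted file really matches the format listed above, the FLAC header can be inspected directly. The following is a minimal sketch, assuming the file starts with the `fLaC` marker followed by a STREAMINFO metadata block (no prepended tags); the function names `read_flac_streaminfo` and `matches_expected_format` are hypothetical helpers, not part of any tool mentioned here.

```python
import struct


def read_flac_streaminfo(data: bytes) -> dict:
    """Extract sample rate, channel count and bit depth from a FLAC header.

    Assumes `data` holds at least the first 26 bytes of the file and that
    the first metadata block is STREAMINFO, as the FLAC format requires.
    """
    if len(data) < 26:
        raise ValueError("need at least 26 bytes of the file")
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Low 7 bits of the block header byte hold the block type (0 = STREAMINFO).
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack: 20-bit rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    (packed,) = struct.unpack(">Q", data[18:26])
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
    }


def matches_expected_format(info: dict, rate: int = 16000,
                            channels: int = 1, depth: int = 16) -> bool:
    """Compare parsed header values with the format listed in these notes."""
    actual = (info["sample_rate"], info["channels"], info["bits_per_sample"])
    return actual == (rate, channels, depth)
```

A typical use would be to read the first 26 bytes of `input_file.flac` in binary mode, pass them to `read_flac_streaminfo`, and check the result with `matches_expected_format` before uploading.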

API Call

Request

  • Endpoint: https://www.google.com/speech-api/v1/recognize
  • Method: POST
  • Headers:
    • Content-Type: audio/x-flac; rate=16000;
    • User-Agent: Copy from a recent Chrome browser
  • Query:
    • lang: language code, e.g. pt-BR or en-US
    • client: chrome
    • maxresults: integer (maximum number of hypotheses to return)
  • Payload: FLAC binary file.

Call:

$ curl -X POST --data-binary @audio_file.flac \
--header "Content-Type: audio/x-flac; rate=16000;" \
--user-agent "speech2text" \
"https://www.google.com/speech-api/v1/recognize?client=chromium&lang=pt-BR&maxresults=10"
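The same request can be assembled from a script. This is a rough sketch using only Python's standard library; it builds the request object with the query parameters and headers from the curl call, but does not send it. The function name `build_recognize_request` and its default values are assumptions chosen to mirror the example above.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://www.google.com/speech-api/v1/recognize"


def build_recognize_request(audio: bytes, lang: str = "pt-BR",
                            client: str = "chromium", maxresults: int = 10,
                            rate: int = 16000,
                            user_agent: str = "speech2text") -> Request:
    """Build (but do not send) a POST request carrying FLAC audio."""
    query = urlencode({"client": client, "lang": lang, "maxresults": maxresults})
    return Request(
        f"{ENDPOINT}?{query}",
        data=audio,  # raw FLAC bytes, same as curl's --data-binary
        headers={
            "Content-Type": f"audio/x-flac; rate={rate};",
            "User-Agent": user_agent,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the response body is the JSON text described under "Response".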

Response

{
    "status": 0,
    "id": "ecdd8041e3eaee77c6897d0939389e40-1",
    "hypotheses": [
        {
            "utterance": "são paulo sim não celular wesley",
            "confidence": 0.77939147
        },
        {
            "utterance": "são paulo sim ou não celular wesley"
        }
    ]
}
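A response like the one above can be reduced to its top hypothesis with a few lines of Python. This sketch makes two assumptions that the sample suggests but does not confirm: that `status` 0 means success, and that hypotheses are ordered best first, with only the first one carrying a `confidence` value. The name `best_hypothesis` is a hypothetical helper.

```python
import json
from typing import Optional, Tuple


def best_hypothesis(response_text: str) -> Optional[Tuple[str, Optional[float]]]:
    """Return (utterance, confidence) for the first hypothesis, or None.

    Assumption: status 0 signals success and the first entry in
    "hypotheses" is the best one, as in the sample response.
    """
    payload = json.loads(response_text)
    if payload.get("status") != 0:
        return None
    hypotheses = payload.get("hypotheses") or []
    if not hypotheses:
        return None
    top = hypotheses[0]
    # Later hypotheses in the sample have no confidence, so use .get().
    return top["utterance"], top.get("confidence")
```

Returning `None` for both a non-zero status and an empty list keeps the caller's check simple; a stricter client might want to distinguish the two cases.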

Samples

Brazilian Portuguese

I asked some native Brazilian Portuguese speakers to say "São Paulo, sim, não, celular, [your name]", and got:

  • "são paulo sim não celular wesley", confidence: 0.77939147
  • "são paulo sim não celular rafael", confidence: 0.7373202
  • "são paulo sim não celular roberta", confidence: 0.6814435
  • "são paulo sim não celular pedro", confidence: 0.70905286
  • "são paulo sim não celular diogo", confidence: 0.71847606

All testers were recorded using the same device (iPhone 5) in similar noise conditions.

Spanish

Same test as above, but localized: "Santiago, si, no, celular, [your name]":

  • "santiago si no celular fernanda", confidence: 0.6885493
    • Non-native speaker
    • Tested with Chile locale
  • "Santiago si no celular Eduardo", confidence: 0.7460225
    • Non-native speaker
    • Tested as plain Spanish without locale
    • Only case where the proper names were capitalized

Hebrew

Same test as above, but localized: "Tel Aviv, yes, no, cellphone, [your name]":

  • "תל אביב כן לא סלולרי שלומי אסף", confidence: 0.82569724
    • 100% accurate

Japanese

Same test as above, but localized: "Tokyo, yes, no, cellphone, [your name]":

  • "東京 はい いいえ 携帯 ほな男", confidence: 0.4775479
    • Non-native speaker
    • Bad parse of the non-Japanese name
      • It didn't recognize the name as foreign
      • Should be parsed as "ロナウド" instead of "ほな男"
      • Probably fine for Japanese names
    • All other words were fine
    • Confidence of 0.71819603 for the same audio without the name

Google Text to Speech API

API Call

Request

  • Endpoint: https://translate.google.com/translate_tts
  • Method: GET
  • Headers:
    • User-Agent: Copy from a recent Chrome browser
  • Query:
    • tl: target language code, e.g. pt-BR or en-US
    • q: Text to be spoken

Call:

$ curl -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36' \
'http://translate.google.com/translate_tts?tl=pt-BR&q=Rio%20de%20Janeiro' > output.mp3
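When the text to be spoken contains spaces or accented characters, it has to be percent-encoded before it goes into the `q` parameter. The sketch below builds the same URL as the curl call using Python's standard library; `build_tts_url` is a hypothetical helper name, and the base URL is copied from the example above.

```python
from urllib.parse import quote, urlencode

TTS_ENDPOINT = "http://translate.google.com/translate_tts"


def build_tts_url(text: str, lang: str = "pt-BR") -> str:
    """Return the text-to-speech URL with `tl` and `q` percent-encoded.

    quote_via=quote encodes spaces as %20 (as in the curl example)
    instead of the "+" that urlencode uses by default.
    """
    return f"{TTS_ENDPOINT}?{urlencode({'tl': lang, 'q': text}, quote_via=quote)}"
```

For example, `build_tts_url("Rio de Janeiro")` produces the URL used in the curl call, and non-ASCII input such as "São Paulo" is encoded as UTF-8 bytes.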

Response

An MP3 file, with the following specification:

Format:

  • Format: MP3
  • Sample Rate: 22050 Hz
  • Channels: 1 (mono)
  • Depth: 16-bit
  • Bitrate: 32 kbps

Gender/Notes:

  • Portuguese
    • pt-BR: female
    • pt-PT: female
  • English
    • en-US: female
    • en-GB: male
  • Spanish
    • es-MX: female
    • es-ES: female
    • es-CL: female
  • Japanese
    • ja-JP: female (doesn't understand romaji, tries to spell every letter)
  • Hebrew
    • he: no output