@dannguyen
Last active March 2, 2022 09:31
How Google's text-to-speech API performs when reading the New York Times

Demo of Google text-to-speech Wavenet API on a NYT article

Was curious if Google's text-to-speech API might be good enough for generating audio versions of stories on-the-fly. Google has offered traditional computer voices for a while, but last year made available their premium WaveNet voices, which are trained using audio recorded from human speakers, and are purportedly capable of mimicking natural-sounding inflection and rhythm.

tl;dr results

Pretty good...but I honestly can't tell the difference between the standard voice and the WaveNet version, at least when it comes to intonation and inflection. The first 2 grafs of this NYT story, roughly 85 words/560 characters, took less than 2 seconds to process. The result in both cases is a 37-second audio file.

The text input is taken from the first 2 paragraphs from the story currently on the NYT's homepage: As McKinsey Sells Advice, Its Hedge Fund May Have a Stake in the Outcome (~85 words, ~560 characters):

The sins of Valeant Pharmaceuticals are well known. Instead of spending to develop new drugs, Valeant bought out other drugmakers, then increased prices of lifesaving medicines by as much as 5,785 percent. Patients had no choice but to pay.

Valeant’s chief executive, J. Michael Pearson, was hauled into a 2016 Senate hearing and verbally thrashed by lawmakers. “It’s using patients as hostages. It’s immoral,” said Claire McCaskill, then the Democratic senator from Missouri. One executive went to prison for fraud. The company’s share price collapsed.

More info

Google offers about 60 voices, including 28 WaveNet voices for English (and several European and Asian languages), male and female. The cost for WaveNet is $16 per 1 million characters, which is 4x the price of a standard voice. If you create a Google Cloud Platform account, the first million characters per month are free.
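To put those prices in perspective, here is a rough back-of-the-envelope sketch. The per-million rates are the ones quoted in this write-up and may have changed since; the `estimate_cost` helper is a made-up name for illustration, and it ignores the free monthly allowance.

```python
# Prices in USD per 1 million characters, as quoted in this write-up
# (assumption: they may be out of date by the time you read this).
PRICE_PER_MILLION = {
    "wavenet": 16.00,
    "standard": 4.00,
}


def estimate_cost(num_chars: int, voice_type: str = "wavenet") -> float:
    """Return the approximate cost in USD, ignoring any free tier."""
    return num_chars / 1_000_000 * PRICE_PER_MILLION[voice_type]


# The ~560-character excerpt used in this demo:
print(f"{estimate_cost(560, 'wavenet'):.5f}")   # 0.00896
print(f"{estimate_cost(560, 'standard'):.5f}")  # 0.00224
```

In other words, a two-paragraph excerpt costs under a cent even at the WaveNet rate.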

The v1 API itself is pretty straightforward. You use the text.synthesize POST method, which you can try in the GCP interactive console here.

   POST https://texttospeech.googleapis.com/v1/text:synthesize?fields=audioContent&key={YOUR_API_KEY}
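If you'd rather script the call than use the console, something like the following should work. It's a minimal sketch using only the standard library: the `build_request_body` and `synthesize` helpers are hypothetical names, the body mirrors the JSON request shown at the end of this write-up, and I'm assuming API-key authentication via the `key` query parameter as in the URL above.

```python
import json
import urllib.request

ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_request_body(text, voice_name="en-US-Wavenet-B", language_code="en-US"):
    """Assemble the JSON body for text:synthesize (same shape as the request in this gist)."""
    return {
        "input": {"text": text},
        "voice": {"name": voice_name, "languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }


def synthesize(text, api_key, outfile="response.json", **voice_kwargs):
    """POST the text and save the raw JSON response to `outfile` (hypothetical helper)."""
    url = f"{ENDPOINT}?fields=audioContent&key={api_key}"
    payload = json.dumps(build_request_body(text, **voice_kwargs)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    with open(outfile, "wb") as f:
        f.write(raw)
    return outfile


# Example (needs a real API key and network access, so it is left commented out):
# synthesize("The sins of Valeant Pharmaceuticals are well known.", "YOUR_API_KEY")
```

To compare against a non-WaveNet voice, pass a different `voice_name`; I believe the standard voices follow the pattern `en-US-Standard-B`, but check the voice list before relying on that.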

If you've downloaded the JSON response as response.json, you can deserialize it in Python like this:

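from base64 import b64decode
import json
from pathlib import Path

INFILE = 'response.json'
# The API returns the audio as a base64-encoded string under "audioContent"
data = json.loads(Path(INFILE).read_text())
audio = b64decode(data['audioContent'])
# Write the decoded bytes out as a playable MP3 file
Path('audio.mp3').write_bytes(audio)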
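If the request failed (bad API key, quota, etc.), the saved response won't have an `audioContent` field and the snippet above dies with a `KeyError`. Here's a slightly more defensive sketch of the same decoding step. The `save_audio` name is made up, and I'm assuming failures come back in the usual Google API shape of `{"error": {"message": ...}}`.

```python
from base64 import b64decode
import json
from pathlib import Path


def save_audio(response_path, audio_path="audio.mp3"):
    """Decode `audioContent` from a saved response; raise if it is missing."""
    data = json.loads(Path(response_path).read_text())
    if "audioContent" not in data:
        # Assumption: errors arrive as {"error": {"message": "..."}}
        message = data.get("error", {}).get("message", "no audioContent in response")
        raise ValueError(message)
    audio = b64decode(data["audioContent"])
    Path(audio_path).write_bytes(audio)
    return len(audio)  # number of bytes written
```

Calling `save_audio('response.json')` then either writes `audio.mp3` or raises with whatever message the API returned.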

You can try the API for yourself without a Google dev account, I think, by going to https://cloud.google.com/text-to-speech/ and scrolling about halfway down the page.

Amazon has its own text-to-speech service, which is named "Polly". I didn't bother trying to programmatically use the API because Polly's landing page is easy enough to cut-and-paste into. Polly is definitely more robotic-sounding than Google's WaveNet. And in this small sample text, it's less accurate on the proper nouns, e.g. pronouncing "Valeant" as VAIL-e-ent -- though that's less surprising than the fact that WaveNet somehow "knows" Valeant's correct pronunciation ("valiant"). Polly charges $4.00/million characters, which is the same as Google's standard (i.e. non-WaveNet) API, but as I mentioned above, I had a very hard time telling the difference between Google's premium WaveNet voice and its standard version.

The JSON request body for the WaveNet version:

{
  "voice": {
    "name": "en-US-Wavenet-B",
    "languageCode": "en-US"
  },
  "audioConfig": {
    "audioEncoding": "MP3"
  },
  "input": {
    "text": "The sins of Valeant Pharmaceuticals are well known. Instead of spending to develop new drugs, Valeant bought out other drugmakers, then increased prices of lifesaving medicines by as much as 5,785 percent. Patients had no choice but to pay.\n\nValeant’s chief executive, J. Michael Pearson, was hauled into a 2016 Senate hearing and verbally thrashed by lawmakers. “It’s using patients as hostages. It’s immoral,” said Claire McCaskill, then the Democratic senator from Missouri. One executive went to prison for fraud. The company’s share price collapsed."
  }
}