Daisy Pipeline2 1.10 TTS configuration


Minimal Requirements

The audio encoder Lame must be installed. Lame’s location must be in the system PATH (i.e. $PATH on Unix, %PATH% on Windows), unless it is provided via the 'lame.path' property.

In addition, one of the following text-to-speech processors must be installed:

For Unix users,

  • Acapela;

  • eSpeak.

For Windows users,

  • eSpeak;

  • SAPI with adequate voices.

For MacOS users,

  • Say.

It is strongly recommended to install eSpeak anyway, as it can handle almost any language. As with Lame, if eSpeak is installed in a dedicated directory, you may need to append that directory to your PATH. On Unix systems, 'apt-get install' takes care of installing eSpeak to a location already on the PATH. On Windows, however, the PATH variable must be changed manually using the 'Environment Variables' panel.

Configuration File

The text-to-speech (TTS) features are integrated into the following scripts:

  • dtbook-to-daisy3

  • zedai-to-epub3

  • dtbook-to-epub3

All of these scripts accept an optional configuration file. Here is an example configuration file:

<config xmlns="http://example">
  <property key="log" value="true"/>
  <voice engine="acapela" name="manon" gender="female-adult" priority="100" lang="fr"/>
  <lexicon href="lexicon-1.pls"/>
  <lexicon href="lexicon-2.pls"/>
  <css href="css-for-dtbooks.css"/>
  <css href="css-for-zedai.css"/>
</config>

Both relative and absolute paths are accepted as values of the "href" attributes. Relative paths are resolved relative to the configuration file’s location. Absolute paths work only in local mode.

The elements can be put in any namespace since namespaces aren’t checked. If there is any syntax error in the file, you will be notified in the server’s logs.

Properties

The syntax for setting properties is as follows:

<config xmlns="http://example">
  <property key="espeak.path" value="/usr/bin/espeak"/>
</config>

Other supported properties are:

  • "lame.path"

  • "lame.cli.options"

  • "sapi.bytespersample" (see below for a description)

  • "sapi.samplerate"

  • "log"

  • "threads.number": min number of threads

Properties cannot be freely changed unless one of these two conditions is met:

  • host.protection is set to false in the system.properties file

  • the path of the configuration file is specified via the property "tts.config" of system.properties
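For illustration, several of these properties can be combined in one configuration file. The paths and values below are assumptions for a typical Unix setup, not defaults:

<config xmlns="http://example">
  <!-- illustrative values; adjust the path and options to your installation -->
  <property key="lame.path" value="/usr/local/bin/lame"/>
  <property key="lame.cli.options" value="-b 160"/>
  <property key="threads.number" value="4"/>
  <property key="log" value="true"/>
</config>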

PLS Lexicons

Lexicons are configured using the 'lexicon' elements. If the "href" attribute is missing, the pipeline will read the lexicons inside the config nodes, as in this example:

<config>
  <lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
   <lexeme>
      ...
   </lexeme>
  </lexicon>
</config>

This feature lets you substitute words with custom pronunciation respellings and IPA phonemes. It is meant to help TTS processors deal with ambiguous abbreviations and pronunciation of proper names. The lexicons follow the Pronunciation Lexicon Specification Version 1.0, extended with XPath-regex matching.
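For instance, a minimal lexicon (standalone here, but it could equally be inlined in a 'lexicon' element of the configuration file) could expand an ambiguous abbreviation with an alias and force the pronunciation of a proper name with IPA phonemes. The entries below are purely illustrative:

<lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
 <!-- expand an ambiguous abbreviation -->
 <lexeme>
    <grapheme>St.</grapheme>
    <alias>Saint</alias>
 </lexeme>
 <!-- force the pronunciation of a proper name -->
 <lexeme>
    <grapheme>Worcester</grapheme>
    <phoneme>ˈwʊstər</phoneme>
 </lexeme>
</lexicon>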

Set the 'regex' attribute to 'true' to enable regex matching, as follows:

<lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
 <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
 </lexeme>
</lexicon>

The regex feature works only with alias-based substitutions.

Whether or not the regex attribute is set to 'true', the grapheme matching can be made more accurate by specifying the 'positive-lookahead' and 'negative-lookahead' XPath-regex attributes:

<lexicon version="1.0" alphabet="ipa" xml:lang="en"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme positive-lookahead="[ ]+is">SB</grapheme>
    <alias>somebody</alias>
  </lexeme>
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
  <lexeme xml:lang="fr">
    <grapheme positive-lookahead="[ ]+[cC]ity">boston</grapheme>
    <phoneme>bɔstøn</phoneme>
  </lexeme>
</lexicon>

Graphemes with 'positive-lookahead' will match if what follows begins with the regex of 'positive-lookahead'. Graphemes with 'negative-lookahead' will match if what follows does not begin with the regex of 'negative-lookahead'. The lookaheads are case-sensitive, while the grapheme contents are not.

The lexemes are reorganized so as to be matched in this order:

  1. graphemes with regex='false', whether or not they have a look-ahead;

  2. graphemes with regex='true' and no look-ahead;

  3. graphemes with regex='true' and one or two look-aheads.

Within these categories, lexemes are matched in the same order as they appear in the lexicons.
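To make this ordering concrete, here is a hypothetical lexicon annotated with the category each lexeme falls into; the graphemes and aliases are made up for illustration:

<lexicon version="1.0" alphabet="ipa" xml:lang="en"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
 <!-- category 3: regex grapheme with a look-ahead, tried last -->
 <lexeme regex="true">
    <grapheme positive-lookahead="%">([0-9]+)</grapheme>
    <alias>$1 percent</alias>
 </lexeme>
 <!-- category 2: regex grapheme without look-ahead, tried second -->
 <lexeme regex="true">
    <grapheme>no\. ([0-9]+)</grapheme>
    <alias>number $1</alias>
 </lexeme>
 <!-- category 1: regex='false', tried first even though it is declared last -->
 <lexeme regex="false">
    <grapheme>no. 10</grapheme>
    <alias>number ten</alias>
 </lexeme>
</lexicon>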

Aural CSS

The text-to-speech voices and prosody can be configured with Aural CSS. To do so, add one or more 'css' elements to the configuration file. If the href attribute is missing, the stylesheet is read inline from the 'css' element itself:

<config>
  <css>
    p {
      volume: soft;
      voice-family: female;
    }
  </css>
</config>

This config entry will cause the pipeline to apply a local Aural CSS stylesheet to the input document. Only a subset of Aural CSS 2.1 is supported; an example combining several of these properties follows the list:

  • speak: none | spell-out

  • pause, pause-before, pause-after: <duration>

  • volume: <number> | silent | x-soft | soft | medium | loud | x-loud

  • pitch: x-low | low | medium | high | x-high

  • pitch-range: <number>

  • speech-rate: <number> | x-slow | slow | medium | fast | x-fast

  • speak-numeral: digits | continuous

  • voice-family

  • cue, cue-before, cue-after: url(<uri>)
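As a sketch, an inline stylesheet could combine pauses, speaking rate and audio cues. The selectors, values and the cue file name below are illustrative:

<config>
  <css>
    /* illustrative stylesheet: slow down headings and announce them with a sound cue */
    h1 {
      speech-rate: slow;
      pause-before: 1s;
      cue-before: url(chime.wav);
    }
    /* spell out elements marked up with this class */
    .acronym {
      speak: spell-out;
    }
  </css>
</config>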

In addition to this option, local CSS stylesheets referenced in processing instructions and linked in the document headers will also be loaded.

Inspired by the specifications of CSS3 Speech, 'voice-family' is a comma-separated list of voice characteristics that place conditions on the voice selection.

If a full voice name is provided, e.g. "acapela, alice", this voice will be selected regardless of the document language. If this voice is not available, a fallback voice will be chosen that matches the characteristics of the requested voice: same language, same engine, same gender. If none is available, the pipeline broadens its search by relaxing the criteria: first the gender is relaxed, then the engine.

If no voice name is provided, e.g. "acapela" alone, "female" or "female, old", the selection algorithm will consider only the voices that match the current language. It starts by looking for a voice with the specified gender supplied by the specified engine, and broadens to any gender if the first search yields no results. If neither the gender nor the engine match, language will be the only criterion.
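For instance, a stylesheet inlined in the configuration file could request a specific voice for paragraphs and only voice characteristics for headings. The selectors below are illustrative, and the voice names are those used as examples above:

<config>
  <css>
    /* full voice name: the 'alice' voice of the 'acapela' engine, regardless of language */
    p {
      voice-family: acapela, alice;
    }
    /* characteristics only: any female voice matching the document language */
    h1 {
      voice-family: female;
    }
  </css>
</config>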

When multiple voices match the criteria, the algorithm chooses the voice with the highest priority. A set of default priorities is embedded in the pipeline, though they can be overridden via the 'voice' entries of the configuration file, as follows:

<config>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>

Notice that this is also a convenient way to add voices that are not natively supported by the Pipeline. In the example above, Todd is now a registered voice and, as such, can be selected automatically by the Pipeline when the document is written in English.

AT&T, eSpeak and Acapela voice names can be found in their respective documentation. For Windows users, SAPI voices are enumerated in the system settings, e.g. Start > All Control Panel Items > Speech Recognition > Advanced Speech Options. You will also need to know the value of the "engine" attribute, which must be one of the following:

  • 'att' for AT&T voices;

  • 'espeak' for eSpeak voices;

  • 'acapela' for Acapela voices;

  • 'osx-speech' for Apple voices;

  • 'sapi' for Microsoft voices or for any other voice installed to work with the SAPI engine, including some versions of AT&T and Acapela’s products.

In case of any doubt, engines and voice names can be retrieved from the server’s log in which all the voices are enumerated:

Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna

Annotations (Daisy3 output only)

Not documented yet.

SAPI configuration

SAPI’s configuration works a bit differently. It has two special properties which cannot be overridden at runtime. They are set once and for all the first time SAPI is used:

  • 'sapi.bytespersample': output number of bytes per audio sample (default=2)

  • 'sapi.samplerate': SAPI's output sample rate

The server must be restarted to change those properties.
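For instance, these two properties could be set in the configuration file as follows; the sample rate value is illustrative, not a recommended setting:

<config xmlns="http://example">
  <!-- read only the first time SAPI is initialized -->
  <property key="sapi.bytespersample" value="2"/>
  <property key="sapi.samplerate" value="22050"/>
</config>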

If you have installed new voices and one of them cannot interpret Speech Synthesis Markup Language (SSML) or mark-based audio synchronization, SAPI's initialization may fail. To deal with this kind of voice, you can switch off SSML marks in the configuration file:

<config>
  <voice engine="sapi" name="Not-A-Microsoft-Voice" marks="false" gender="male-adult" priority="100" lang="en"/>
</config>

The Log File (Daisy3 output only)

This is an optional entry of the configuration file.

<config>
  <property key="log" value="true"/>
</config>

This causes the Pipeline to write a file named 'tts-log.xml' to the output directory, containing a great deal of information that can be quite helpful for troubleshooting. Most of the log entries concern particular chunks of text from the input document. For more general errors, see the main server's logs.
