@rom1mouret
Last active June 26, 2018 06:11
Daisy Pipeline 2 TTS Configuration

TTS Configuration

Minimal Requirements

The audio encoder Lame must be installed. Lame’s location must be in the system PATH (i.e. $PATH on Unix, %PATH% on Windows), unless it is provided via the 'lame.path' system property.

In addition, one of the following text-to-speech processors must be installed:

For Unix users,

  • Acapela;

  • eSpeak.

For Windows users,

  • eSpeak;

  • SAPI with adequate voices.

For MacOS users,

  • Say.

Installing eSpeak is strongly recommended in any case, as it supports a very wide range of languages. As with Lame, if eSpeak is installed in a dedicated directory, you may need to append that directory to your PATH. On Unix systems that use apt, 'apt-get install' installs eSpeak to a known location. On Windows, however, the PATH variable must be changed manually using the 'environment variables' panel.

Configuration File

The text-to-speech (TTS) features are integrated in the following scripts:

  • dtbook-to-daisy3

  • zedai-to-epub3

  • dtbook-to-epub3

Each of these scripts accepts an optional configuration file. Here is an example configuration file:

<config xmlns="http://example">
  <property key="log" value="true"/>
  <voice engine="acapela" name="manon" gender="female-adult" priority="100" lang="fr"/>
  <lexicon href="lexicon-1.pls"/>
  <lexicon href="lexicon-2.pls"/>
  <css href="css-for-dtbooks.css"/>
  <css href="css-for-zedai.css"/>
</config>

Both relative and absolute paths are accepted as a value of the "href" attributes. Relative paths are relative to the configuration file’s location. Absolute paths work only in local mode.

The elements can be put in any namespace since namespaces aren’t checked. If there is any syntax error in the file, you will be notified in the server’s logs.

PLS Lexicons

Lexicons are configured using the 'lexicon' elements. If the "href" attribute is missing, the pipeline reads the lexicon inlined in the configuration file, as in this example:

<config>
  <lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
   <lexeme>
      ...
   </lexeme>
  </lexicon>
</config>

This feature lets you substitute words with custom pronunciation respellings and IPA phonemes. It is meant to help TTS processors deal with ambiguous abbreviations and pronunciation of proper names. The lexicons follow the Pronunciation Lexicon Specification Version 1.0, extended with XPath-regex matching.

Set the 'regex' attribute to 'true' to enable regex matching, as follows:

<lexicon version="1.0" alphabet="ipa" xml:lang="en" xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
 <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
 </lexeme>
</lexicon>

The regex feature works only with alias-based substitutions.

Whether or not the regex attribute is set to 'true', the grapheme matching can be made more accurate by specifying the 'positive-lookahead' and 'negative-lookahead' XPath-regex attributes:

<lexicon version="1.0" alphabet="ipa" xml:lang="en"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme positive-lookahead="[ ]+is">SB</grapheme>
    <alias>somebody</alias>
  </lexeme>
  <lexeme>
    <grapheme>SB</grapheme>
    <alias>should be</alias>
  </lexeme>
  <lexeme xml:lang="fr">
    <grapheme positive-lookahead="[ ]+[cC]ity">boston</grapheme>
    <phoneme>bɔstøn</phoneme>
  </lexeme>
</lexicon>

Graphemes with 'positive-lookahead' will match if what follows begins with the regex of 'positive-lookahead'. Graphemes with 'negative-lookahead' will match if what follows does not begin with the regex of 'negative-lookahead'. The lookaheads are case-sensitive while the grapheme contents are not.
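
None of the examples above uses 'negative-lookahead'. As a minimal sketch, assuming it is written on the 'grapheme' element in the same way as 'positive-lookahead', the following hypothetical lexicon expands "St" to "Saint" when a capitalized word follows and to "Street" otherwise. The abbreviation and its expansions are illustrative only.

```xml
<lexicon version="1.0" alphabet="ipa" xml:lang="en"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <!-- Matches "St" only when spaces and an uppercase letter follow -->
  <lexeme>
    <grapheme positive-lookahead="[ ]+[A-Z]">St</grapheme>
    <alias>Saint</alias>
  </lexeme>
  <!-- Matches "St" only when what follows does NOT begin that way -->
  <lexeme>
    <grapheme negative-lookahead="[ ]+[A-Z]">St</grapheme>
    <alias>Street</alias>
  </lexeme>
</lexicon>
```

Because the lookaheads are case-sensitive, [A-Z] matches uppercase letters only, whereas the grapheme "St" itself is matched without regard to case.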

The lexemes are reordered so that they are matched in this order: 1. Graphemes with regex='false' come first, with or without a lookahead; 2. Graphemes with regex='true' and no lookahead; 3. Graphemes with regex='true' and one or two lookaheads.

Within these categories, lexemes are matched in the same order as they appear in the lexicons.
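
To illustrate the reordering, here is a hypothetical lexicon whose lexemes appear in the reverse of their matching order. The comments state where each lexeme would be tried according to the three categories above; the graphemes and aliases are made up for the example.

```xml
<lexicon version="1.0" alphabet="ipa" xml:lang="en"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <!-- Category 3: regex with a lookahead, tried last -->
  <lexeme regex="true">
    <grapheme positive-lookahead="[ ]+[0-9]">No\.</grapheme>
    <alias>number</alias>
  </lexeme>
  <!-- Category 2: regex without a lookahead, tried second -->
  <lexeme regex="true">
    <grapheme>([0-9]+)-([0-9]+)</grapheme>
    <alias>between $1 and $2</alias>
  </lexeme>
  <!-- Category 1: no regex, tried first even though it appears last -->
  <lexeme>
    <grapheme>km</grapheme>
    <alias>kilometers</alias>
  </lexeme>
</lexicon>
```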

Aural CSS

The text-to-speech voices and prosody can be configured with Aural CSS. To do so, add one or more 'css' elements to the configuration file. If the href attribute is missing, the CSS stylesheets will be interpreted as inlined in the configuration file:

<config>
  <css>
    p {
      volume: soft;
      voice-family: female;
    }
  </css>
</config>

This config entry causes the pipeline to apply a local Aural CSS stylesheet to the input document. Only a subset of Aural CSS 2.1 is supported:

  • speak: none | spell-out

  • pause, pause-before, pause-after: <duration>

  • volume: <number> | silent | x-soft | soft | medium | loud | x-loud

  • pitch: x-low | low | medium | high | x-high

  • pitch-range: <number>

  • speech-rate: <number> | x-slow | slow | medium | fast | x-fast

  • speak-numeral: digits | continuous

  • voice-family

  • cue, cue-before, cue-after: url(<uri>)

In addition to this option, local CSS stylesheets referenced in the Processing-Instructions and by links in the headers will be loaded too.
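
As a sketch of how the supported properties can be combined, the following inlined stylesheet uses only properties and values from the list above. The selectors (h1, acronym, pagenum) are assumptions about the element names in the input document and should be adapted to it; the duration value is arbitrary.

```xml
<config>
  <css>
    /* Pause before headings and read them lower and slower */
    h1 {
      pause-before: 1s;
      pitch: low;
      speech-rate: slow;
    }
    /* Spell acronyms letter by letter */
    acronym {
      speak: spell-out;
    }
    /* Do not speak page numbers */
    pagenum {
      speak: none;
    }
  </css>
</config>
```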

Inspired by the specifications of CSS3 Speech, 'voice-family' is a comma-separated list of voice characteristics that place conditions on the voice selection.

If a full voice name is provided, e.g. "acapela, alice", this voice will be selected regardless of the document language. If this voice is not available, a fallback voice is chosen that shares the characteristics of the requested voice: same language, same engine, same gender. If none is available, the pipeline broadens its search by relaxing the criteria: first the gender, then the engine.

If no voice name is provided, e.g. "acapela" alone, "female" or "female, old", the selection algorithm considers only the voices that match the current language. It starts by looking for a voice with the specified gender and supplied by the specified engine, and broadens the search to any gender if that yields no result. If neither the gender nor the engine matches, language is the only criterion.
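
The two cases above can be sketched with an inlined stylesheet. The voice-family values are the ones quoted in the text, written unquoted as in the earlier 'voice-family: female' example; the selectors are illustrative and the "alice" voice is assumed to be installed.

```xml
<config>
  <css>
    /* Full voice name (engine, name): selected regardless of the document language */
    h1 {
      voice-family: acapela, alice;
    }
    /* No voice name: only voices of the current language are considered */
    p {
      voice-family: female;
    }
  </css>
</config>
```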

When multiple voices match the criteria, the algorithm chooses the voice with the highest priority. A set of default priorities is built into the pipeline; they can be overridden via the 'voice' entries of the configuration file, as follows:

<config>
  <voice engine="sapi" name="Microsoft Todd" gender="male-adult" priority="100" lang="en"/>
</config>

Note that this is also a convenient way to add voices that are not natively supported by the Pipeline. In the example above, Todd is now a registered voice and, as such, can be selected automatically by the Pipeline when the document is written in English.
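
As a further sketch, two 'voice' entries for the same language can be given different priorities so that one is preferred when both match the criteria. The voice names below are illustrative and must correspond to voices actually installed; the priority values are arbitrary, with the higher one taken to win as described above.

```xml
<config>
  <!-- Preferred French voice when both entries match -->
  <voice engine="acapela" name="manon" gender="female-adult" priority="100" lang="fr"/>
  <!-- Lower priority: a candidate only when the voice above does not match or is unavailable -->
  <voice engine="espeak" name="french" gender="male-adult" priority="10" lang="fr"/>
</config>
```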

AT&T, eSpeak and Acapela’s voice names can be found in their corresponding documentation. For Windows users, SAPI voices are enumerated in the system settings, e.g. Start > All Control Panel Items > Speech Recognition > Advanced Speech Options. You will also need to know the value of the "engine" attribute. This attribute must take as value one of the following:

  • 'att' for AT&T voices;

  • 'espeak' for eSpeak voices;

  • 'acapela' for Acapela voices;

  • 'osx-speech' for Apple voices;

  • 'sapi' for Microsoft voices or for any other voice installed to work with the SAPI engine, including some versions of AT&T and Acapela’s products.

In case of any doubt, engines and voice names can be retrieved from the server’s log in which all the voices are enumerated:

Available voices:
* {engine:'sapi', name:'NTMNTTS Voice (Male)'} by sapi-native
* {engine:'acapela', name:'alice'} by acapela-jna

Annotations (Daisy3 output only)

Not documented yet.

SAPI Configuration

SAPI’s configuration works a bit differently. It has two special properties which cannot be overridden at runtime. They are set once and for all the first time SAPI is used:

'sapi.bytespersample': output number of bytes per audio sample (default=2)

'sapi.samplerate': SAPI’s output sample rate

The server must be restarted to change those properties. In addition, SAPI can be configured via two regular properties:

'sapi.priority': SAPI’s overall priority

'sapi.handle.marks': you may set this property to 'false' if you have installed new voices and one of them can’t interpret Speech Synthesis Markup Language (SSML) or mark-based audio synchronization, which may compromise SAPI’s initialization. Microsoft’s built-in voices do handle SSML and marks.
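
The text does not say where the two regular properties are set. As a sketch only, assuming they are accepted as 'property' entries of the configuration file in the same way as the "log" property, and using an arbitrary priority value:

```xml
<config>
  <!-- Assumption: regular SAPI properties use the same 'property' syntax as "log" -->
  <property key="sapi.priority" value="50"/>
  <!-- Disable marks if an installed voice cannot handle SSML or marks -->
  <property key="sapi.handle.marks" value="false"/>
</config>
```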

Lame Configuration

Lame may become partially configurable through the configuration file in future versions of the Pipeline. For now, Lame’s options must be configured using static system properties. The server loads them once when the JVM starts, and they cannot be changed at runtime. They are stored in etc/system.properties of the pipeline distribution.

'lame.path': path of Lame’s binary (if not already in the system PATH)

'lame.options': command line options

The Log File (Daisy3 output only)

This is an optional entry of the configuration file.

<config>
  <property key="log" value="true"/>
</config>

With this entry, the Pipeline writes a log file named 'tts-log.xml' to the output directory. The Pipeline logs a great deal of information to this file, which can be quite helpful for troubleshooting. Most of the log entries concern particular chunks of text of the input document. For more general errors, see the main server’s logs.
