Publication
Toward a speech synthesis guided by the modeling of unexpected events
Sébastien Le Maguer; Ingmar Steiner; Bernd Möbius
In: Antje Schweitzer; Grzegorz Dogil (eds.). Workshop on Modeling Variability in Speech. Workshop on Modeling Variability in Speech, October 1-2, Stuttgart, Germany, 10/2015.
Abstract
Over the last 30 years, text-to-speech (TTS) methodologies have evolved from the
selection of real units to the use of complex statistical modeling. However, all state-of-the-art TTS methodologies use descriptive features, extracted from the text, to achieve
the synthesis. Therefore, these features are as crucial as the modeling itself to improve
the quality of the achieved synthesis.
Currently, descriptive features are mainly derived from low-level linguistic information, such as syllable stress or content information of the word. In this study, we want
to capture prosodic effects by applying new descriptive features based on the surprisal
of the syllable or the word. Here, the concept of surprisal is borrowed from the field of
information theory. Its purpose is to quantify the unpredictability of an event. In practice, it is computed as the negative log probability of an event, given a specific context.
Our assumption is that the higher the surprisal of an event (i.e., the occurrence of a
syllable or a word), the stronger its effect on prosodic features.
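As a rough illustration of the definition above, surprisal can be sketched as the negative log probability of a token given its context. The abstract does not specify the language model, so the bigram context, the add-alpha smoothing, the base-2 logarithm, and all function names below are assumptions made only for this example.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and bigrams over tokenized sentences (words or syllables)."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence)  # "<s>" marks the sentence start
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def surprisal(prev, cur, context_counts, bigram_counts, vocab, alpha=1.0):
    """Surprisal in bits: -log2 P(cur | prev).

    Assumption: add-alpha smoothing over the observed vocabulary, so that
    unseen bigrams receive a finite surprisal.
    """
    numerator = bigram_counts[(prev, cur)] + alpha
    denominator = context_counts[prev] + alpha * len(vocab)
    return -math.log2(numerator / denominator)


# Toy corpus, invented for illustration only.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram_counts(corpus)
s_cat = surprisal("the", "cat", contexts, bigrams, vocab)
s_dog = surprisal("the", "dog", contexts, bigrams, vocab)
```

In this toy corpus, "dog" follows "the" less often than "cat" does, so it receives the higher surprisal. The same functions apply to syllable sequences if each sentence is given as a list of syllables.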
The MaryTTS system [1] provides a deeply modular speech synthesis framework
based on unit selection or hidden Markov model (HMM) based speech modeling. Using
this system, we can therefore study the influence of surprisal on the synthesis
achieved with both of these standard methodologies. To this end, we adopt the
baseline descriptive feature set already used in MaryTTS.
We first propose to enrich the baseline descriptive feature set by adding (a) the
surprisal of the syllable; (b) the surprisal of the word; (c) the surprisal of both the
syllable and the word.
In order to analyze the use of such descriptive features in place of traditional ones, we
also propose two alternatives to the baseline descriptive feature set. These alternatives
are obtained by (1) replacing the accent information with the surprisal of the syllable;
(2) replacing the content information with the surprisal of the word.
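The five configurations described above, together with the baseline, can be enumerated as in the following sketch. The feature names are hypothetical placeholders and are not the actual MaryTTS feature identifiers; only the structure of the combinations is taken from the text.

```python
# Hypothetical placeholder names; NOT the real MaryTTS feature identifiers.
BASELINE = ("phone", "syllable_stress", "accent", "content_word")
SYLLABLE_SURPRISAL = "syllable_surprisal"
WORD_SURPRISAL = "word_surprisal"


def swap(features, old, new):
    """Return a copy of the feature tuple with `old` replaced by `new`."""
    if old not in features:
        raise ValueError(f"feature {old!r} not in baseline set")
    return tuple(new if f == old else f for f in features)


def build_feature_sets(baseline, accent="accent", content="content_word"):
    """Build the baseline, the three enriched sets (a-c), and the two alternatives (1-2)."""
    baseline = tuple(baseline)
    return {
        "baseline": baseline,
        "a_add_syllable_surprisal": baseline + (SYLLABLE_SURPRISAL,),
        "b_add_word_surprisal": baseline + (WORD_SURPRISAL,),
        "c_add_both_surprisals": baseline + (SYLLABLE_SURPRISAL, WORD_SURPRISAL),
        "alt_1_accent_to_syllable_surprisal": swap(baseline, accent, SYLLABLE_SURPRISAL),
        "alt_2_content_to_word_surprisal": swap(baseline, content, WORD_SURPRISAL),
    }


feature_sets = build_feature_sets(BASELINE)
```

Keeping each configuration as a named tuple of features makes it straightforward to train one voice per configuration and compare the resulting syntheses.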
Consequently, using these combinations, we expect to characterize the influence of
surprisal on the achieved synthesis. We also plan to assess the use of such high-level
descriptive features in a speech synthesis task.