ISCA Archive Interspeech 2024

Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis

Christina Tånnander, Shivam Mehta, Jonas Beskow, Jens Edlund

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
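As a rough illustration of the input representation described above, the sketch below maps phonemes to positions on feature axes in the range 0.0 to 1.0 and sweeps one feature in 9 evenly spaced steps. The feature names, the phoneme positions, and the helper functions `feature_vector` and `sweep` are hypothetical and chosen only for illustration; the paper's actual 11 features and their values are not reproduced here, and the even spacing of the 9 settings is an assumption.

```python
from typing import Dict, List

# Hypothetical subset of feature axes (the paper uses 11; these names are assumed).
FEATURES = ["v_height", "v_backness", "voicing"]

# Hypothetical phoneme positions in [0.0, 1.0]; illustrative values only.
PHONEMES: Dict[str, Dict[str, float]] = {
    "ɪ": {"v_height": 0.75, "v_backness": 0.25, "voicing": 1.0},
    "ɛ": {"v_height": 0.5, "v_backness": 0.25, "voicing": 1.0},
    "æ": {"v_height": 0.25, "v_backness": 0.25, "voicing": 1.0},
}


def feature_vector(phoneme: str) -> List[float]:
    """Return the phoneme's position on each feature axis, in FEATURES order."""
    return [PHONEMES[phoneme][name] for name in FEATURES]


def sweep(start: str, end: str, feature: str, steps: int = 9) -> List[List[float]]:
    """Move one feature linearly from the start phoneme's value to the end
    phoneme's value, keeping all other features at the start phoneme's values."""
    base = PHONEMES[start]
    target = PHONEMES[end][feature]
    vectors = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first step, 1.0 at the last
        values = dict(base)
        values[feature] = base[feature] + t * (target - base[feature])
        vectors.append([values[name] for name in FEATURES])
    return vectors


if __name__ == "__main__":
    for vec in sweep("ɪ", "æ", "v_height"):
        print(vec)
```

Each resulting vector would stand in for one synthesis condition; how such vectors are actually fed to Matcha-TTS is not specified in this text and is not modelled here.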

Erratum

The illustration in Figure 2 shows the F1 values in reversed order. The corrected figure is shown here:

Figure 2: Counts of perception of /ɪ, ɛ, æ/ (left y-axis) for each of the 9 settings (x-axis) of V-HEIGHT. Training target phoneme labels are aligned with their rank order. Average F1 values are shown as black dots (right y-axis).