We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
The illustration in figure 2 shows the F1 values in reversed order. The corrected figure is shown here: