We propose a novel semi-supervised technique that enables expressive
style control and cross-speaker transfer in neural text-to-speech (TTS)
when the available training data contains only a limited amount of labeled expressive
speech from a single speaker. The technique learns, without supervision,
a style-related latent space generated by a previously proposed
reference audio encoding technique, and then transforms it by means of
Principal Component Analysis (PCA) into another low-dimensional space.
The latter space represents style information in a purified form, disentangled
from text and speaker-related information. Encodings for expressive
styles that are present in the training data are easily constructed
in this space. Furthermore, the technique provides control over
speech rate, pitch level, and articulation type, which can be used for
TTS voice transformation.
We present the results
of subjective crowd evaluations confirming that the synthesized speech
convincingly conveys the desired expressive styles and preserves a
high level of quality.
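
The latent-space transformation described above can be illustrated with a minimal sketch. It is not the authors' implementation. The random vectors stand in for reference-encoder outputs, the dimensions (32 inputs, 4 principal components) are arbitrary, and building a style encoding as the centroid of the labeled utterances is an assumption made only for illustration, since the abstract does not specify how the encodings are constructed.

```python
import numpy as np

def fit_pca(embeddings, n_components):
    """Return the mean and the top principal directions of the embeddings."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(embeddings, mean, components):
    """Map embeddings into the low-dimensional PCA space."""
    return (embeddings - mean) @ components.T

def reconstruct(coords, mean, components):
    """Map PCA-space coordinates back to the original embedding space."""
    return coords @ components + mean

rng = np.random.default_rng(0)

# Hypothetical stand-ins for reference-encoder outputs (assumption):
# 150 neutral utterances and 50 labeled expressive utterances, 32-dim each.
neutral = rng.normal(0.0, 1.0, size=(150, 32))
style_offset = np.zeros(32)
style_offset[:2] = 4.0
expressive = rng.normal(0.0, 1.0, size=(50, 32)) + style_offset

embeddings = np.vstack([neutral, expressive])
labels = np.array(["neutral"] * 150 + ["expressive"] * 50)

# PCA is fitted on all utterances without using the labels.
mean, components = fit_pca(embeddings, n_components=4)
coords = project(embeddings, mean, components)

# Assumed construction: one encoding per labeled style, taken as the
# centroid of that style's utterances in PCA space.
style_codes = {name: coords[labels == name].mean(axis=0)
               for name in np.unique(labels)}

# An embedding that could condition a synthesizer on the expressive style.
style_embedding = reconstruct(style_codes["expressive"], mean, components)
```

In this sketch the labels are used only after the PCA fit, which mirrors the semi-supervised setting: the space is learned from all data, and the small labeled subset serves only to locate each style within it.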