The field of Text-to-Speech has improved considerably in recent
years thanks to deep learning techniques. Producing realistic
speech is now possible. As a consequence, research on the
control of expressiveness, which allows speech to be generated in different
styles or manners, has attracted increasing attention lately. Systems
able to control style have been developed and show impressive results.
However, the control parameters often consist of latent variables and
remain difficult to interpret.
In this paper, we
analyze and compare different latent spaces and obtain an interpretation
of their influence on expressive speech. This makes it possible
to build controllable speech synthesis systems with understandable
behaviour.