ISCA Archive Interspeech 2025

LombardTokenizer: Disentanglement and Control of Vocal Effort in a Neural Speech Codec

Maxime Jacquelin, Maëva Garnier, Laurent Girin, Rémy Vincent, Olivier Perrotin

Disentangling distinct types of information in speech representations is crucial for improving speech synthesis and voice conversion systems. In this work, we introduce LombardTokenizer, a neural speech codec that separates features related to vocal effort from other acoustic (and semantic) information. The model builds on SpeechTokenizer, a codec proposed in the literature that uses multi-stage quantisation and isolates semantic content in its first quantisation layer. We show that the level of vocal effort can be effectively captured in the second quantisation layer by conditioning that layer on neural encoders trained to represent vocal effort. Experimental results demonstrate that the proposed method significantly outperforms existing methods in speech conversion between neutral and Lombard speech, while maintaining excellent speech synthesis quality, offering improved control over vocal effort and the naturalness of synthesised speech.
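To make the idea of a conditioned second quantisation stage concrete, the following is a minimal NumPy sketch of residual vector quantisation in which an external embedding biases the second layer. All names, dimensions, and the additive conditioning mechanism are illustrative assumptions for exposition; they are not the authors' implementation, which conditions the layer with trained neural effort encoders inside a full codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, codebook):
    """Nearest-neighbour lookup: return code indices and quantised vectors."""
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dist.argmin(axis=1)
    return idx, codebook[idx]

# Hypothetical sizes: 8 frames, 16-dim latents, 32-entry codebooks.
T, D, K = 8, 16, 32
z = rng.normal(size=(T, D))        # encoder output (stand-in)
cb1 = rng.normal(size=(K, D))      # layer 1: semantic codebook
cb2 = rng.normal(size=(K, D))      # layer 2: vocal-effort codebook

# Stage 1 quantises the latent; later stages quantise the residual.
idx1, q1 = quantize(z, cb1)
residual = z - q1

# Hypothetical conditioning: an effort embedding (stand-in for the output
# of a pretrained vocal-effort encoder) is added to the residual before
# the second stage, steering its codes toward effort-related features.
effort_emb = rng.normal(size=(D,))
idx2, q2 = quantize(residual + effort_emb, cb2)

reconstruction = q1 + q2           # two-stage decoder input
```

Swapping `effort_emb` at inference time is what would allow converting, say, neutral speech codes toward Lombard speech while the first-layer semantic codes stay fixed.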