ISCA Archive Interspeech 2022

TRILLsson: Distilled Universal Paralinguistic Speech Representations

Joel Shor, Subhashini Venugopalan

Recent advances in self-supervision have dramatically improved the quality of speech representations. However, deployment of state-of-the-art embedding models on devices has been restricted by their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic speech models that are small and near state-of-the-art in performance. Our approach is based on knowledge distillation, and our models are distilled using only public data. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model achieves over 96% of the accuracy on 6 of 7 tasks, is less than 15% of the size of the original model (314MB vs 2.2GB), and is trained on 6.5% of the data. The smallest model achieves over 90% of the accuracy on 6 of 7 tasks and is 1% of the size (22MB). Our models outperform the 1.2GB open-source Wav2Vec 2.0 model on 5 of 7 tasks despite being less than a third of its size, and one of our models outperforms Wav2Vec 2.0 on both emotion recognition tasks despite being less than 4% of its size.
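To make the distillation setup concrete, the sketch below shows the general pattern of training a small student network to regress the embeddings of a large, frozen teacher. This is a minimal illustration only, not the authors' code: the layer sizes, optimizer settings, and random-waveform "data" are hypothetical stand-ins for the actual architectures and public audio corpora.

```python
# Minimal knowledge-distillation sketch (illustrative, not the paper's code):
# a small student is trained to match the embeddings of a frozen teacher.
import torch
import torch.nn as nn

# Hypothetical teacher (large) and student (small) embedding models.
teacher = nn.Sequential(nn.Linear(16000, 1024), nn.ReLU(), nn.Linear(1024, 1024))
student = nn.Sequential(nn.Linear(16000, 256), nn.ReLU(), nn.Linear(256, 1024))

teacher.eval()                      # the teacher stays frozen during distillation
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()              # regress teacher embeddings directly

for step in range(100):             # stand-in for iterating over public audio
    audio = torch.randn(8, 16000)   # batch of 1-second, 16 kHz waveforms
    with torch.no_grad():
        target = teacher(audio)     # teacher embedding is the distillation target
    pred = student(audio)           # student embedding
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this pattern the student never needs task labels; once distilled, its embeddings can be evaluated with lightweight classifiers on downstream benchmarks such as NOSS.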