ISCA Archive Interspeech 2022

Wav2vec behind the Scenes: How end2end Models learn Phonetics

Teena tom Dieck, Paula Andrea Pérez-Toro, Tomas Arias, Elmar Noeth, Philipp Klumpp

End2end models have become extremely popular in recent years. Whilst they excel at tasks like acoustic modelling or full-fledged speech recognition, their decision-making process can be quite difficult to retrace due to their black-box character. As end2end models learn high-level feature extraction on the fly, outputs from hidden layers within the network have been used as feature vectors in various transfer learning studies. It is therefore crucial to understand how extracted hidden activations transport information collected from the signal. Furthermore, is the traditional categorization into feature extractor and temporal analysis still applicable to the sub-parts of end2end models? Using Wav2vec 2.0 as an example, we show how an acoustic model learns to perform a frequency analysis on a speech waveform. Our experiments also show that phonetic information about speech production is preserved in extracted feature vectors. Ultimately, our findings highlight how different parts of an end2end model encode information on entirely different levels. Whilst the influence of gender is quite large on early feature vectors, it vanishes after temporal contextualization. At the same time, hidden activations which include context information are superimposed with language-related patterns.
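The idea that a learned 1-D convolutional front-end can act as a frequency analyzer is easy to illustrate without the trained model. The sketch below is a hand-built analogue, not wav2vec 2.0's actual learned weights: a small bank of windowed sinusoids stands in for the conv kernels, and a strided convolution over a test tone shows that the filter tuned to the tone's frequency responds most strongly. All filter frequencies and sizes are illustrative assumptions.

```python
import numpy as np

def sine_filterbank(freqs, sr=16000, width=400):
    """Bank of Hann-windowed sinusoid filters -- a hand-built stand-in
    for learned 1-D conv kernels (illustrative, not trained weights)."""
    t = np.arange(width) / sr
    window = np.hanning(width)
    return np.stack([window * np.sin(2 * np.pi * f * t) for f in freqs])

def band_energies(signal, filters, hop=320):
    """Strided convolution + rectification: one energy per filter,
    averaged over all analysis frames."""
    frames = np.lib.stride_tricks.sliding_window_view(
        signal, filters.shape[1])[::hop]
    return np.abs(frames @ filters.T).mean(axis=0)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz test tone
freqs = [110, 220, 440, 880, 1760]
energies = band_energies(tone, sine_filterbank(freqs, sr))
print(freqs[int(np.argmax(energies))])  # the 440 Hz filter wins
```

In the actual model, of course, the kernels are not fixed sinusoids but are learned from data; the point of the paper is that they nevertheless converge toward performing a comparable frequency decomposition.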