ISCA Archive Odyssey 2024
ISCA Archive Odyssey 2024

Multimodal Learning of Speech and Speaker Representations

Joon Son Chung

Supervised learning with deep neural networks has brought phenomenal advances to speech recognition systems, but such systems rely heavily on annotated training datasets. On the other hand, humans naturally develop an understanding about the world through multiple senses even without explicit supervision. We attempt to mimic this human ability by leveraging the natural co-occurrence between audio and visual modalities. For example, a video of someone playing a guitar co-occurs with the sound of a guitar. Similarly, a person’s appearance is related to the person’s voice characteristics, and the words that they speak are correlated to their lip motion. We use unlabelled audio and video for self-supervised learning of speech and speaker representations. We will discuss the use of the learnt representations for speech-related downstream tasks such as automatic speech recognition, speaker recognition and lip reading.