ISCA Archive Interspeech 2023

Segmental SpeechCLIP: Utilizing Pretrained Image-text Models for Audio-Visual Learning

Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

Visually grounded models learn from paired images and their spoken captions. Recently, there have been attempts to leverage pretrained image-text models, such as CLIP, to improve the performance of speech-based visually grounded models. However, most of these approaches use only the pretrained image encoder. Cascaded SpeechCLIP attempted to generate localized word-level information and to exploit both the pretrained image and text encoders, but it suffered a substantial drop in retrieval performance despite using both. Here, we propose a hierarchical segmental audio encoder that generates a sequence of word-like units from audio. We apply the pretrained CLIP text encoder on top of these word-like unit representations and show significant improvements over the cascaded variant of SpeechCLIP.
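
The sketch below illustrates the pipeline the abstract describes: a hierarchical audio encoder pools frame-level features into word-like unit embeddings, which are then projected into the input space of a frozen pretrained CLIP text encoder. This is a minimal PyTorch illustration, not the authors' released code; the module names (`SegmentalAudioEncoder`, `SegmentalSpeechCLIP`), dimensions, and the attention-based pooling into a fixed number of units are assumptions made for clarity, and the actual segmentation mechanism in the paper may differ.

```python
# Hedged sketch of the Segmental SpeechCLIP pipeline described in the abstract.
# All names, dimensions, and the pooling mechanism are illustrative assumptions.
import torch
import torch.nn as nn


class SegmentalAudioEncoder(nn.Module):
    """Hierarchical encoder: frame-level features -> word-like unit embeddings."""

    def __init__(self, feat_dim=768, unit_dim=512, num_units=8):
        super().__init__()
        self.frame_encoder = nn.GRU(feat_dim, unit_dim, batch_first=True)
        # Learned queries that pool variable-length frame sequences into a fixed
        # number of word-like units (an assumption for this sketch).
        self.unit_queries = nn.Parameter(torch.randn(num_units, unit_dim))
        self.pool = nn.MultiheadAttention(unit_dim, num_heads=8, batch_first=True)

    def forward(self, frames):                # frames: (B, T, feat_dim)
        h, _ = self.frame_encoder(frames)     # (B, T, unit_dim)
        q = self.unit_queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        units, _ = self.pool(q, h, h)         # (B, num_units, unit_dim)
        return units                          # word-like unit embeddings


class SegmentalSpeechCLIP(nn.Module):
    """Runs a frozen pretrained CLIP text encoder on top of word-like units."""

    def __init__(self, clip_text_encoder, unit_dim=512, clip_dim=512):
        super().__init__()
        self.audio_encoder = SegmentalAudioEncoder(unit_dim=unit_dim)
        # Project word-like units into the text encoder's input space before
        # reusing the frozen pretrained encoder on top of them.
        self.to_text_space = nn.Linear(unit_dim, clip_dim)
        self.text_encoder = clip_text_encoder  # pretrained, kept frozen
        for p in self.text_encoder.parameters():
            p.requires_grad = False

    def forward(self, frames):
        units = self.audio_encoder(frames)
        return self.text_encoder(self.to_text_space(units))
```

In training, the resulting caption-like embedding would typically be matched against CLIP image embeddings with a contrastive retrieval objective, which is the standard setup for visually grounded speech models of this kind.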