ISCA Archive Interspeech 2025

Pull It Together: Reducing the Modality Gap in Contrastive Learning

Amit Sofer, Yoav Goldman, Shlomo E. Chazan

Contrastive learning has become a powerful strategy for aligning different modalities in a shared embedding space. Contrastive Language–Image Pre-training (CLIP) has achieved remarkable performance across various downstream tasks, and this methodology has been extended to the audio-text domain through Contrastive Language–Audio Pre-training (CLAP), demonstrating strong performance in related tasks. However, recent work highlights a modality gap in CLIP’s embedding space, where embeddings from different modalities remain partially separated rather than fully integrated. In this paper, we begin by analyzing the CLAP embedding space and identify a similar modality gap. We then propose a novel solution that combines a modality classifier with a Gradient Reversal Layer (GRL) to reduce this gap. Our experiments on CLIP and CLAP confirm that our approach reduces the modality gap while improving performance, even achieving new state-of-the-art (SOTA) results in text–audio retrieval.
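The core mechanism the abstract describes can be illustrated with a small sketch. A Gradient Reversal Layer is the identity in the forward pass but multiplies the incoming gradient by a negative factor in the backward pass, so a modality classifier trained on top of the embeddings pushes the encoder toward modality-indistinguishable representations. The following NumPy toy (hypothetical code, not the paper's implementation; `lam`, the weights, and the data are illustrative assumptions) shows the forward/backward behavior with a linear modality classifier:

```python
import numpy as np

# Hypothetical sketch of a Gradient Reversal Layer (GRL), not the paper's code.
# Forward: identity. Backward: multiply the incoming gradient by -lambda.

def grl_forward(x):
    return x  # identity in the forward pass

def grl_backward(grad_output, lam=1.0):
    return -lam * grad_output  # reversed gradient flowing to the encoder

# Toy setup: 2-D "embeddings" and a linear modality classifier.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 2))       # stand-in for encoder outputs
w = rng.normal(size=(2,))           # modality-classifier weights (assumed)
labels = np.array([0, 1, 0, 1])     # 0 = text, 1 = audio (illustrative)

# Forward pass: GRL (identity), then logistic modality classifier.
h = grl_forward(emb)
logits = h @ w
probs = 1.0 / (1.0 + np.exp(-logits))

# Backward pass: gradient of the binary cross-entropy loss w.r.t. the
# classifier input h (mean over the batch).
grad_h = np.outer(probs - labels, w) / len(labels)

# The GRL flips the sign before the gradient reaches the encoder, so the
# encoder is updated to make the modalities harder to tell apart.
grad_encoder = grl_backward(grad_h, lam=0.5)
```

In a real training loop the reversal would be implemented as a custom autograd operation so the sign flip happens automatically during backpropagation.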