ISCA Archive Interspeech 2025

Leveraging Unlabeled Audio for Audio-Text Contrastive Learning via Audio-Composed Text Features

Tatsuya Komatsu, Hokuto Munakata, Yuchi Ishikawa

We propose a novel approach to audio-text contrastive learning that leverages unlabeled audio by introducing audio-composed text features. First, we generate composed audio by additively combining labeled and unlabeled audio. To obtain a text feature aligned with this newly composed audio, we introduce an audio-to-text (a2t) module that transforms the features of the unlabeled audio into a corresponding text feature. This generated text feature is then concatenated with the original text of the labeled audio and passed through a text encoder to produce the audio-composed text features. By pairing these features with the composed audio for contrastive learning, our approach effectively integrates information from both labeled and unlabeled data. In audio-text retrieval experiments on Clotho and AudioCaps, the proposed method achieves notable improvements in Recall@1, with relative gains of 9.3% and 13.6%, respectively, over models trained solely on labeled audio.
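The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is a hypothetical stand-in: the a2t module is a toy projection, the text encoder is a mean-pool, and the feature dimensions, mixing scheme, and InfoNCE-style loss are generic choices not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical feature dimension

def a2t(audio_feat, W):
    # toy stand-in for the paper's audio-to-text (a2t) module:
    # a single learned projection from audio-feature to text-feature space
    return np.tanh(W @ audio_feat)

def text_encoder(token_feats):
    # toy stand-in text encoder: mean-pool token features, then L2-normalize
    pooled = token_feats.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def info_nce(audio_emb, text_emb, temperature=0.07):
    # generic symmetric-free InfoNCE over matched audio/text rows
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()

# toy inputs: a labeled clip with caption token features, plus an unlabeled clip
labeled_audio = rng.normal(size=DIM)
unlabeled_audio = rng.normal(size=DIM)
caption_tokens = rng.normal(size=(4, DIM))

# 1) compose audio by additively mixing labeled and unlabeled audio
composed_audio = labeled_audio + unlabeled_audio

# 2) map the unlabeled audio's features to a pseudo text feature via a2t
W = rng.normal(size=(DIM, DIM)) * 0.1  # hypothetical a2t weights
pseudo_text = a2t(unlabeled_audio, W)

# 3) concatenate the pseudo text feature with the caption tokens and encode
composed_text = text_encoder(np.vstack([caption_tokens, pseudo_text]))

# 4) pair composed audio with audio-composed text in a contrastive batch
audio_batch = np.stack([composed_audio, labeled_audio])
text_batch = np.stack([composed_text, text_encoder(caption_tokens)])
loss = info_nce(audio_batch, text_batch)
```

The sketch only shows the data flow (mix, a2t, concatenate, encode, contrast); in practice the a2t module and encoders would be trained networks and the batch would contain many pairs.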