ISCA Archive Interspeech 2023

Audio Retrieval with WavText5K and CLAP Training

Soham Deshmukh, Benjamin Elizalde, Huaming Wang

Text-based audio retrieval takes a natural language query and retrieves the relevant audio files from a database. Most retrieval models are trained, optimized, and evaluated on a single dataset. In this paper, we quantify the effect of adding training data by training on three datasets, and we measure the effect on performance by evaluating the same model on two evaluation datasets. For our study, first, we introduce a new collection of about 5000 audio-text pairs called WavText5K. We qualitatively show how WavText5K differs from existing audio-text datasets and quantitatively show its effectiveness for retrieval. Our results show that adding more audio-text pairs does not necessarily improve performance. Second, we compare two effective audio encoders: CNNs and audio transformers. We propose an architecture that combines both encoders and show that the combination improves on the performance of either model alone. Overall, using WavText5K together with the proposed encoder combination outperforms the benchmark for AudioCaps and Clotho by 6% and 23%, respectively.
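To make the retrieval setup concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of how a CNN branch and an audio-transformer branch could be fused into a single audio embedding in a CLAP-style shared space, and how a text query embedding would then rank audio clips by cosine similarity. The module names, dimensions, and the simple additive fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch: fusing two audio encoders for text-to-audio retrieval.
# Dimensions and the additive fusion are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CombinedAudioEncoder(nn.Module):
    """Projects embeddings from a CNN branch and a transformer branch
    into a shared audio-text space and fuses them."""

    def __init__(self, cnn_dim=2048, trf_dim=768, joint_dim=512):
        super().__init__()
        # Placeholders for the outputs of two pretrained audio encoders
        # (e.g. a CNN and an audio transformer); only the projections are shown.
        self.proj_cnn = nn.Linear(cnn_dim, joint_dim)
        self.proj_trf = nn.Linear(trf_dim, joint_dim)

    def forward(self, cnn_emb, trf_emb):
        # Project each branch and sum (one simple fusion choice), then
        # L2-normalize so cosine similarity reduces to a dot product.
        z = self.proj_cnn(cnn_emb) + self.proj_trf(trf_emb)
        return F.normalize(z, dim=-1)


def retrieve(text_emb, audio_embs, top_k=5):
    """Rank audio clips by cosine similarity to a text query embedding."""
    text_emb = F.normalize(text_emb, dim=-1)
    sims = audio_embs @ text_emb          # (num_audio,)
    return sims.topk(top_k).indices


# Toy usage with random tensors standing in for real encoder outputs.
encoder = CombinedAudioEncoder()
audio_embs = encoder(torch.randn(100, 2048), torch.randn(100, 768))
text_emb = torch.randn(512)               # stand-in for a text encoder output
print(retrieve(text_emb, audio_embs))
```

In an actual CLAP-style system, both the audio and text projections would be trained jointly with a contrastive loss over audio-text pairs; the snippet above only illustrates the retrieval-time similarity ranking.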