ISCA Archive Interspeech 2025

On Retrieval of Long Audios with Complex Text Queries

Ruochu Yang, Milind Rao, Harshavardhan Sundar, Anirudh Raju, Aparna Khare, Srinath Tankasala, Di He, Venkatesh Ravichandran

We consider the challenging problem of LARCQ (Long Audio Retrieval with Complex Queries), where the audios to retrieve may be arbitrarily long and queries span multiple events. To solve it, we propose a novel pipeline that systematically integrates multi-modal retrieval with ALM/LLM refining. In Steps 1 and 2, we introduce a chunking-aggregation method that retrieves candidate audios by constructing a similarity matrix. In Steps 3 and 4, ALMs generate audio captions for the retrieved candidates, and the final audio is selected by comparing the query with the generated captions using text LLMs/classifiers. Because no benchmarks exist for LARCQ, we introduce Clotho-LARCQ and SoundDescs-LARCQ, which feature long audios and complex queries. Our chunking-aggregation method achieves up to 67% R@1 and 40% R@5 gains. Incorporating ALM/LLM refining, our full pipeline achieves 21% R@1 on the original Clotho benchmark and up to 100% R@1 improvement on the new LARCQ benchmarks.
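The chunking-aggregation idea in Steps 1 and 2 can be illustrated with a minimal sketch. The abstract does not specify how the similarity matrix is built or aggregated, so everything below is an assumption made for illustration: the query is taken to be split into per-event parts, each audio into chunks, the matrix holds cosine similarities between part and chunk embeddings, and the aggregation rule is "best chunk per query part, averaged over parts". The embeddings are toy 2-D vectors standing in for a real text/audio encoder, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(query_parts: List[Vector], chunks: List[Vector]) -> List[List[float]]:
    """Rows index query parts, columns index audio chunks."""
    return [[cosine(q, c) for c in chunks] for q in query_parts]


def aggregate(matrix: List[List[float]]) -> float:
    """Assumed rule: take the best-matching chunk for each query part (max),
    then average over query parts. The paper's actual rule may differ."""
    if not matrix or not matrix[0]:
        return 0.0
    return sum(max(row) for row in matrix) / len(matrix)


def retrieve(query_parts: List[Vector],
             audio_chunks: Dict[str, List[Vector]],
             k: int) -> List[str]:
    """Return the ids of the k audios with the highest aggregated score."""
    scores = {
        audio_id: aggregate(similarity_matrix(query_parts, chunks))
        for audio_id, chunks in audio_chunks.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy embeddings (hypothetical): axis 0 ~ "event 1", axis 1 ~ "event 2".
query_parts = [[1.0, 0.0], [0.0, 1.0]]
audio_chunks = {
    "audio_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # contains both events
    "audio_b": [[1.0, 0.0], [1.0, 0.1]],              # mostly the first event
}
top = retrieve(query_parts, audio_chunks, k=1)
print(top)
```

In this toy setup the audio whose chunks cover both query events ranks first, which is the behavior a multi-event query needs from a long-audio retriever. The top-k candidates returned here would then be passed on to the captioning and caption-comparison stages of Steps 3 and 4.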