We consider the challenging problem of LARCQ (Long Audio Retrieval with Complex Queries), in which the audios to retrieve may be arbitrarily long and queries span multiple events. To solve it, we propose a novel pipeline that systematically integrates multi-modal retrieval with ALM/LLM refining. In Steps 1 and 2, we introduce a chunking-aggregation method that retrieves candidate audios by constructing a similarity matrix. In Steps 3 and 4, ALMs generate captions for the retrieved candidates, and the final audio is selected by comparing the query with the generated captions using text LLMs/classifiers. Due to the lack of benchmarks for LARCQ, we introduce Clotho-LARCQ and SoundDescs-LARCQ, which feature long audios and complex queries. Our chunking-aggregation method achieves up to 67% R@1 and 40% R@5 gains. With ALM/LLM refining incorporated, our full pipeline achieves 21% R@1 on the original Clotho benchmark and up to 100% R@1 improvement on the new LARCQ benchmarks.
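
The chunking-aggregation idea in Steps 1 and 2 can be illustrated with a minimal sketch. It assumes that each long audio has already been split into chunks and each complex query into sub-queries, and that both have been embedded into a shared space by a text-audio encoder that is not shown here. All function names are hypothetical, and the aggregation rule (best-matching chunk per sub-query, then the mean over sub-queries) is an assumption chosen for illustration rather than the rule used in the pipeline itself.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def similarity_matrix(query_parts, audio_chunks):
    """Return a (num_query_parts, num_audio_chunks) cosine-similarity matrix."""
    q = l2_normalize(np.asarray(query_parts, dtype=float))
    a = l2_normalize(np.asarray(audio_chunks, dtype=float))
    return q @ a.T


def aggregate_score(sim):
    """Assumed rule: best chunk for each sub-query, averaged over sub-queries."""
    return float(sim.max(axis=1).mean())


def rank_audios(query_parts, audio_library, top_k=5):
    """Rank audio ids by aggregated score and keep the top_k candidates."""
    scores = {
        audio_id: aggregate_score(similarity_matrix(query_parts, chunks))
        for audio_id, chunks in audio_library.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy embeddings standing in for real encoder outputs (illustrative only).
toy_query_parts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # two events in one query
toy_library = {
    "audio_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # both events
    "audio_b": [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],                   # one event
    "audio_c": [[0.0, 0.0, 1.0]],                                    # neither event
}

print(rank_audios(toy_query_parts, toy_library, top_k=2))  # ['audio_a', 'audio_b']
```

In this sketch, an audio scores highly only when every sub-query finds a well-matching chunk somewhere in it, which is one plausible way to reward long audios that cover all events in a complex query. The candidates returned by such a ranking would then be passed to the captioning and refining steps.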