ISCA Archive Interspeech 2024

GPA: Global and Prototype Alignment for Audio-Text Retrieval

Yuxin Xie, Zhihong Zhu, Xianwei Zhuang, Liming Liang, Zhichang Wang, Yuexian Zou

Recent Audio-Text Retrieval (ATR) models have made steady progress by modeling semantic interaction between paired audio and text. Moving beyond this coarse-grained global interaction requires fine-grained cross-modal learning between audio and text, which is considerably more challenging. In this paper, we present GPA, an ATR method that performs both Global (coarse-grained) and Prototype (fine-grained) Alignment. Specifically, in addition to the vanilla global contrast between audio and text pairs, we model the frames in audio and the words in text as prototypes and align them to generate a prototype similarity matrix. On top of this matrix, we introduce a Learnable Attention Similarity Scoring module, which takes the information across different prototype pairs into account to obtain the retrieval score. Finally, we apply the Sinkhorn-Knopp algorithm to refine the retrieval score. Experimental results on two benchmark datasets show superior performance, supporting the efficacy of the proposed GPA.
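
The three steps named in the abstract (prototype similarity matrix, attention-based scoring, Sinkhorn-Knopp refinement) can be illustrated with a minimal sketch. It is not the GPA implementation, since the abstract does not specify one. The function names, the use of cosine similarity, the temperature values, and the fixed softmax pooling that stands in for the learnable scoring module are all assumptions made for illustration.

```python
import math


def cosine_matrix(frames, words):
    """Cosine similarity between every frame vector and every word vector.

    Assumption: prototypes are plain feature vectors compared by cosine.
    """
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [[sum(a * b for a, b in zip(f, w)) / (norm(f) * norm(w))
             for w in words] for f in frames]


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a flat list."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_score(sim, temperature=0.1):
    """Pool a prototype similarity matrix into one scalar score.

    Assumption: a fixed softmax weighting replaces the learnable module,
    so stronger prototype pairs contribute more to the score.
    """
    flat = [s for row in sim for s in row]
    weights = softmax(flat, temperature)
    return sum(w * s for w, s in zip(weights, flat))


def sinkhorn_knopp(scores, temperature=1.0, n_iters=50):
    """Alternately normalise rows and columns of exp(scores / temperature).

    For a square matrix the result approaches a doubly stochastic matrix.
    """
    m = max(max(row) for row in scores)
    k = [[math.exp((s - m) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        k = [[v / sum(row) for v in row] for row in k]
        col_sums = [sum(row[j] for row in k) for j in range(len(k[0]))]
        k = [[v / col_sums[j] for j, v in enumerate(row)] for row in k]
    return k


# Toy usage: one audio clip (2 frames) scored against one caption (2 words),
# then a hypothetical 2x2 audio-by-text score matrix refined with Sinkhorn-Knopp.
pair_score = attention_score(cosine_matrix([[1.0, 0.0], [0.0, 1.0]],
                                           [[1.0, 0.0], [1.0, 1.0]]))
refined = sinkhorn_knopp([[0.9, 0.2], [0.3, 0.8]])
```

Sinkhorn-Knopp converges more slowly as the temperature shrinks and the exponentiated matrix becomes nearly diagonal, so a small temperature generally needs more iterations than the 50 used here.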