ISCA Archive Interspeech 2023

A Method of Audio-Visual Person Verification by Mining Connections between Time Series

Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

It has already been observed that audio-visual embeddings are more robust than uni-modal embeddings for person verification, but the relationship between keyframes across modalities over time remains largely unexplored. Hence, we propose a novel audio-visual strategy that models connections between time series from a generative perspective. First, we introduce weight-enhanced attentive statistics pooling to extend the salience of the keyframe weights. Then, we propose joint attentive pooling incorporating three popular generative supervision models. Finally, the two modalities are fused with a gated attention mechanism to obtain a robust embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.14%, 0.21%, and 0.37% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
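To make the two building blocks named in the abstract concrete, the following is a minimal NumPy sketch of (standard) attentive statistics pooling and a per-dimension gated fusion of audio and visual embeddings. All parameter names and shapes here are illustrative assumptions, not the paper's actual weight-enhanced formulation or architecture.

```python
import numpy as np

def attentive_stats_pooling(H, W, b, v, eps=1e-8):
    """Attentive statistics pooling over frame-level features (sketch).

    H: (T, D) frame embeddings; W (D, A), b (A,), v (A,) parameterize a
    small attention-scoring MLP (shapes are illustrative assumptions).
    Returns the (2*D,) concatenation of attention-weighted mean and std.
    """
    scores = np.tanh(H @ W + b) @ v           # one scalar score per frame, (T,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # softmax frame weights
    mu = alpha @ H                            # weighted mean, (D,)
    var = alpha @ (H ** 2) - mu ** 2          # weighted variance, (D,)
    sigma = np.sqrt(np.maximum(var, eps))     # weighted std, clamped for stability
    return np.concatenate([mu, sigma])

def gated_fusion(audio_emb, visual_emb, Wg, bg):
    """Gated fusion of audio and visual embeddings (sketch).

    A sigmoid gate in (0, 1)^D, computed from both modalities, softly
    selects per dimension between the audio and visual embedding.
    Wg (2*D, D) and bg (D,) are hypothetical gate parameters.
    """
    z = np.concatenate([audio_emb, visual_emb]) @ Wg + bg
    gate = 1.0 / (1.0 + np.exp(-z))           # sigmoid gate, (D,)
    return gate * audio_emb + (1.0 - gate) * visual_emb
```

In this sketch the standard deviation is clamped with `eps` before the square root, since the weighted variance can be numerically slightly negative; the gate lets the fused embedding lean on whichever modality is more reliable per dimension.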