ISCA Archive Interspeech 2024

A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech

Loukas Ilias, Dimitris Askounis

Depression is a serious mood disorder that affects how people feel and carry out daily activities. Speech is a reliable biomarker for diagnosing depression, since depressed people exhibit decreased verbal activity and productivity and “lifeless”-sounding speech. Existing methods employ unimodal models; use early, intermediate, or late fusion strategies to combine the different modalities; rely on feature extraction; and evaluate their approaches only on the English language. This study presents a new method for identifying depression from spontaneous speech in the Italian language, which uses a cross-attention layer to capture cross-modal interactions, followed by a variety of multimodal fusion methods. We also employ a multi-task learning framework to explore whether predicting age, education level, and gender helps in recognizing depression. Findings show that our approach offers multiple advantages over existing approaches, reaching an Accuracy of up to 95.29%.
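To make the idea of a cross-attention layer followed by a fusion step concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes two generic modality sequences (`modality_a`, `modality_b`) with toy values, omits the learned query/key/value projections a trained layer would have, and uses mean pooling plus concatenation as a stand-in for the fusion methods studied in the paper. All function names are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query vector (one modality)
    # attends over the key/value vectors of the other modality.
    # Assumption: no learned projection matrices, for brevity.
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended

def mean_pool(seq):
    # Average a sequence of vectors into a single vector.
    n = len(seq)
    return [sum(row[j] for row in seq) / n for j in range(len(seq[0]))]

def fuse_by_concatenation(a, b):
    # Cross-attend in both directions, pool each result, then concatenate.
    # Concatenation is only one simple fusion choice (an assumption here).
    a_to_b = cross_attention(a, b, b)
    b_to_a = cross_attention(b, a, a)
    return mean_pool(a_to_b) + mean_pool(b_to_a)

# Toy inputs (hypothetical): 3 time steps for one modality, 2 for the other.
modality_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
modality_b = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse_by_concatenation(modality_a, modality_b)
print(len(fused))
```

The fused vector would then feed a classifier head. In a multi-task setup, additional heads for the auxiliary targets could share the same fused representation.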