ISCA Archive Interspeech 2025

M3L: A Multi-Modal and Multi-Lingual Depression Detection Framework

Jiajun You, Shuai Wang, Xun Gong, Xiang Wan

Early diagnosis is essential for reducing costs and improving treatment efficiency. Recently, automatic depression detection (ADD) based on audio and textual features from participant interviews has emerged as a promising approach and attracted significant attention. However, existing models are constrained to monolingual depression datasets, with limited exploration of multi-lingual scenarios. To investigate the effectiveness of multi-lingual data for the ADD task and its transferability to low-resource scenarios, we propose a Multi-Modal Multi-Lingual (M3L) depression detection framework, together with an effective language adaptive fine-tuning (LAFT) strategy that further boosts performance on the target language. M3L uses the pretrained speech model Whisper and the text model XLM-RoBERTa to enhance the encoding of multi-lingual information. Evaluations on the DAIC-WOZ (English) and EATD (Chinese) datasets demonstrate that M3L effectively integrates multi-lingual and multi-modal information, while the proposed LAFT consistently boosts performance across both datasets.
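The abstract does not specify how the two modality encoders are combined. A minimal sketch of one plausible design, late fusion by concatenation followed by a linear classification head, is shown below. All names and dimensions here are assumptions for illustration: in the real framework, Whisper would supply the audio embedding and XLM-RoBERTa the text embedding, whereas this sketch stands in random vectors of typical hidden sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed hidden sizes (illustrative, not taken from the paper):
AUDIO_DIM = 768  # stand-in for a Whisper encoder embedding
TEXT_DIM = 768   # stand-in for an XLM-RoBERTa embedding

def fuse_and_classify(audio_emb, text_emb, W, b):
    """Late fusion: concatenate modality embeddings, then apply a
    linear head with a sigmoid to produce a depression probability."""
    fused = np.concatenate([audio_emb, text_emb])  # shape: (AUDIO_DIM + TEXT_DIM,)
    logit = W @ fused + b
    return 1.0 / (1.0 + np.exp(-logit))

# Random stand-ins for encoder outputs and classifier weights.
audio_emb = rng.standard_normal(AUDIO_DIM)
text_emb = rng.standard_normal(TEXT_DIM)
W = rng.standard_normal(AUDIO_DIM + TEXT_DIM) * 0.01
b = 0.0

p = fuse_and_classify(audio_emb, text_emb, W, b)
print(f"predicted depression probability: {p:.3f}")
```

Because both encoders are multi-lingual by pretraining, the same fusion head could in principle be trained on one language's interviews and then adapted to another, which is the setting LAFT targets.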