ISCA Archive Interspeech 2025
ISCA Archive Interspeech 2025

Speech and Text Foundation Models for Depression Detection: Cross-Task and Cross-Language Evaluation

Lucía Gómez-Zaragozá, Javier Marín-Morales, Mariano Alcañiz, Mohammad Soleymani

Automated depression detection is gaining attention due to its potential to improve psychiatric care. This study compares the performance of foundation models (FMs) in two datasets: an extended Distress Analysis Interview Corpus (DAIC+) in English, and Depressive Indicators during Casual Talks (DEPTALK) dataset in Spanish. HuBERT models and their fine-tuned versions for emotion recognition (ER) are used for speech. RoBERTa models and their ER variants are applied for text. Representations from FMs are grouped into context windows and processed by a Gated Recurrent Unit. Early fusion is used for multimodal analysis. Speech models perform similarly across datasets (F1≈0.60). Text models perform better on DAIC+ than on DEPTALK (F1=0.70 vs 0.45). Multimodal models using FMs fine-tuned for ER perform best for both (F1=0.75 in DAIC+, 0.69 in DEPTALK), showing effectiveness across tasks and languages. Fairness evaluation reveals gender bias, which motivates future research on its alleviation.