ISCA Archive Interspeech 2025

A Study on the Impact of Foundation Models on Automatic Depression Detection from Speech Signals

Bubai Maji, Monorama Swain, Shazia Nasreen, Debabrata Majumdar, Rajlakshmi Guha, Aurobinda Routray, Anders Søgaard

An automatic depression detection (ADD) system based on spoken language offers an opportunity to build practical, low-cost tools for detecting depressive symptoms early. However, limited data availability, privacy concerns, and the effort required for transcription pose significant challenges. Recent advances in foundation models, which can understand and process multimodal inputs, create opportunities to improve ADD systems. This study explores various speech foundation models to investigate their impact on ADD. We use Whisper and MMS for automatic transcription and integrate speech and text embeddings into a language model fine-tuned with low-rank adaptation (LoRA). In addition, we examine how fine-tuning strategies and prompt formats affect model performance. Experiments on English and Bengali datasets demonstrate the potential of our method for ADD, even with moderate-quality transcriptions. The best-performing speech and language foundation models outperform the baseline models on both datasets.
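For readers unfamiliar with LoRA, the following is a minimal sketch of the standard low-rank adaptation idea applied to a single linear layer: the pretrained weight W stays frozen, and only a small update (alpha / r) · B·A is trained. It is not the authors' implementation; the class name `LoRALinear`, the rank `r`, the scaling factor `alpha`, and the initialization scheme are illustrative assumptions, since the abstract does not state which layers were adapted or which hyperparameters were used.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical LoRA-wrapped linear layer: y = W x + (alpha / r) * B (A x).

    W: d_out x d_in, frozen pretrained weight (never updated here)
    A: r x d_in, trainable, small random initialization
    B: d_out x r, trainable, zero initialization, so the adapted layer
       initially behaves exactly like the frozen layer
    """

    def __init__(self, weight: Matrix, rank: int = 4, alpha: float = 8.0, seed: int = 0):
        self.weight = weight
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.rank = rank
        self.scaling = alpha / rank
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: Vector) -> Vector:
        # Frozen path plus the scaled low-rank correction.
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * d for b, d in zip(base, delta)]

    def trainable_params(self) -> int:
        # Only A and B are trained: r * (d_in + d_out) values instead of d_in * d_out.
        return self.rank * (len(self.weight[0]) + len(self.weight))


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    layer = LoRALinear(W, rank=2, alpha=4.0)
    print(layer.forward([1.0, 2.0, 3.0]))  # equals W x while B is still zero
    print(layer.trainable_params())
```

The zero initialization of B is the usual design choice: fine-tuning starts from the pretrained model's behavior, and the number of trainable parameters grows with the rank r rather than with the full weight size, which is what makes this kind of adaptation practical on small clinical speech datasets.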