ISCA Archive Interspeech 2025

Contrastive Learning-based Syllable-Level Mispronunciation Detection and Diagnosis for Speech Audiometry

Longbin Jin, Donghun Min, Jung Eun Shin, Eun Yi Kim

Speech audiometry, which assesses hearing disorders, typically relies on audiologists, making the process subjective and dependent on in-person evaluation. In this paper, we introduce SylPh, a novel automatic syllable-level mispronunciation detection and diagnosis (MDD) model that generalizes across open-set syllables while also offering phonemic analysis. To capture a wide range of mispronunciation patterns, we construct positive and pseudo-negative bags to extract in-distribution and out-of-distribution features from input audio. Our model aligns audio features with adaptive text embeddings using a contrastive objective, dynamically adjusting the decision boundary for each syllable within a single model. Extensive experiments on a large-scale dataset demonstrate its effectiveness on both closed-set and open-set syllables. Notably, despite being trained only on syllable-level labels, SylPh can localize phoneme-level abnormalities, providing detailed diagnostic insights.
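To make the idea of aligning audio features with a text embedding more concrete, the following is a minimal illustrative sketch, not the SylPh implementation. It assumes that each syllable has a single text embedding, that the positive and pseudo-negative bags are plain lists of audio feature vectors, and that alignment is scored by cosine similarity with a sigmoid-based binary contrastive loss. The function names, the temperature, and the threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bag_contrastive_loss(text_emb, positive_bag, negative_bag, temperature=0.1):
    """Hypothetical binary contrastive loss over two bags of audio features.

    Features in the positive bag are pulled toward the syllable's text
    embedding; features in the pseudo-negative bag are pushed away from it.
    """
    total = 0.0
    for feat in positive_bag:
        total += -math.log(sigmoid(cosine(feat, text_emb) / temperature))
    for feat in negative_bag:
        # 1 - sigmoid(x) == sigmoid(-x)
        total += -math.log(sigmoid(-cosine(feat, text_emb) / temperature))
    return total / (len(positive_bag) + len(negative_bag))


def is_mispronounced(audio_feat, text_emb, threshold=0.5):
    """Flag an utterance whose similarity to the text embedding is low.

    The fixed threshold is an assumption made for this sketch only.
    """
    return cosine(audio_feat, text_emb) < threshold


# Toy 2-D example with made-up vectors.
text_emb = [1.0, 0.0]
positive_bag = [[0.9, 0.1], [0.8, 0.2]]
negative_bag = [[0.0, 1.0], [-0.5, 0.5]]

loss_aligned = bag_contrastive_loss(text_emb, positive_bag, negative_bag)
loss_swapped = bag_contrastive_loss(text_emb, negative_bag, positive_bag)
```

In this toy setting the loss is lower when the positive bag lies close to the text embedding than when the two bags are swapped, which is the behavior a contrastive objective of this kind is meant to encourage. Because the similarity is always computed against the embedding of the target syllable, the same scoring function yields a different effective boundary for each syllable.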