ISCA Archive Interspeech 2025

Effects of Prosodic Information on Dialect Classification Using Whisper Features

Phoebe Parsons, Heming Strømholt Bremnes, Knut Kvale, Torbjørn Svendsen, Giampiero Salvi

In dialect identification (DID), a model must attend to subtle cues to distinguish between highly similar linguistic variants. However, little is known about which cues are important and why. Inspired by the literature on human DID, we fine-tuned a Whisper model on modified audio to see how removing various signal components would affect performance. Specifically, the audio manipulations sought to either isolate or remove (tonal) prosodic information, by low-pass filtering or by monotonizing F0, respectively. Results indicate that fine-tuning on low-pass filtered data produces a significant improvement over fine-tuning on unmodified data. Using sensitivity maps in the frequency domain, we argue that the low-pass model is able to devote more attention to lower frequency bands, thus exploiting task-relevant pitch dynamics. Although we evaluated only on Norwegian, we suggest that our methodology should generalize, encouraging improvement in DID and its downstream applications.
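To make the low-pass manipulation concrete, the following is a minimal sketch of low-pass filtering a waveform with a windowed-sinc FIR filter, written in plain Python. It is an illustration only and not the authors' pipeline: the abstract does not state the cutoff frequency, filter type, or tooling, so the function names (`design_lowpass`, `lowpass`), the 400 Hz cutoff, the 16 kHz sample rate, and the 101-tap Hamming design are all assumptions chosen for the example. Monotonizing F0 is not covered here, since it requires pitch resynthesis.

```python
import math


def design_lowpass(cutoff_hz, sample_rate, num_taps=101):
    """Return windowed-sinc (Hamming) low-pass FIR taps with unit gain at DC.

    All parameter values used with this function are illustrative assumptions.
    """
    fc = cutoff_hz / sample_rate      # cutoff in cycles per sample
    mid = (num_taps - 1) / 2          # centre of the symmetric filter
    taps = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # Hamming window to reduce stopband ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def lowpass(signal, cutoff_hz, sample_rate, num_taps=101):
    """Filter `signal` (a list of floats) and return a same-length output.

    The convolution is centred, so the output is not delayed; samples
    outside the signal are treated as zero.
    """
    taps = design_lowpass(cutoff_hz, sample_rate, num_taps)
    half = num_taps // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out


def rms(values):
    """Root-mean-square amplitude of a list of floats."""
    return math.sqrt(sum(v * v for v in values) / len(values))


if __name__ == "__main__":
    sr = 16000  # assumed sample rate
    # Synthetic test tones: 100 Hz (pitch-like range) and 3000 Hz.
    low = [math.sin(2 * math.pi * 100 * n / sr) for n in range(1600)]
    high = [math.sin(2 * math.pi * 3000 * n / sr) for n in range(1600)]
    # Compare RMS before and after filtering, away from the signal edges.
    print("100 Hz gain:", rms(lowpass(low, 400, sr)[200:1400]) / rms(low[200:1400]))
    print("3 kHz gain:", rms(lowpass(high, 400, sr)[200:1400]) / rms(high[200:1400]))
```

With these assumed settings, the 100 Hz tone should pass nearly unchanged while the 3 kHz tone is strongly attenuated. This is the sense in which low-pass filtering keeps pitch dynamics while discarding most of the higher-frequency spectral detail.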