ISCA Archive Interspeech 2020

What Does an End-to-End Dialect Identification Model Learn About Non-Dialectal Information?

Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James Glass

An end-to-end dialect identification system generates the likelihood of each dialect given a speech utterance. Its performance relies on its ability to discriminate acoustic properties across dialects, even though the input signal also contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside an end-to-end dialect identification model. We design several proxy tasks to understand the model’s ability to represent speech input for differentiating non-dialectal information, such as (a) gender and voice identity of speakers, (b) languages, and (c) channel (recording and transmission) quality, and we compare them with dialectal information (i.e., predicting the geographic region of the dialects). By analyzing non-dialectal representations from the layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model act as feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.
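To make the probing setup concrete, the sketch below (in PyTorch) illustrates the general recipe implied by the abstract: freeze a trained ADI model, extract mean-pooled activations from a chosen layer, and fit a small linear classifier on a proxy task (gender, speaker, channel, language, or dialect region). The module names, pooling choice, and probe architecture here are illustrative assumptions, not the authors' released code or exact method.

```python
# Hypothetical probing sketch: the ADI model class, its layer names, and the
# linear-probe design are assumptions for illustration only.
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Small linear classifier trained on frozen layer activations."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        return self.fc(x)


def extract_layer_embedding(adi_model: nn.Module, utterance: torch.Tensor,
                            layer_name: str) -> torch.Tensor:
    """Run the frozen ADI model and return a fixed-size embedding taken
    from the named layer via a forward hook."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["act"] = output.detach()

    handle = dict(adi_model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        adi_model(utterance)
    handle.remove()

    act = captured["act"]
    # Mean-pool over the time axis (if present) so CNN and FC layers
    # both yield a fixed-size utterance-level embedding.
    return act.mean(dim=-1) if act.dim() == 3 else act


def train_probe(embeddings: torch.Tensor, labels: torch.Tensor,
                num_classes: int, epochs: int = 10, lr: float = 1e-3) -> LinearProbe:
    """Fit a linear probe on precomputed embeddings; its accuracy indicates
    how much proxy-task information the probed layer retains."""
    probe = LinearProbe(embeddings.size(1), num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(embeddings), labels)
        loss.backward()
        optimizer.step()
    return probe
```

Under this setup, comparing probe accuracy across layers and proxy tasks is what supports layer-wise statements such as "gender and channel information is retained throughout the network" versus "speaker identity becomes harder to recover in deeper layers."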