In this paper, we propose an end-to-end (E2E) dialect identification
system trained using transfer learning from a multilingual automatic
speech recognition (ASR) model. This system extends the one we submitted
to the Oriental Language Recognition Challenge 2020 (AP20-OLR).
We verified its applicability on the dialect identification (DID)
task of the AP20-OLR. First, we trained a robust conformer-based joint
connectionist temporal classification (CTC)/attention multilingual
E2E ASR model on the training corpora of eight languages, which are
independent of the target dialects. Second, using transfer learning, we
initialized the E2E classifier with the shared encoder of the ASR
model. Finally, we trained the classifier on the target dialect
corpus. We obtained the final classifier by selecting the better of
two models: (1) the model averaged over checkpoints selected by loss
value, and (2) the model averaged over checkpoints selected by
classification accuracy.
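The encoder transfer and the final model selection can be illustrated with a minimal Python sketch. It rests on several assumptions that the text does not specify: parameters are stored as plain name-to-values mappings (a stand-in for a framework's state dict), encoder parameters share the hypothetical name prefix `encoder.`, each averaged model uses the hypothetical top `n` checkpoints, and the better of the two averaged models is chosen by a hypothetical development-set scoring function `evaluate`.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a state dict: parameter name -> flat list of floats.
StateDict = Dict[str, List[float]]


def init_classifier_from_asr(classifier: StateDict, asr: StateDict,
                             prefix: str = "encoder.") -> StateDict:
    """Copy ASR parameters whose names start with `prefix` into the classifier.

    Parameters outside the (hypothetically named) shared encoder, such as the
    ASR decoder or the classifier's output layer, are left untouched.
    """
    out = {name: list(values) for name, values in classifier.items()}
    for name, values in asr.items():
        if name.startswith(prefix) and name in out:
            out[name] = list(values)
    return out


def average_checkpoints(checkpoints: Sequence[StateDict]) -> StateDict:
    """Element-wise mean of every parameter across the given checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(values))]
        for name, values in checkpoints[0].items()
    }


def top_n(history: Sequence[dict], key: str, n: int,
          higher_is_better: bool) -> List[StateDict]:
    """Return the parameters of the n checkpoints with the best `key` value."""
    ranked = sorted(history, key=lambda h: h[key], reverse=higher_is_better)
    return [h["params"] for h in ranked[:n]]


def select_final_model(history: Sequence[dict],
                       evaluate: Callable[[StateDict], float],
                       n: int = 3) -> StateDict:
    """Build a loss-averaged and an accuracy-averaged model; keep the better one.

    `evaluate` is a hypothetical scorer where a higher value is better.
    """
    by_loss = average_checkpoints(top_n(history, "val_loss", n, False))
    by_acc = average_checkpoints(top_n(history, "val_acc", n, True))
    return max((by_loss, by_acc), key=evaluate)
```

Averaging several good checkpoints usually smooths out fluctuations between epochs; keeping both a loss-based and an accuracy-based average hedges against the two criteria disagreeing.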
Our experiments on
the DID test set of the AP20-OLR demonstrated significant identification
improvements for three Chinese dialects. Our system outperforms that of
the winning team of the AP20-OLR, achieving relative reductions of up to
19.5% in Cavg and 25.2% in EER.
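The results are stated as relative reductions in Cavg and equal error rate (EER). The minimal Python sketch below is an illustration, not the official AP20-OLR scoring tool: it approximates EER from target and non-target trial scores by sweeping the decision threshold, and computes a relative reduction from a baseline value and an improved value. Cavg, the challenge's average detection cost, is not reproduced here, and any scores passed to these functions are assumed inputs, not data from the paper.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score >= threshold. The false-rejection rate
    (FRR) is the fraction of target scores below the threshold; the
    false-acceptance rate (FAR) is the fraction of non-target scores at or
    above it. Returns the mean of FAR and FRR where they are closest.
    """
    best_gap, best_eer = None, None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def relative_reduction(baseline: float, system: float) -> float:
    """Fractional improvement of `system` over `baseline` (lower is better)."""
    return (baseline - system) / baseline
```

For example, a drop in EER from a hypothetical 10.0% to 8.0% is a relative reduction of (10.0 - 8.0) / 10.0 = 20%.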