ISCA Archive Interspeech 2023

End to End Spoken Language Diarization with Wav2vec Embeddings

Jagabandhu Mishra, Jayadev N Patil, Amartya Chowdhury, Mahadeva Prasanna

Available end-to-end (E2E) spoken language diarization (LD) systems are biased toward the primary language. The bias arises from a shortage of secondary-language data: in code-switched (CS) utterances, the primary language occupies much more of the duration than the secondary language. To address this, this work first replaces the x-vector with pre-trained wav2vec (W2V) embeddings, which reduces the primary-language bias and yields a relative improvement of 30.7% in Jaccard error rate (JER) over the baseline x-vector based E2E (X-E2E) framework. LD performance is improved further by fine-tuning the W2V embedding extractor and changing the temporal aggregation strategy from statistical pooling to attention pooling. The final JER is 22.5, a relative improvement of 38.8% over the standalone fine-tuned W2V system and 62.6% over the baseline X-E2E framework.
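
The change of temporal aggregation from statistical pooling to attention pooling can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention design, so the single scoring vector `score_vector` below is a hypothetical stand-in for learned attention parameters, and the function names are illustrative. Statistical pooling treats every frame equally, while attention pooling lets some frames contribute more to the pooled vector.

```python
import math


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over time."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return mean + std


def attention_pooling(frames, score_vector):
    """Weighted mean over time, with softmax weights from per-frame scores.

    score_vector is an assumed, hypothetical parameter: each frame's score
    is its dot product with this vector.
    """
    scores = [sum(w * x for w, x in zip(score_vector, f)) for f in frames]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Two toy 2-dimensional frame embeddings (illustrative values only).
frames = [[1.0, 0.0], [3.0, 0.0]]
stats = statistics_pooling(frames)
pooled, weights = attention_pooling(frames, [1.0, 0.0])
```

With a zero scoring vector the attention weights are uniform and the pooled vector equals the plain mean; a non-zero scoring vector shifts the weight toward the higher-scoring frames.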