ISCA Archive Interspeech 2023

Knowledge Distillation on Joint Task End-to-End Speech Translation

Khandokar Md. Nayem, Ran Xue, Ching-Yun Chang, Akshaya Vishnu Kudlu Shanbhogue

An End-to-End Speech Translation (E2E-ST) model takes input audio in one language and directly produces output text in another language. The model must learn both the speech-to-text modality conversion and the translation task, and effective learning of this joint task demands a large architecture. Yet, to the best of our knowledge, we are the first to optimize the compression of E2E-ST models. In this work, we explore knowledge distillation for a cross-modality joint-task E2E-ST system along three dimensions: 1) the student architecture and weight initialization scheme, 2) the importance of loss terms associated with different tasks and data modalities, and 3) a knowledge distillation training scheme customized for the multi-task/multi-module model. Compared with the full-size model, our compressed model's encoder and decoder are 50% smaller, while it retains 90% and over 95% of the performance on the speech translation and machine translation tasks, respectively, on the MUST-C en→de test set.
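To give a concrete sense of how per-task, per-modality loss terms can be weighted in such a distillation setup, the sketch below shows a generic joint-task knowledge distillation objective in PyTorch. It is a minimal illustration under assumed names and weighting choices (the function name, temperature, and weights `w_st`/`w_mt` are illustrative assumptions), not the paper's exact formulation.

```python
import torch.nn.functional as F

def joint_kd_loss(student_st_logits, teacher_st_logits, st_labels,
                  student_mt_logits, teacher_mt_logits, mt_labels,
                  temperature=2.0, alpha=0.5, w_st=1.0, w_mt=1.0):
    """Illustrative joint-task distillation objective: a weighted sum of
    hard-label cross-entropy and soft-label KL terms for the speech
    translation (ST) and machine translation (MT) sub-tasks."""

    def task_loss(s_logits, t_logits, labels):
        # Hard-label loss against the reference translations.
        ce = F.cross_entropy(s_logits, labels)
        # Soft-label loss against the teacher's temperature-scaled distribution.
        kl = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        return alpha * kl + (1.0 - alpha) * ce

    # Per-task weights control the relative importance of the ST and MT terms.
    return (w_st * task_loss(student_st_logits, teacher_st_logits, st_labels)
            + w_mt * task_loss(student_mt_logits, teacher_mt_logits, mt_labels))
```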