In this paper, we improve speech translation (ST) by effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training using the large Libri-Light speech audio corpus, and language modeling using CommonCrawl. Our experiments improve over the previous state of the art by 2.8 BLEU on average across all four considered CoVoST 2 language pairs, via a simple recipe combining wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model. Unlike existing work, our approach does not leverage any supervision other than ST data. Code and models are publicly released.
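To make the last ingredient of the recipe concrete, the sketch below illustrates shallow-fusion language-model decoding, i.e. combining per-step log-probabilities from the ST model and an external LM with an interpolation weight. This is a minimal illustration under our own assumptions (the function names and the lm_weight value are hypothetical), not the released implementation from the paper.

```python
# Minimal sketch (not the released code): shallow-fusion scoring that combines
# ST-model and LM log-probabilities at each decoding step. The function name
# `fused_score` and the default lm_weight are illustrative assumptions.
from typing import Dict


def fused_score(st_logprobs: Dict[str, float],
                lm_logprobs: Dict[str, float],
                lm_weight: float = 0.3) -> Dict[str, float]:
    """Per-token score: log P_ST(y | x, y_<t) + lm_weight * log P_LM(y | y_<t)."""
    return {
        tok: st_logprobs[tok] + lm_weight * lm_logprobs.get(tok, float("-inf"))
        for tok in st_logprobs
    }


# Toy usage: pick the next target token once LM evidence is folded in.
st = {"Hallo": -0.4, "Guten": -1.2}   # hypothetical ST-model log-probs
lm = {"Hallo": -0.9, "Guten": -0.5}   # hypothetical LM log-probs
best_token = max(fused_score(st, lm).items(), key=lambda kv: kv[1])[0]
print(best_token)
```

In a real beam-search decoder the same fused score would be applied to every hypothesis expansion at every step, with the LM weight tuned on a development set.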
Cite as: Wang, C., Wu, A., Pino, J., Baevski, A., Auli, M., Conneau, A. (2021) Large-Scale Self- and Semi-Supervised Learning for Speech Translation. Proc. Interspeech 2021, 2242-2246, doi: 10.21437/Interspeech.2021-1912
@inproceedings{wang21r_interspeech,
  author={Changhan Wang and Anne Wu and Juan Pino and Alexei Baevski and Michael Auli and Alexis Conneau},
  title={{Large-Scale Self- and Semi-Supervised Learning for Speech Translation}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={2242--2246},
  doi={10.21437/Interspeech.2021-1912}
}