ISCA Archive SynData4GenAI 2024

From synthetic data to multimodal foundation models

Andrew Rosenberg

This talk will describe a thread of research that starts with the use of synthetic speech to train speech recognition models and ends with the joint modeling of speech and text in multimodal foundation models. Along the way, I'll describe work that uses synthetic speech in self-supervised pretraining. This work serves as a transition into text injection for speech recognition. Finally, I'll describe how this line of work leads to Virtuoso, a multimodal foundation model that can also perform speech synthesis.