This talk will describe a thread of research that starts with the use of synthetic speech to train speech recognition models and ends with the joint modeling of speech and text in multimodal foundation models. Along the way, I'll describe work that uses synthetic speech in self-supervised pretraining. That work serves as a transition into text injection for speech recognition. Finally, I'll describe how this line of research leads to a multimodal foundation model that can also perform speech synthesis (Virtuoso).