Just as we humans both perceive and produce information across multiple modalities, so should our generative AI models. However, the classic supervised approach to training multimodal systems requires parallel training data across all modalities simultaneously, which can be much scarcer than data from any individual modality. Foundation models and synthetic data offer a possible way to mitigate this problem. In this talk, I review recent work on the multimodal synthesis of human communication – specifically, generating speech audio and 3D motion (co-speech gestures) from text – and describe a straightforward method for creating synthetic data that improves the training of these models, as an example of how synthetic data can benefit multimodal GenAI.