ISCA Archive Interspeech 2025

Scaling Laws for Synthetic Speech for Model Training

Christoph Minixhofer, Ondřej Klejch, Peter Bell

We investigate how the scale of Text-to-Speech (TTS) models' training data influences Automatic Speech Recognition (ASR) performance when real training data is replaced entirely by synthetic speech. We propose an extension to established data scaling laws that incorporates an additional term capturing the mismatch between real and synthetic distributions in low-data regimes. We compare Mean Squared Error (MSE) and Denoising Diffusion Probabilistic Models (DDPMs) for TTS: MSE-based speech, though oversmoothed, provides stronger ASR results with smaller TTS datasets, while DDPM-based speech surpasses MSE once trained on enough data to better approximate the real distribution. Our findings also show that synthetic speech can approach or match real-data performance only if the TTS model itself is trained on a sufficiently large corpus, emphasizing that distribution coverage is crucial for fully synthetic ASR training.
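The abstract does not give the functional form of the extended scaling law, but the idea of a standard power-law data term plus an additional mismatch term that dominates in low-data regimes can be sketched as follows. This is a minimal illustrative fit, assuming a hypothetical form WER(D) = E_inf + A·D^(-alpha) + B·D^(-beta), with made-up data points; it is not the paper's actual model or results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical extended scaling law (illustrative assumption, not the
# paper's actual formula):
#   WER(D) ~ E_inf + A * D**(-alpha) + B * D**(-beta)
# D is the TTS training-data size; the A-term is the usual power-law
# data term, and the B-term stands in for the real/synthetic
# distribution mismatch, which dominates when D is small.
def extended_scaling_law(d, e_inf, a, alpha, b, beta):
    return e_inf + a * d ** (-alpha) + b * d ** (-beta)

# Made-up example: WERs "measured" at several TTS-corpus sizes (hours).
hours = np.array([10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0])
wer = np.array([63.8, 20.7, 14.2, 7.8, 6.7, 5.5])

# Fit the five free parameters; p0 is a rough initial guess.
params, _ = curve_fit(
    extended_scaling_law, hours, wer,
    p0=[5.0, 100.0, 0.5, 100.0, 1.0], maxfev=20000,
)
pred = extended_scaling_law(hours, *params)
print("fitted params:", np.round(params, 3))
print("max abs residual:", float(np.max(np.abs(pred - wer))))
```

With a fit like this, the irreducible-error term `e_inf` and the exponents can be compared across TTS training regimes; a large mismatch term at small `D` would mirror the abstract's point that distribution coverage limits fully synthetic ASR training.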