Automatic Speech Recognition (ASR) model training requires large amounts of paired data, i.e. audio/text pairs. However, such paired data is expensive to collect and annotate, whereas unpaired text data is far more readily available. With continual improvements in speech synthesis models, we can now generate natural-sounding speech from large amounts of unpaired text. In this paper, we use the Voicebox model for speech synthesis. First, we assess synthetic speech quality by measuring how much synthetic speech is required to match the ASR performance obtained with real speech. We find that in noisy settings, 10 times more synthetic data than real data is required to achieve equal performance, whereas in clean settings only 7 times more is needed. Second, we examine the ASR improvements contributed by the acoustic variability of the synthesized speech and the lexical variability of the unpaired text. We find that having both acoustic and lexical variability is better than either one individually. With smaller amounts of unpaired text, lexical variability is on average more beneficial than acoustic variability; however, acoustic variability becomes more important as the amount of unpaired text increases.