Recent advances in automatic speech recognition (ASR) based on end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impede the effective use of language models or powerful context representations. Applying these methods changes the nature of the resulting transcripts rather than improving overall recognition performance. In this work, we analyze the diversity of transcripts produced by distinct ASR systems for children's speech and exploit it with a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoder models, and Wav2Vec 2.0, which provides powerful context representations. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition over the best single system.
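To illustrate the idea of exploiting transcript diversity, the following is a minimal sketch of one common combination strategy: position-wise majority voting over phone sequences that have already been aligned to equal length. The function name and the alignment assumption are hypothetical; the paper's actual combination scheme may operate on confusion networks or weighted lattices instead.

```python
from collections import Counter

def combine_hypotheses(hyps):
    """Position-wise majority vote over aligned phone sequences.

    hyps: list of equal-length phone-label lists, one per ASR system
          (alignment to equal length is assumed to have been done already).
    Ties are broken deterministically in favor of the earliest-listed system.
    """
    assert len({len(h) for h in hyps}) == 1, "sequences must be aligned to equal length"
    combined = []
    for labels in zip(*hyps):
        counts = Counter(labels)
        best = max(counts.values())
        # pick the first label (in system order) that attains the top count
        combined.append(next(l for l in labels if counts[l] == best))
    return combined

# Three systems disagree in two positions; the vote recovers the majority label.
hyps = [
    ["k", "ae", "t"],
    ["k", "eh", "t"],
    ["k", "ae", "d"],
]
print(combine_hypotheses(hyps))  # ['k', 'ae', 't']
```

The vote only helps when the systems make *different* errors, which is exactly why combining decoders with varying degrees of context pays off.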