In this paper, we describe our children’s Automatic Speech Recognition (ASR) system for the first shared task on ASR for English non-native children’s speech. The acoustic model comprises 6 Convolutional Neural Network (CNN) layers and 12 Factored Time-Delay Neural Network (TDNN-F) layers, trained on data from 5 different children’s speech corpora. Speed perturbation, Room Impulse Response (RIR), babble noise, and non-speech noise data augmentation methods were used to improve model robustness. Three Language Models (LMs) were employed: an in-domain LM trained on written data and speech transcriptions of non-native children; an LM trained on non-native written data and transcriptions of both native and non-native children’s speech; and a TEDLIUM LM trained on transcriptions of adult TED talks. Lattices produced by the different ASR systems were combined and decoded using the Minimum Bayes-Risk (MBR) decoding algorithm to obtain the final output. Our system achieved a final Word Error Rate (WER) of 17.55% on the development set and 16.59% on the test set, and ranked second among the 10 teams participating in the task.
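
For readers unfamiliar with the evaluation metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcription and the recognizer output, divided by the number of reference words. The following is a minimal illustrative sketch of that standard definition, not the scoring tool used in the shared task; the function name `word_error_rate` and the whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER of `hypothesis` against `reference`.

    Assumption: both strings are already normalized and are
    tokenized by splitting on whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref_words)
```

Under this definition, a hypothesis with one substituted and one deleted word against a six-word reference gives a WER of 2/6, or about 33.3%.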