This paper proposed a procedure for detecting and recognizing mispronunciations in training data, and improved non-native acoustic modeling by training on the corrected phone alignments. First, an initial phone sequence for an utterance is derived from its word-level transcription and a dictionary of canonical pronunciations. Next, the region of mispronunciation is detected by examining phone-level goodness-of-pronunciation (GOP) scores. Then, over the detected region, a constrained phone decoder recognizes the most likely pronounced phone sequence among all phone sequences within one phone edit distance of the initial sequence. After the phone alignments and GOP scores are updated, this detection and recognition procedure is repeated until no further mispronunciation is detected. Experiments on a 300-hour non-native spontaneous speech dataset showed that the acoustic model trained with the proposed procedure reduced WER by 6% compared to a well-optimized context-dependent factorized-TDNN HMM baseline system with the same neural network topology. This work also offered a data-driven approach to generating a list of common mispronunciation patterns of non-native English learners, which may be useful for speech assessment purposes.
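
For concreteness, the iterative detection-and-recognition loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `compute_gop` and `score_sequence` are hypothetical hooks standing in for the forced-alignment-based GOP computation and the constrained phone decoder, respectively.

```python
from itertools import chain
from typing import Callable, List, Sequence


def one_edit_candidates(phones: Sequence[str], region: range,
                        phone_set: Sequence[str]) -> List[List[str]]:
    """Enumerate phone sequences within one edit (substitution, deletion,
    or insertion) of `phones`, restricted to the detected region.
    (Insertions after the final phone are omitted for brevity.)"""
    candidates = []
    for i in region:
        # Substitutions: replace phones[i] with every other phone.
        for p in phone_set:
            if p != phones[i]:
                candidates.append(list(phones[:i]) + [p] + list(phones[i + 1:]))
        # Deletion of phones[i].
        candidates.append(list(phones[:i]) + list(phones[i + 1:]))
        # Insertions before phones[i].
        for p in phone_set:
            candidates.append(list(phones[:i]) + [p] + list(phones[i:]))
    return candidates


def correct_pronunciation(
    utterance,                 # acoustic features for one utterance
    phones: List[str],         # canonical phone sequence from the lexicon
    phone_set: Sequence[str],
    compute_gop: Callable,     # hypothetical: (utterance, phones) -> per-phone GOP scores
    score_sequence: Callable,  # hypothetical: (utterance, phones) -> acoustic log-likelihood
    gop_threshold: float,
    max_iters: int = 10,
) -> List[str]:
    """Iteratively detect the low-GOP region and replace it with the
    best-scoring phone sequence within one phone edit distance."""
    for _ in range(max_iters):
        gop = compute_gop(utterance, phones)
        suspects = [i for i, g in enumerate(gop) if g < gop_threshold]
        if not suspects:
            break  # no remaining mispronunciation detected
        # Restrict the constrained decoding to the detected low-GOP region.
        region = range(min(suspects), max(suspects) + 1)
        candidates = one_edit_candidates(phones, region, phone_set)
        best = max(chain([list(phones)], candidates),
                   key=lambda seq: score_sequence(utterance, seq))
        if best == list(phones):
            break  # the decoder prefers the current sequence; stop
        phones = best  # update the alignment, then re-score in the next pass
    return phones
```

Restricting each pass to candidates within one edit distance keeps the constrained decoder's search space small, while the outer loop can still accumulate multiple corrections across iterations, consistent with repeating the procedure until no more mispronunciation is detected.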