Inspired by EfficientTTS, a recently proposed speech synthesis model, we propose a new way to train end-to-end speech recognition models with an additional training objective that allows the models to learn monotonic alignments effectively and efficiently. The introduced training objective is differentiable, computationally cheap and, most importantly, imposes no constraints on the network architecture, so it can be conveniently incorporated into any speech recognition model. Through extensive experiments, we observe that our models significantly outperform baseline models. Specifically, our best-performing model achieves a word error rate (WER) of 3.18% on the LibriSpeech test-clean benchmark and 8.41% on test-other. Compared with a strong baseline obtained with WeNet, the proposed model achieves a 7.6% relative WER reduction on test-clean and 6.9% on test-other.