Convolution-augmented Transformer architectures have dominated the field of automatic speech recognition (ASR), showing better WER results when models are trained on relatively small amounts of data. In this work, we revisit the necessity of convolution modules in the ASR encoder architecture, given that the inductive bias introduced by convolution may only boost performance in a low-data regime. We show that with architectural improvements to the Transformer block, a convolution-free Transformer architecture (namely, Transformer++) can match the best Conformer WER results as we scale up the training data. Moreover, we demonstrate that with large-scale unsupervised pre-training, the proposed Transformer++ achieves even better WER than the best Conformer results. Importantly, Transformer++ delivers state-of-the-art performance with superior efficiency, showing a 40% improvement in CPU inference real-time factor (RTF) and a 25% GPU training speedup compared to Conformer.
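For context, the convolution module whose necessity is revisited here is the one interleaved with self-attention and feed-forward layers in the standard Conformer block (Gulati et al., 2020). The sketch below is a minimal PyTorch illustration of that baseline component only, not of the Transformer++ improvements proposed in this work; the module name, hyperparameters (e.g., kernel size 31), and layer choices are assumptions drawn from the published Conformer design.

```python
import torch
import torch.nn as nn

class ConformerConvModule(nn.Module):
    """Illustrative sketch of the standard Conformer convolution module.

    Follows the published design: pointwise conv -> GLU -> depthwise conv
    -> BatchNorm -> Swish -> pointwise conv, with a residual connection.
    Hyperparameters are assumptions, not values from this paper.
    """

    def __init__(self, d_model: int, kernel_size: int = 31, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Pointwise conv expands channels; GLU gates them back to d_model.
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise conv supplies the local inductive bias in question.
        self.depthwise = nn.Conv1d(
            d_model, d_model, kernel_size,
            padding=kernel_size // 2, groups=d_model,
        )
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.activation = nn.SiLU()  # Swish
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        residual = x
        x = self.norm(x).transpose(1, 2)   # (batch, d_model, time)
        x = self.glu(self.pointwise1(x))
        x = self.activation(self.batch_norm(self.depthwise(x)))
        x = self.dropout(self.pointwise2(x)).transpose(1, 2)
        return x + residual
```

A convolution-free Transformer block, by contrast, omits this module entirely and relies on self-attention and feed-forward layers alone, which is what makes the data-scaling comparison above meaningful.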