Sub-sampling the input speech features shortens the sequence and thereby simplifies the alignment between speech features and the text transcript, which is an important way to improve end-to-end (E2E) automatic speech recognition (ASR) systems. Sequence length matters even more in Transformer-based ASR, because the self-attention mechanism has O(n^2) complexity in the sequence length n during both training and inference. In this paper, we propose a Transformer-based ASR model with a time-reduction layer: in addition to applying traditional sub-sampling to the input features, we incorporate a time-reduction layer inside the Transformer encoder layers to further reduce the frame rate. This reduces the computational cost of self-attention in training and inference while also improving performance. Moreover, we introduce a fine-tuning approach for pre-trained ASR models based on self-knowledge distillation (S-KD), which further improves the performance of our ASR model. Experiments on the LibriSpeech datasets show that our proposed methods outperform all other Transformer-based ASR systems. Furthermore, with language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models, using just 30 million parameters and no external training data.
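
To make the cost argument concrete, the following minimal Python sketch shows how halving the frame rate shrinks the self-attention score map by a factor of four. It assumes the time-reduction step concatenates adjacent frames, which is one common formulation; the abstract does not specify the exact mechanism. The names `time_reduce` and `attention_score_count` and the toy input are hypothetical, chosen only for illustration.

```python
from typing import List

Frame = List[float]


def time_reduce(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one wider frame.

    Assumption (not from the paper): trailing frames that do not fill a
    complete group are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    reduced: List[Frame] = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)
        reduced.append(merged)
    return reduced


def attention_score_count(seq_len: int) -> int:
    """Number of pairwise scores in one full self-attention map: n * n."""
    return seq_len * seq_len


# Hypothetical input: 8 frames with 3 features each.
frames = [[float(t)] * 3 for t in range(8)]
reduced = time_reduce(frames, factor=2)

print(len(frames), len(frames[0]))    # 8 3
print(len(reduced), len(reduced[0]))  # 4 6
print(attention_score_count(len(frames)) / attention_score_count(len(reduced)))  # 4.0
```

In this sketch the sequence length drops from 8 to 4 while each frame doubles in width, so the number of attention scores falls from 64 to 16. In a real encoder, a projection would typically map the widened frames back to the model dimension; that step is omitted here.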