Transformer-based encoder-decoder architectures have recently shown promising results in end-to-end speech translation. However, the content-based attention mechanism employed by the Transformer was designed for text sequences and encodes only a global inductive bias, which alone is not sufficient for learning good representations from speech signals. In this work, we address this limitation by placing architectural constraints on the Transformer so that it encodes both local and global inductive biases. This is accomplished by replacing the Transformer encoder with a Conformer encoder that, in contrast to the Transformer encoder, employs convolution in addition to self-attention and feed-forward layers. As a result, the new model, named Conformer-Transformer, has an encoder that captures both local feature correlations and long-range dependencies in speech signals. Experiments on seven non-English-to-English language directions show that the Conformer-Transformer, compared to strong Transformer-based baselines, achieves improvements of up to 3.54 BLEU with a pre-trained encoder and up to 10.53 BLEU when trained from scratch.
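The distinction between local and global inductive bias can be illustrated with a small, self-contained sketch. It is not the Conformer-Transformer implementation: the function names (`self_attention`, `depthwise_conv1d`, `changed_positions`), the toy input, and the fixed smoothing kernel are all assumptions made for illustration, and the sketch omits learned projections, multiple heads, feed-forward layers, normalization, and gating. It perturbs one input frame and reports which output frames change under each operation.

```python
import math

def self_attention(x):
    """Single-head dot-product self-attention with identity projections
    (no learned weights; an illustrative simplification)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out

def depthwise_conv1d(x, kernel):
    """Convolution along the time axis with zero padding, applying the same
    (hypothetical, fixed) kernel to every channel independently."""
    half = len(kernel) // 2
    n_frames, d = len(x), len(x[0])
    out = []
    for t in range(n_frames):
        row = [0.0] * d
        for i, w in enumerate(kernel):
            s = t + i - half
            if 0 <= s < n_frames:
                for j in range(d):
                    row[j] += w * x[s][j]
        out.append(row)
    return out

def changed_positions(f, x, t_perturb, eps=1.0, tol=1e-9):
    """Return the output frames that change when input frame t_perturb is perturbed."""
    base = f(x)
    x2 = [row[:] for row in x]
    x2[t_perturb][0] += eps
    pert = f(x2)
    return [t for t in range(len(x))
            if any(abs(a - b) > tol for a, b in zip(base[t], pert[t]))]

# Toy "speech" sequence: 8 frames with 2 features each (made-up values).
frames = [[math.sin(0.5 * t), math.cos(0.5 * t)] for t in range(8)]
kernel = [0.25, 0.5, 0.25]  # kernel size 3, so each output sees one neighbor per side

conv_changed = changed_positions(lambda x: depthwise_conv1d(x, kernel), frames, 4)
attn_changed = changed_positions(self_attention, frames, 4)

print("convolution:", conv_changed)     # [3, 4, 5]
print("self-attention:", attn_changed)  # every frame index, 0 through 7
```

In this toy setting the convolution propagates the perturbation only to the neighboring frames inside its kernel, while self-attention propagates it to every frame. That contrast is the motivation for combining the two operations in one encoder.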