Conformer is an extension of transformer-based neural ASR models whose fundamental component is the self-attention module. In this paper, we show that we can remove the self-attention module from Conformer and achieve the same or even better recognition performance for utterances up to around 10 seconds long. This is particularly important for streaming interactive voice assistants, where the input is often very short and a fast response is expected. Since the computational complexity of self-attention is quadratic in sequence length, removing it allows for faster and smaller models, two requirements for on-device applications. Based on this finding, we propose Conmer, a neural architecture derived from Conformer but without self-attention, for streaming interactive voice assistants. We conduct experiments on public and real-world data and show that the streaming Conmer reduces WER and computational complexity by a relative 4.03% and 10%, respectively.
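To make the architectural change concrete, below is a minimal sketch, not the authors' implementation, of a Conformer-style block with the multi-head self-attention branch deleted, keeping only the macaron feed-forward pair and the convolution module. All module names, dimensions, the kernel size, and the use of causal left-padding for streaming are assumptions for illustration.

```python
# Hypothetical sketch of a Conformer block without self-attention ("Conmer"-style).
# Module names, hyperparameters, and the causal-padding choice are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """Half-step feed-forward module, as in Conformer's macaron structure."""
    def __init__(self, dim: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * expansion),
            nn.SiLU(),  # Swish activation, as used in Conformer
            nn.Dropout(dropout),
            nn.Linear(dim * expansion, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise -> GLU -> depthwise -> pointwise.
    A left-padded (causal) depthwise convolution keeps the block streamable."""
    def __init__(self, dim: int, kernel_size: int = 15, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)
        self.left_pad = kernel_size - 1  # pad only on the left: no lookahead

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x).transpose(1, 2)          # (B, T, D) -> (B, D, T)
        x = F.glu(self.pointwise1(x), dim=1)      # gated linear unit halves channels
        x = F.pad(x, (self.left_pad, 0))          # causal padding for streaming
        x = self.act(self.bn(self.depthwise(x)))
        x = self.dropout(self.pointwise2(x))
        return x.transpose(1, 2)                  # back to (B, T, D)


class ConmerBlock(nn.Module):
    """Conformer block with the self-attention module removed: per-frame cost
    is constant, so total cost is linear (not quadratic) in utterance length."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff1 = FeedForward(dim)
        self.conv = ConvModule(dim)
        self.ff2 = FeedForward(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)  # half-step residual, as in Conformer
        x = x + self.conv(x)       # the self-attention branch would sit here
        x = x + 0.5 * self.ff2(x)
        return self.norm(x)


# Usage: a batch of 2 utterances, 100 frames, 256-dim features.
block = ConmerBlock(256)
out = block(torch.randn(2, 100, 256))
assert out.shape == (2, 100, 256)
```

In this sketch the convolution module's finite kernel bounds the receptive field per layer, which is the intuition for why short utterances (up to roughly 10 seconds) lose little from dropping the global context that self-attention provides.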