Encoders are crucial for speech recognition, and increasing the computation they perform generally improves feature quality. However, the common way to do so, scaling up model size, becomes increasingly costly as models grow. This paper proposes a novel "contemplative mechanism" that enhances encoder quality without increasing model size. Our core innovation is to strategically interleave special "think tokens" into the sequence of speech tokens during both training and inference. These tokens encourage deeper processing of the original input and lead to improved feature representations. We demonstrate the effectiveness of the mechanism on a range of speech recognition datasets and encoder architectures. Experiments show that inserting a single think token per speech token can yield accuracy gains equivalent to doubling the model size. While our experiments focus on speech, the method holds promise for improving encoders in other modalities as well.
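
To make the interleaving step concrete, the following is a minimal PyTorch-style sketch, assuming an encoder that consumes continuous frame embeddings of dimension `d_model`. The module name `ThinkTokenInterleaver` and the choice of a single shared learnable embedding reused at every position are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ThinkTokenInterleaver(nn.Module):
    """Inserts learnable 'think token' embeddings after every speech token.

    Illustrative sketch only: assumes one shared think embedding is reused
    at each insertion point; the paper's exact parameterization may differ.
    """

    def __init__(self, d_model: int, num_think_tokens: int = 1):
        super().__init__()
        # One shared, learnable embedding per think-token slot.
        self.think_embedding = nn.Parameter(torch.randn(num_think_tokens, d_model))
        self.num_think_tokens = num_think_tokens

    def forward(self, speech_tokens: torch.Tensor) -> torch.Tensor:
        # speech_tokens: (batch, time, d_model)
        batch, time, d_model = speech_tokens.shape
        # Broadcast the think embedding so a copy follows each speech token.
        think = self.think_embedding.expand(batch, time, self.num_think_tokens, d_model)
        # Pair each speech token with its think token(s), then flatten over time:
        # (batch, time, 1 + num_think_tokens, d_model) ->
        # (batch, time * (1 + num_think_tokens), d_model)
        paired = torch.cat([speech_tokens.unsqueeze(2), think], dim=2)
        return paired.reshape(batch, time * (1 + self.num_think_tokens), d_model)
```

With `num_think_tokens = 1`, the encoder's effective sequence length doubles, so the extra computation comes from processing more positions with the same parameters rather than from adding layers or widening the model.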