Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and effectively capture non-phonemic information, making them well-suited to modeling spoken conversation. However, existing E2E approaches often require large-scale training data and struggle to produce semantically coherent responses. In this work, we propose a simple yet effective strategy that leverages a chain-of-thought (CoT) formulation, keeping training on conversational data closely aligned with the multimodal language model (LM)'s pre-training on speech recognition (ASR), text-to-speech synthesis (TTS), and text LM tasks. Our results demonstrate that this approach is highly compute-efficient, enabling the successful training of E2E spoken dialogue systems on publicly available human-human conversation datasets, even with as little as 300 hours of data, as in the case of Switchboard. To support future research, we will publicly release our models and training code.