Enabling continual learning (CL) from an ever-changing environment is highly valuable, but it poses significant challenges for spoken keyword spotting (KWS), which must contend simultaneously with variability in the acoustic characteristics of speech signals and with catastrophic forgetting. In this paper, we propose a novel framework for replay-based CL in KWS that uses a Dual-Memory Multi-Modal (DM3) structure to enhance generalizability and robustness. The dual-memory structure uses a short-term model and a long-term model to adaptively learn near-term and long-term knowledge, while the multi-modal structure exploits the consistency of predictions across multiple speech perturbations to improve robustness. Additionally, we introduce a class-balanced selection strategy that uses confidence scores to sort training samples. Experiments demonstrate the effectiveness of our method over competitive baselines in both class-incremental and domain-incremental learning settings for KWS.
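To make the class-balanced selection idea concrete, the following is a minimal sketch of one way such a strategy could fill a replay buffer. It is an illustration under stated assumptions, not the procedure used in this work: the function name `select_replay_samples`, its parameters, the round-robin filling across classes, and the `prefer_high_confidence` switch are all hypothetical. In particular, the abstract does not state whether high-confidence or low-confidence samples are preferred, so the sketch exposes that choice as a flag.

```python
from collections import defaultdict


def select_replay_samples(sample_ids, labels, confidences, buffer_size,
                          prefer_high_confidence=True):
    """Pick up to `buffer_size` sample ids, balanced across classes.

    Hypothetical helper: samples are grouped by class, sorted by model
    confidence within each class, and then drawn round-robin so that no
    class dominates the buffer.
    """
    # Group (confidence, id) pairs by class label.
    by_class = defaultdict(list)
    for sid, label, conf in zip(sample_ids, labels, confidences):
        by_class[label].append((conf, sid))

    # Sort each class by confidence. The direction is an assumption,
    # controlled by `prefer_high_confidence`.
    for label in by_class:
        by_class[label].sort(key=lambda pair: pair[0],
                             reverse=prefer_high_confidence)

    # Draw one sample per class per round until the buffer is full
    # or every class is exhausted.
    selected = []
    classes = sorted(by_class)
    rank = 0
    while len(selected) < buffer_size:
        added = False
        for label in classes:
            if rank < len(by_class[label]):
                selected.append(by_class[label][rank][1])
                added = True
                if len(selected) == buffer_size:
                    break
        if not added:
            break
        rank += 1
    return selected
```

With a buffer of four slots and three keyword classes, this sketch takes the top-ranked sample of every class before taking a second sample from any class, so rare keywords keep a place in the buffer even when frequent keywords have many confident samples.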