Adding a modality such as speech to multimodal large language models (LLMs) increases their vulnerability to adversarial jailbreak attacks. Adversarial training (AT) has shown great promise as a defense in the traditional adversarial robustness literature, but it remains underexplored as a countermeasure for speech-enabled LLMs due to the limited availability of training data and its computational cost. In this work, we develop AT techniques tailored to speech LLMs using a combination of synthesized harmful and benign queries. We experiment with different training data configurations and evaluate the methods against strong white-box adversarial attacks. Through extensive ablations, we demonstrate that using just 4 hours of harmful speech queries for AT (alongside 150 hours of benign speech) provides significant gains over vanilla safety fine-tuning, improving safety by 45%-300% relative, depending on the model.
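To make the setup concrete, the following is a minimal sketch, under our own assumptions rather than the paper's exact recipe, of one adversarial training step on raw speech inputs: harmful speech queries are perturbed with an L-infinity PGD attack and mixed with clean benign queries. The toy model, labels, hyperparameters, and random data are placeholders for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySpeechModel(nn.Module):
    """Stand-in for a speech-enabled LLM: a small conv encoder plus a linear head."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples), raw audio roughly in [-1, 1]
        h = self.encoder(wav.unsqueeze(1)).squeeze(-1)
        return self.head(h)

def pgd_perturb(model, wav, labels, eps=0.01, alpha=0.0025, steps=5):
    """L-infinity PGD on the waveform: maximize the loss within an eps-ball."""
    delta = torch.zeros_like(wav, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(wav + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (wav + delta).detach()

def adversarial_training_step(model, optimizer, benign, harmful):
    """Train on clean benign queries mixed with adversarially perturbed harmful queries."""
    wav_b, y_b = benign    # benign speech, label 0 (comply)
    wav_h, y_h = harmful   # harmful speech, label 1 (refuse)
    wav_h_adv = pgd_perturb(model, wav_h, y_h)   # attack only the harmful queries
    wav, y = torch.cat([wav_b, wav_h_adv]), torch.cat([y_b, y_h])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(wav), y)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToySpeechModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    benign = (torch.randn(4, 16000) * 0.1, torch.zeros(4, dtype=torch.long))
    harmful = (torch.randn(4, 16000) * 0.1, torch.ones(4, dtype=torch.long))
    print(adversarial_training_step(model, opt, benign, harmful))
```

In this sketch the benign-to-harmful mixing ratio and the PGD budget are arbitrary; in practice these correspond to the training data configurations and attack strengths ablated in the paper.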