Benchmarks for language-guided embodied agents typically assume text-based instructions, but deployed agents will encounter spoken instructions. While Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous ASR transcripts can hurt the agents’ ability to complete tasks. We propose training a multimodal ASR model that uses the accompanying visual context to reduce errors in spoken instruction transcripts. We train our model on a dataset of synthetic spoken instructions, derived from the ALFRED household task dataset, in which we simulate acoustic noise by systematically masking spoken words. We find that utilizing visual observations facilitates masked word recovery, with multimodal ASR models recovering up to 30% more masked words than unimodal baselines. We also find that spoken instructions transcribed by multimodal ASR models result in higher task completion success rates for a language-guided embodied agent.

Code: github.com/Cylumn/embodied-multimodal-asr