Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal influences. During interpretation of speech, visual context seems to steer speech processing and vice versa. We present a real-time multimodal system motivated by these findings that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first acquires a grammar and a visually grounded lexicon from a "show-and-tell" procedure where the training input consists of camera images consisting of sets of objects paired with verbal object descriptions. Given a new scene, the system generates a dynamic visually-grounded language model and drives a dynamic model of visual attention to steer speech recognition search paths towards more likely word sequences.