We present a new approach to using contextual information to enhance speech recognition and understanding. Dynamically inferred knowledge about the context is used in addition to static linguistic and domain-specific knowledge. Based on the results of image analysis of a given scene, language models are generated for the constituents of possible utterances concerning that scene.
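
To make the idea concrete, the following minimal sketch derives a small language model for noun-phrase constituents from a scene description. Everything in it is an illustrative assumption rather than the method described here: the `SCENE` structure stands in for the output of an image-analysis stage, the phrase templates in `noun_phrases` are hypothetical, and a maximum-likelihood bigram model is used only as one simple choice of language model.

```python
from collections import Counter

# Hypothetical output of an image-analysis stage: detected objects with attributes.
SCENE = [
    {"type": "cube", "color": "red"},
    {"type": "cube", "color": "blue"},
    {"type": "bolt", "color": "red"},
]


def noun_phrases(scene):
    """Enumerate simple noun phrases that could refer to objects in the scene.

    The two templates ("the TYPE", "the COLOR TYPE") are assumptions made
    for illustration only.
    """
    phrases = []
    for obj in scene:
        phrases.append(("the", obj["type"]))
        phrases.append(("the", obj["color"], obj["type"]))
    return phrases


def bigram_model(phrases):
    """Estimate maximum-likelihood bigram probabilities P(w2 | w1)."""
    pair_counts = Counter()
    context_counts = Counter()
    for phrase in phrases:
        # Pad each phrase with start and end markers.
        tokens = ("<s>",) + phrase + ("</s>",)
        for w1, w2 in zip(tokens, tokens[1:]):
            pair_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


model = bigram_model(noun_phrases(SCENE))
```

In this sketch, only words licensed by the current scene receive probability mass, so a word such as "green" is absent from the model until an object with that attribute is detected. A practical system would presumably combine such a dynamic model with the static linguistic and domain-specific knowledge rather than replace it.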