In this paper, a semi-tight coupling between the visual and auditory modalities is proposed: in particular, eye-fixation information is used to enhance the output of speech recognition systems. This is achieved by treating natural human eye fixations as deictic references to symbolic objects and passing this information on to the speech recognizer. The speech recognizer biases its search toward this set of symbols/words during the best-word-sequence search. As an illustrative example, the TRAINS interactive planning assistant system has been used as a test-bed; eye fixations provide important cues to the city names that the user sees on the map. Experimental results indicate that eye fixations help reduce speech recognition errors. This work suggests that integrating information from different interfaces so that they bootstrap each other would enable the development of reliable and robust interactive multi-modal human-computer systems.
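To make the biasing idea concrete, the following is a minimal sketch, not the paper's implementation, of how gaze-derived deictic cues might raise the scores of fixated words before decoding; the fixation-to-word mapping, the boost factor, and the toy city vocabulary are all assumptions introduced here for illustration.

```python
import math

def bias_word_scores(word_log_probs, fixated_words, boost=math.log(5.0)):
    """Raise the log-probability of words the user is currently fixating,
    then renormalize so the scores remain a valid distribution.
    (Hypothetical helper; the actual recognizer integration is not shown.)"""
    biased = {
        w: lp + (boost if w in fixated_words else 0.0)
        for w, lp in word_log_probs.items()
    }
    log_total = math.log(sum(math.exp(lp) for lp in biased.values()))
    return {w: lp - log_total for w, lp in biased.items()}

if __name__ == "__main__":
    # Toy vocabulary of map city names with uniform prior scores (assumed).
    vocab = {name: math.log(0.25)
             for name in ("Avon", "Bath", "Corning", "Dansville")}
    # Suppose eye tracking reports a fixation near "Corning" on the map.
    print(bias_word_scores(vocab, fixated_words={"Corning"}))
```

In this sketch the boost is applied to the word scores themselves; in a full system the same effect could instead be folded into the language-model component of the best-word-sequence search.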