In this paper we present recent work on integrating visual information (automatic lip-reading) with acoustic speech to improve overall speech recognition. We have developed a modular system for flexible human-computer interaction via speech. To give the speaker reasonable freedom of movement within a room, the speaker's face is automatically acquired and followed by a face-tracker subsystem, which delivers constant-size, centered images of the face in real time. A lip-tracker module then extracts the image of the lips from the camera image of the speaker's face, also in real time. Furthermore, we show how the system copes with problems of real environments, such as varying illumination and image sizes, and how it adapts automatically to different noise conditions.
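To make the modular pipeline concrete, the sketch below mirrors the described processing chain: face tracking yields a constant-size, centered face crop, the lip tracker extracts the mouth region, and the acoustic/visual streams are fused with an SNR-dependent weight as one simple form of noise adaptation. Everything here is an illustrative assumption, not the paper's actual method: the crop size `FACE_SIZE`, the centered-face and lower-third-mouth heuristics, and the linear SNR weighting schedule are all placeholders.

```python
import numpy as np

FACE_SIZE = 128  # side length of the normalized face crop (assumption)

def track_face(frame):
    """Stand-in for the face-tracker subsystem: return a constant-size,
    centered crop. A real tracker would localize the face; here we
    simply assume it lies near the image center."""
    h, w = frame.shape[:2]
    cy, cx = h // 2, w // 2
    r = FACE_SIZE // 2
    return frame[cy - r:cy + r, cx - r:cx + r]

def track_lips(face):
    """Stand-in for the lip-tracker module: crop the lip region,
    assuming the mouth lies in the lower third of the face image."""
    h = face.shape[0]
    return face[2 * h // 3:, :]

def combined_score(p_acoustic, p_visual, snr_db):
    """Weighted log-linear fusion of acoustic and visual hypothesis
    scores; the acoustic weight shrinks as the SNR drops, which is one
    simple way to adapt to different noise conditions (illustrative)."""
    lam = np.clip(snr_db / 30.0, 0.0, 1.0)  # acoustic weight in [0, 1]
    return lam * np.log(p_acoustic) + (1.0 - lam) * np.log(p_visual)

# Toy end-to-end pass over a single grayscale camera frame.
frame = np.random.rand(480, 640)
lips = track_lips(track_face(frame))   # face -> normalized crop -> lips
print(lips.shape)                      # lip region of the 128x128 crop
print(combined_score(0.6, 0.7, snr_db=10.0))
```

In a real-time system each stage would run per frame on the live camera stream, with the fusion weight re-estimated as acoustic conditions change; the fixed thresholds above stand in for those adaptive components.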