If multimodal systems are modeled strictly on the pattern of natural conversation, problems arise from the transient nature of speech, which strongly occupies the user's attention. The concept of the visual utterance is introduced, which allows for strictly user-driven interaction while preserving a conversational style of communication. Important features of visual utterances are that clarification and modification of user requests can be based on a structured presentation of the system's interpretation; that the user can interact with these presentations not only by speech but also by gestures and textual input; and that the system's view of focus is presented visually by Smileys attached to the utterance. Certain problems of visual utterances are discussed. The problem of distinguishing between deictic and manipulative uses of gestures is also touched upon.