Compared to graphical user interfaces, speech-only interfaces face several problems: robustness, making clear what functionality is available, and making clear how that functionality may be accessed. We explore a potential solution to these problems: presenting a visual representation of the domain of discourse and of the state of the dialogue. We describe an experiment in which uni-modal and multi-modal interfaces are compared in terms of effectiveness, efficiency and satisfaction. The results of the experiment show a strong learning effect: subjects who start with the multi-modal interface have a strong advantage when they subsequently switch to the uni-modal (speech-only) interface, compared to subjects who start with the uni-modal interface and switch to the multi-modal interface later. The results are discussed in terms of the need to establish an appropriate user model as early as possible. We discuss implications of this interpretation for interaction design.