This talk will consider the multimodal nature of speech and speech technology. Human speech communication is extremely rich. We use many elements to communicate, from words to gestures and eye gaze, and seamlessly interpret these cues in conversation. In noisy situations, humans appear to dynamically change their use of different modalities in response to their environment. Is exploiting multimodality therefore the solution to developing speech processing algorithms that are robust in everyday environments? In this talk, I’ll look at how visual and linguistic information can be integrated into deep learning frameworks for audio-visual speech recognition and turn-taking prediction. I’ll also look at how the availability of suitable datasets, with adequate labelling, can help or hinder development in this domain.