With recent advances in machine learning, automated dialogue systems have become increasingly capable of producing coherent language-based interactions. However, most work on automated spoken language understanding still uses only text transcriptions, i.e., just the lexical content of speech. This ignores the fact that the way we speak can change how our words are interpreted. In particular, speech prosody, e.g. the pitch, energy, and timing characteristics of speech, can be used to signal speaker intent in spoken dialogue. In fact, prosodic features can help with the automatic detection of both dialogue structure and speaker affect/states. In this talk, I will discuss our recent work on how we can combine non-lexical and lexical aspects of speech to improve speech understanding tasks, such as emotion recognition, and how new approaches to self-supervised learning from speech might help us make the most of the true richness of speech.
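
As a purely illustrative sketch, not the system described in the talk, one simple way to combine lexical and prosodic information for emotion recognition is late fusion: project a text embedding and a small vector of utterance-level prosodic statistics into a shared space, concatenate them, and classify. The feature dimensions, layer sizes, class count, and prosodic statistics below are assumptions chosen only for illustration.

# Hypothetical late-fusion emotion classifier: an illustrative sketch,
# not the model from the talk. Dimensions and labels are assumptions.
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, text_dim=768, prosody_dim=3, hidden_dim=128, num_classes=4):
        super().__init__()
        # Project each modality into a shared hidden space before fusing.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.prosody_proj = nn.Linear(prosody_dim, hidden_dim)
        # Classify from the concatenated (fused) representation.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden_dim, num_classes),
        )

    def forward(self, text_emb, prosody_feats):
        fused = torch.cat(
            [self.text_proj(text_emb), self.prosody_proj(prosody_feats)], dim=-1
        )
        return self.classifier(fused)

# Example usage with dummy inputs: a batch of 2 utterances, each with a
# 768-dim sentence embedding and 3 utterance-level prosodic statistics
# (e.g. mean pitch, mean energy, speaking rate).
model = FusionEmotionClassifier()
text_emb = torch.randn(2, 768)
prosody_feats = torch.randn(2, 3)
logits = model(text_emb, prosody_feats)  # shape: (2, num_classes)
print(logits.shape)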