Most research on task-oriented dialog modeling is based on written text input. However, practical dialog systems often use spoken input, which is typically converted to text with automatic speech recognition (ASR) systems that are error-prone. Furthermore, most systems do not account for the differences between written and spoken language. Research on this topic has been stymied by the lack of a public corpus. Motivated by these considerations, we hosted a speech-aware dialog state tracking challenge and created a public corpus that can be used to investigate the performance gap between written and spoken input. We created three spoken versions of the popular written-domain MultiWOZ task and provide waveforms, ASR transcripts, and audio encodings to encourage wider participation from teams that may not have access to ASR systems. In this paper, we describe the corpus, report results from the participating teams, provide preliminary analyses of their results, and summarize the current state of the art in this domain.