ISCA Archive SLAM 2013

Multi-modal conversational search and browse

Larry Heck, Dilek Hakkani-Tür, Madhu Chinthakunta, Gokhan Tur, Rukmini Iyer, Partha Parthasarathy, Lisa Stifelman, Elizabeth Shriberg, Ashley Fidler

In this paper, we create an open-domain conversational system by combining the power of internet browser interfaces with multi-modal inputs and data mined from web search and browser logs. The work focuses on two novel components: (1) dynamic contextual adaptation of speech recognition and understanding models using visual context, and (2) fusion of users’ speech and gesture inputs to understand their intents and associated arguments. The system was evaluated in a living-room setup with live test subjects on a real-time implementation of the multi-modal dialog system. Users interacted with a television browser using gestures and speech; gestures were captured by Microsoft Kinect skeleton tracking, and speech was recorded by a Kinect microphone array. Results show a 16% error rate reduction (ERR) from contextual ASR adaptation to clickable web page content, and a 7-10% ERR when using gestures together with speech. Analysis of the results suggests a strategy for selecting the multimodal intent when users clearly and persistently indicate pointing intent (e.g., eye gaze), giving a 54.7% ERR over lexical features alone.
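The abstract's second component, combining speech and gesture inputs to pick an intent, can be illustrated with a minimal late-fusion sketch. This is a hypothetical example, not the paper's method: the function name `fuse_intents`, the intent labels, the score values, and the interpolation weight are all illustrative assumptions.

```python
# Hypothetical late-fusion sketch: interpolate per-intent scores from the
# speech channel with scores from the gesture (pointing) channel.
# Names, labels, and the weight are illustrative, not from the paper.

def fuse_intents(speech_scores, gesture_scores, gesture_weight=0.4):
    """Combine per-intent scores from speech and gesture channels.

    speech_scores / gesture_scores: dicts mapping intent -> score in [0, 1].
    An intent absent from one channel contributes 0 from that channel.
    Returns the top fused intent and the full fused score table.
    """
    intents = set(speech_scores) | set(gesture_scores)
    fused = {
        intent: (1 - gesture_weight) * speech_scores.get(intent, 0.0)
        + gesture_weight * gesture_scores.get(intent, 0.0)
        for intent in intents
    }
    return max(fused, key=fused.get), fused


# Example: speech alone favors "search", but a confident pointing gesture
# shifts the fused decision toward selecting the item pointed at.
best, scores = fuse_intents(
    {"search": 0.6, "select_item": 0.3},
    {"select_item": 0.9},
)
```

A fixed interpolation weight is the simplest choice; the abstract's observation about persistent pointing suggests one could instead raise the gesture weight when the pointing signal is clear and sustained.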

Index Terms: spoken dialog systems, spoken language understanding, multi-modal fusion, conversational search, conversational browsing.

Cite as: Heck, L., Hakkani-Tür, D., Chinthakunta, M., Tur, G., Iyer, R., Parthasarathy, P., Stifelman, L., Shriberg, E., Fidler, A. (2013) Multi-modal conversational search and browse. Proc. First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013), 96-101

@inproceedings{heck13_slam,
  author={Larry Heck and Dilek Hakkani-Tür and Madhu Chinthakunta and Gokhan Tur and Rukmini Iyer and Partha Parthasarathy and Lisa Stifelman and Elizabeth Shriberg and Ashley Fidler},
  title={{Multi-modal conversational search and browse}},
  year=2013,
  booktitle={Proc. First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013)},
  pages={96--101}
}