ISCA Archive AVIOS 2012

New developments in spoken query transcription

Jonathan Mamou, Abhinav Sethy, Bhuvana Ramabhadran

The rapid growth of mobile devices that can browse the Internet has opened up interesting application areas for speech and natural language processing technologies. Voice search is one such application, where speech technology is making a big impact by enabling people to access the Internet conveniently from mobile devices. Spoken queries are a natural medium for searching the mobile Web, especially in the common case where typing on the device keyboard is not practical. Voice search is now recognized as a core feature of mobile devices, and several applications have been developed. Generally, in such applications, a spoken query is automatically recognized and the Automatic Speech Recognition (ASR) 1-best hypothesis is sent to a text-based web search engine. Modeling the distribution of words in spoken queries poses different challenges from more conventional speech applications. The differences arise from the fact that the voice search application serves as a front-end to web search engines. Users typically provide the search engine with the keywords that will aid them in retrieving the information they are interested in. Spoken web queries, especially keyword-style queries, are typically short and do not follow the syntax and grammar observed in other ASR tasks. A natural approach to language modeling for spoken queries is to exploit a variety of search query logs. However, publicly available large corpora of search query logs are rare and in most cases difficult to collect from the Internet. In this paper, we propose two approaches to improve the language model (LM) for voice search ASR systems; these approaches do not rely on the availability of search engine query log data and thus have broader application. First, we propose to extract named entities from web textual data such as web crawls and treat them as a substitute for query data.
An LM targeted towards keywords and query terms is generated and combined with a more general n-gram LM. Second, we look at measures of semantic relatedness between query terms. The semantic relatedness between the keywords of a spoken query stems from their co-occurrence in the same web document or context, even if the keywords are not necessarily adjacent or ordered in the same way as in the query. Our approach is thus based on the idea that if the ASR hypothesis terms tend to co-occur frequently in the searched corpus, the hypothesis is more likely to be correct. The co-occurrence models presented in this paper for the voice search task provide supplementary information to the conventional n-gram statistical LM. We present various types of co-occurrence constraints and scoring functions that capture different forms of semantic relationship between query terms. We show that named-entity and co-occurrence information gives a 2.4% relative accuracy improvement compared to the best baseline from an unpruned n-gram model.
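
To make the co-occurrence idea concrete, the following is a minimal sketch of how an N-best list of ASR hypotheses might be rescored with document-level co-occurrence statistics. It is an illustration under stated assumptions, not the scoring functions used in the paper: the Jaccard coefficient over document frequencies, the log-linear combination with the ASR score, and the names `build_cooccurrence_counts`, `cooccurrence_score`, `rescore`, and `weight` are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_counts(documents):
    """Count documents containing each term and each unordered term pair."""
    term_df = Counter()
    pair_df = Counter()
    for doc in documents:
        # Sorting makes every pair key (a, b) satisfy a < b, so order in the
        # document or the query does not matter.
        terms = sorted(set(doc.lower().split()))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return term_df, pair_df


def cooccurrence_score(hypothesis, term_df, pair_df):
    """Average Jaccard coefficient over all term pairs of a hypothesis.

    Jaccard is an assumed stand-in for a co-occurrence scoring function.
    """
    terms = sorted(set(hypothesis.lower().split()))
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        joint = pair_df[(a, b)]
        union = term_df[a] + term_df[b] - joint
        if union > 0:  # both terms unseen -> contributes 0
            total += joint / union
    return total / len(pairs)


def rescore(nbest, term_df, pair_df, weight=1.0):
    """Re-rank (text, asr_log_score) pairs by ASR score plus weighted co-occurrence."""
    scored = [
        (text, asr + weight * cooccurrence_score(text, term_df, pair_df))
        for text, asr in nbest
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy corpus and N-best list, invented for illustration only.
    corpus = [
        "cheap flights to new york",
        "new york pizza delivery",
        "flights from boston to new york",
        "knew fork lifts for sale",
    ]
    term_df, pair_df = build_cooccurrence_counts(corpus)
    nbest = [("knew york flights", -10.0), ("new york flights", -10.2)]
    for text, score in rescore(nbest, term_df, pair_df):
        print(f"{score:.3f}  {text}")
```

In this toy setup the hypothesis whose terms appear together in more documents moves ahead of an acoustically preferred but semantically incoherent alternative. In practice the interpolation weight would need tuning on held-out data, and the co-occurrence score would supplement rather than replace the n-gram LM score.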