Most conversational understanding (CU) systems today employ a cascade approach, where the best hypothesis from automatic speech recognizer (ASR) is fed into spoken language understanding (SLU) module, whose best hypothesis is then fed into other systems such as interpreter or dialogue manager. In such approaches, errors from one statistical module irreversibly propagates into another module causing a serious degradation in the overall performance of the conversational understanding system. Thus it is desirable to jointly optimize all the statistical modules together. As a first step towards this, in this paper, we propose a joint decoding framework in which we predict the optimal word as well as slot (semantic tag) sequence jointly given the input acoustic stream. On Microsoft's CU system, we show 1.3% absolute reduction in word error rate (WER) and 1.2% absolute improvement in F measure for slot prediction when compared to a very strong cascade baseline comprising of the state-of-the-art recognizer followed by a slot sequence tagger.
Index Terms: ME, CRF, SLU, CU, ASR