ISCA Archive ICSLP 2002

N-word-sequence frequency noise mitigation for SLM based on binomial distribution

Yibao Zhao, Guojun Zhou

It is often difficult to build a robust Statistical Language Model (SLM) for a domain-specific spoken dialogue system because collecting enough data for a specific domain is very challenging. One solution is to build an SLM based on domain-specific grammar rules, which does not require collecting a large amount of data. A number of studies have found this solution effective and encouraging. However, the statistical information obtained from domain-specific grammar rules cannot correctly represent the distribution of n-word-sequences in real applications, which results in undesirable performance. It is observed that the n-word-sequence frequency-of-frequency distribution obtained from a general-purpose corpus has a smooth curve, while the n-word-sequence frequency-of-frequency distribution obtained from domain grammar rules does not. Based on the assumption that each n-word-sequence in real applications normally follows a binomial distribution, this paper proposes a pair of n-word-sequence frequency smoothing algorithms, called the Coast Algorithm and the Tide Algorithm, which can significantly mitigate the "noise" present in the n-word-sequence frequency-of-frequency distribution obtained directly from domain-specific grammar rules. Our experiments with a domain-specific spoken dialogue system show that an SLM generated from domain-specific grammar rules and then smoothed using the Coast and Tide algorithms reduces the TER (Tag Error Rate) by 13.02% (relative). These two algorithms can therefore improve system performance.
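
For readers unfamiliar with the term, the frequency-of-frequency distribution referred to above counts, for each frequency r, how many distinct n-word-sequences occur exactly r times. The following is a minimal sketch of that computation only; it is not the Coast or Tide algorithm, which the abstract does not specify. The function names, the whitespace tokenization, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count each n-word sequence over whitespace-tokenized sentences.

    Whitespace tokenization is an assumption for this sketch; a real
    system would use its own tokenizer.
    """
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def frequency_of_frequency(counts):
    """Map r -> number of distinct n-word sequences occurring exactly r times."""
    return Counter(counts.values())


# Hypothetical toy corpus, for illustration only (not data from the paper).
toy_corpus = [
    "show me flights to boston",
    "show me flights to denver",
    "show me fares to boston",
]

bigrams = ngram_counts(toy_corpus, 2)
fof = frequency_of_frequency(bigrams)

for r in sorted(fof):
    print(f"{fof[r]} distinct bigram(s) occur {r} time(s)")
```

Plotting the number of sequences against r for a general-purpose corpus and for text generated from grammar rules would be one way to inspect the smoothness difference the abstract describes.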