The performance of current speech recognition systems deteriorates significantly in strongly noisy environments. This deterioration can be attributed to background noise and to the Lombard effect (LE). Attempts to build LE-robust systems often exhibit a tradeoff between LE-specific improvements and portability to neutral speech. For LE-robust recognition, it therefore seems more effective to use a set of condition-dedicated subsystems driven by a condition classifier than to aim for a single universal recognizer.
This paper focuses on the design of a two-stage recognition system (TSR) comprising a talking-style classifier (neutral/LE) followed by two style-dedicated recognizers that differ in their input features. First, the binary neutral/LE classifier is built, with particular interest in developing suitable features for the classification. Second, the performance of common speech features (MFCC, PLP), LE-robust features (Expolog), and newly proposed features is compared in neutral/LE digit recognition tasks. In addition, robustness to changes in average speech pitch and to various noise backgrounds is evaluated. Third, the TSR is built, employing two recognizers, each using style-specific features. Comparison of the proposed system with either a neutral-specific or an LE-specific recognizer on joint neutral/LE speech shows an improvement of 6.5→4.2 % WER on neutral and 48.1→28.4 % WER on LE Czech utterances.
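The two-stage routing and the reported WER figures can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the names `make_tsr` and `relative_wer_reduction`, the toy classifier with its 0.5 threshold, and the two placeholder recognizers are all hypothetical stand-ins chosen only to show the control flow (classify the talking style first, then dispatch to the style-dedicated recognizer).

```python
from typing import Callable, Dict, List, Tuple

Utterance = List[float]  # placeholder for a waveform or a feature sequence


def make_tsr(
    classify_style: Callable[[Utterance], str],
    recognizers: Dict[str, Callable[[Utterance], str]],
) -> Callable[[Utterance], Tuple[str, str]]:
    """Build a two-stage recognizer: style classifier, then style-dedicated recognizer."""

    def recognize(utt: Utterance) -> Tuple[str, str]:
        style = classify_style(utt)           # stage 1: "neutral" or "LE"
        return style, recognizers[style](utt)  # stage 2: dispatch by style

    return recognize


def relative_wer_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction in percent, given two absolute WERs in percent."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical stand-ins (assumptions, not the paper's models or features).
def toy_classifier(utt: Utterance) -> str:
    # Arbitrary threshold on the mean value, for illustration only.
    return "LE" if sum(utt) / len(utt) > 0.5 else "neutral"


def toy_neutral_recognizer(utt: Utterance) -> str:
    return "decoded by neutral recognizer"


def toy_le_recognizer(utt: Utterance) -> str:
    return "decoded by LE recognizer"


tsr = make_tsr(
    toy_classifier,
    {"neutral": toy_neutral_recognizer, "LE": toy_le_recognizer},
)

if __name__ == "__main__":
    print(tsr([0.1, 0.2, 0.3]))
    print(tsr([0.9, 0.8, 0.7]))
    print(round(relative_wer_reduction(6.5, 4.2), 1))
    print(round(relative_wer_reduction(48.1, 28.4), 1))
```

Applied to the figures above, the helper gives a relative WER reduction of roughly 35 % on neutral speech and roughly 41 % on LE speech.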