Mandarin digit string recognition (MDSR) is a difficult task in automatic speech recognition (ASR), and using pitch features can significantly improve performance. In conventional pitch feature extraction, random values are commonly used as the pitch output in unvoiced (UV) frames, which causes serious statistical confusion between voiced (V) and UV units and incurs abnormal likelihoods in decoding. In this paper, we propose to normalize the distribution of the random values assigned to UV frames, which avoids these side effects and introduces extra discriminative information into the statistics. In addition, voice-level (VL), an intermediate parameter used in pitch estimation for the V/UV decision, is adopted to expand the acoustic feature stream. VL features indicate the strength of periodicity in speech frames and provide complementary information for ASR. In the experiments, the proposed methods significantly improve the accuracy of MDSR and achieve sentence error reduction rates (ERR) of 13.3% and 15.1% over the baseline on the free-length and 6-digit test sets, respectively.
Index Terms: Mandarin digit string recognition, automatic speech recognition, pitch feature extraction