ISCA Archive Eurospeech 1993

Talker enrollment for speech recognition by synthesis

Richard Brierton, Nigel Sedgwick

In Recognition by Synthesis (RbS), subword models of the type used in Text-to-Speech (TtS) synthesis are used for speech recognition. These subword models are variable-duration acoustic-phonetic segments. The segment parameters differ from talker to talker, a complete set being called a Talker Characterisation Table (TCT). We describe algorithms for automatically tuning a TCT to the speech of a particular talker, using connected speech enrolment utterances covering the whole phonetic range, manually transcribed at the phonemic level. Speech synthesised using the tuned TCT sounds more natural and more like that of the enrolled talker, when using both synthetic and copied natural prosody, for utterances inside and outside the enrolment set. Thus a generic rather than an utterance-specific TCT has been produced. Algorithms are also described for automatically transcribing speech into a sequence of acoustic-phonetic segments, constrained only by the phonotactics of the language and using a TCT tuned to the talker.

Keywords: automatic speech recognition, recognition by synthesis, talker enrolment.