We examined the content of six talk-show TV programs to better understand the challenges this genre poses to automatic transcription. The selected programs were first segmented, transcribed, and annotated by experts. Most of the speech was found to be conversational in style, with a significant portion, about 18%, consisting of overlapped speech. Automatic speech recognition experiments were then conducted, showing that recognition performance on talk-show programs, 28.3% word error rate (WER), is much worse than that achieved on broadcast news programs, 10.9% WER. For talk shows, performance varied markedly between non-overlapped speech, 21.8% WER, and overlapped speech, 58.5% WER. On clean, non-overlapped speech an 18.7% WER is achieved; this result is significantly worse than that achieved for the dominant condition in broadcast news programs, namely clean read/planned speech from the anchormen, 7.6% WER.
Index Terms: broadcast conversations, overlapping speakers, spontaneous speech, automatic transcription