ISCA Archive Interspeech 2014
Speech-to-text technology to transcribe and disclose 100,000+ hours of bilingual documents from historical Czech and Czechoslovak radio archive

Jan Nouza, Petr Cerva, Jindrich Zdansky, Karel Blavka, Marek Bohac, Jan Silovsky, Josef Chaloupka, Michaela Kucharova, Ladislav Seps, Jiri Malek, Michal Rott

In this paper, we present the outcome of a 4-year project whose ultimate goal is to develop a complex platform that can transcribe, index and make searchable the historical archive of Czech and Czechoslovak Radio. The archive covers 90 years of public broadcasting and contains hundreds of thousands of audio documents. The developed modular platform employs our LVCSR system, which has to cope with two related languages: Czech and Slovak. Furthermore, it must deal with audio files of varying quality (e.g. recordings originally stored on matrices or tapes, data passed through analog and digital telephone lines, speech recorded during parliament or court sessions, etc.). The system includes speaker and language identification modules, a narrow-band signal detector, a music/song detector, and several other components that enhance transcription accuracy and support multi-optional search. We evaluate the performance on broadcast news test sets grouped by decade. We show that after acoustic and language model adaptation, WER values are in the range of 8–14% and do not differ much from the 1960s to the present. We also report results achieved on other types of documents (e.g. talk shows, political debates, public speeches, etc.), where the WER is higher but still acceptable for most search tasks.
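
For readers unfamiliar with the metric, the word error rate (WER) quoted above is conventionally the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' scoring tool; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, with no
    text normalization (case, punctuation, numerals) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 8–14% corresponds to roughly 8 to 14 word errors per 100 reference words. Because insertions are counted, the value can exceed 100% for very poor hypotheses.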