ISCA Archive Interspeech 2012

Online story segmentation of multilingual streaming broadcast news

Amit Srivastava, Saurabh Khanwalkar, Gretchen Markiewicz, Guruprasad Saikumar

We present an online story segmentation approach for Broadcast News (BN) that is built upon and integrated into BBN's commercial off-the-shelf (COTS) multilingual Broadcast Monitoring System (BMS). We take a discriminative model-based approach, using Support Vector Machines to segment BN transcriptions into thematically coherent stories within the real-time constraints defined by BMS. We extract lexical, topical, and story-boundary cue features from source-language transcriptions, machine-translated (MT) English, and metadata generated by BMS. We leverage BBN's Topic Classification technique to extract topic persistence features, and incorporate topic-supporting words and topic clusters to encode thematic transitions. Using the discriminative model-based approach, we obtain a relative gain of 27.9% on English BN and 22.0% on Arabic BN over a rule-based system. We also demonstrate a relative improvement of 11.8% in segmentation performance when using features extracted from MT English rather than Arabic source features. We highlight the impact of topic model training on our story segmentation approach by varying the corpus size, achieving a 13.7% relative gain with an increase in the number of topics.

Index Terms: story segmentation, topic classification, topic modeling
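
The boundary-classification framing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cue phrases, the two features, the window size, and the hand-set weights are all assumptions made for illustration. In a discriminative system of the kind described, the weights and bias of the linear decision function would be learned by an SVM from labeled boundaries, and the feature vector would also carry topic persistence features, which are omitted here.

```python
import math
from collections import Counter

# Hypothetical cue phrases; the abstract does not list the actual cue features.
CUE_PHRASES = ("in other news", "turning to", "back to you")


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def boundary_features(sentences, i, window=2):
    """Features for the candidate boundary just before sentences[i].

    Returns [lexical distance between the left and right windows,
             1.0 if sentences[i] opens with a cue phrase else 0.0].
    """
    left = Counter(w for s in sentences[max(0, i - window):i]
                   for w in s.lower().split())
    right = Counter(w for s in sentences[i:i + window]
                    for w in s.lower().split())
    has_cue = any(sentences[i].lower().startswith(c) for c in CUE_PHRASES)
    return [1.0 - cosine(left, right), 1.0 if has_cue else 0.0]


def linear_score(features, weights, bias):
    """Linear decision function; an SVM would learn weights and bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def segment(sentences, weights, bias, window=2):
    """Return indices i where a story boundary is predicted before sentences[i].

    Only a fixed window of following sentences is inspected, which is one
    simple way to respect an online (streaming) constraint.
    """
    return [i for i in range(1, len(sentences))
            if linear_score(boundary_features(sentences, i, window),
                            weights, bias) > 0]


if __name__ == "__main__":
    transcript = [
        "the election results came in tonight",
        "voters chose the new election board",
        "in other news the storm hit the coast",
        "the storm damaged coast roads",
    ]
    # Hand-set weights for illustration only, not learned values.
    print(segment(transcript, weights=[1.0, 1.0], bias=-1.2))
```

With these illustrative weights, a boundary is predicted only where a cue phrase coincides with a drop in lexical overlap between the left and right windows.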