We present an unsupervised technique, termed story co-segmentation, that automatically extracts the stories on a common topic shared by a pair of Chinese broadcast news transcripts. Unlike classical topic tracking, which usually relies on previously trained topic models, our method is purely data-driven and simultaneously determines the common stories of the input texts. Specifically, we propose an iterative four-step Markov random field (MRF) solution to the story co-segmentation problem using lexical cues only. We first construct a sentence-level graph representation of the input news transcripts and initialize the foreground and background labeling by lexical clustering. We then update both the foreground and background models based on the current labeling. Story co-segmentation is formalized as a Gibbs energy minimization problem that balances foreground/background likelihood, intra-document coherence, and inter-document similarity. Finally, the labeling is refined by hybrid optimization combining quadratic pseudo-Boolean optimization (QPBO) and belief propagation (BP). The effectiveness of our method has been validated on a real-world CCTV corpus.
Index Terms: story co-segmentation, foreground and background story modeling, lexical clustering, MRF, QPBO, belief propagation (BP)
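To make the energy formulation concrete, the following is a minimal sketch of a Gibbs energy of the form described above, assuming a binary foreground/background label $l_i$ per sentence, pairwise potentials over intra- and inter-document sentence edges, and illustrative weights $\lambda_{\text{intra}}, \lambda_{\text{inter}}$; the exact data and smoothness terms of the proposed method may differ.

\begin{equation}
E(L) \;=\; \underbrace{\sum_{i} D_i(l_i)}_{\text{fg/bg likelihood}}
\;+\; \lambda_{\text{intra}} \underbrace{\sum_{(i,j)\in\mathcal{E}_{\text{intra}}} V_{ij}(l_i,l_j)}_{\text{intra-document coherence}}
\;+\; \lambda_{\text{inter}} \underbrace{\sum_{(i,j)\in\mathcal{E}_{\text{inter}}} W_{ij}(l_i,l_j)}_{\text{inter-document similarity}}
\end{equation}

Under this reading, the unary term $D_i$ scores each sentence against the current foreground/background models, while the two pairwise terms encourage neighboring sentences within a transcript to share labels and lexically similar sentences across the two transcripts to be jointly labeled as foreground; minimizing $E(L)$ with QPBO and BP then yields the refined labeling.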