ISCA Archive Eurospeech 1997

A maximum likelihood model for topic classification of broadcast news

Richard Schwartz, Toru Imai, Francis Kubala, Long Nguyen, John Makhoul

We describe a new algorithm for topic classification that allows discrimination among thousands of topics. A mixture of topics explicitly models the fact that each story has multiple topics, that different words are related to different topics, and that most of the words are not related to any topic. The resulting model, trained by EM, has sharper word distributions, which result in more accurate topic classification. We tested the algorithm on transcribed broadcast news texts. When trained on one year of stories containing over 5,000 different topics and tested on new (later) stories, the first-choice topic was among the manually annotated choices 76% of the time.
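
The mixture idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each word of a training story is generated either by one of the story's annotated topics or by a topic-independent "general language" component, that the mixture weights are uniform during training, and that a test story is scored against one topic at a time. The names `GENERAL`, `train_em`, `classify`, `smoothing`, and `general_weight`, as well as the toy stories, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Hypothetical label for the topic-independent component (an assumption of this sketch).
GENERAL = "<general>"

def train_em(stories, iterations=20, smoothing=0.01):
    """Estimate p(word | topic) by EM.

    stories: list of (words, topics) pairs with manually annotated topics.
    Assumes uniform mixture weights over a story's topics plus GENERAL.
    """
    vocab = sorted({w for words, _ in stories for w in words})
    topics = sorted({t for _, ts in stories for t in ts}) + [GENERAL]
    # Start every component from a uniform word distribution.
    p = {t: {w: 1.0 / len(vocab) for w in vocab} for t in topics}
    for _ in range(iterations):
        counts = {t: defaultdict(float) for t in topics}
        for words, story_topics in stories:
            candidates = list(story_topics) + [GENERAL]
            for w in words:
                # E-step: posterior over which component generated this word.
                weights = [p[t][w] for t in candidates]
                total = sum(weights)
                for t, wt in zip(candidates, weights):
                    counts[t][w] += wt / total
        # M-step: re-estimate each distribution from the expected counts,
        # with additive smoothing so no probability is exactly zero.
        for t in topics:
            denom = sum(counts[t].values()) + smoothing * len(vocab)
            p[t] = {w: (counts[t][w] + smoothing) / denom for w in vocab}
    return p

def classify(words, p, general_weight=0.5):
    """Rank topics by log-likelihood ratio against the general component.

    general_weight is an assumed, fixed mixing weight for GENERAL.
    Words not seen in training are skipped.
    """
    scores = {}
    for t in p:
        if t == GENERAL:
            continue
        score = 0.0
        for w in words:
            if w not in p[GENERAL]:
                continue
            mix = general_weight * p[GENERAL][w] + (1 - general_weight) * p[t][w]
            score += math.log(mix / p[GENERAL][w])
        scores[t] = score
    best = max(scores, key=scores.get)
    return best, scores

# Toy data, invented for illustration only.
stories = [
    ("the president signed the budget bill".split(), ["politics"]),
    ("the team won the final game".split(), ["sports"]),
    ("the president met the team after the game".split(), ["politics", "sports"]),
    ("the senate passed the bill".split(), ["politics"]),
    ("the coach praised the team".split(), ["sports"]),
]
p = train_em(stories)
best, scores = classify("the senate debated the budget".split(), p)
print(best, scores)
```

In this sketch, words that occur across all stories compete with the general component for probability mass, which is one way to picture how topic-specific word distributions can become sharper. The actual model structure, mixture weights, and scoring used in the paper may differ.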