The speech clustering system presented in this paper organizes a database of spoken documents (e.g., audio recordings of newspaper articles) by topic, without requiring a priori knowledge of the subject matter. Each document is represented by a histogram of acoustic features, and document histograms are compared pairwise using a standardized similarity measure. Standard clustering techniques are then applied to group the documents into clusters. The approach is adapted from a system in use in the text processing community.
Keywords: Automatic topic classification of spoken documents, n-gram analysis
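
The pipeline summarized above (histogram representation, pairwise similarity, clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: documents are assumed to be already decoded into sequences of discrete acoustic symbols, the histogram is taken over symbol n-grams, cosine similarity stands in for the unspecified similarity measure, and a simple average-linkage agglomerative procedure with a hypothetical `threshold` parameter stands in for the "standard clustering techniques".

```python
from collections import Counter
from math import sqrt


def ngram_histogram(symbols, n=2):
    """Count n-grams over a sequence of discrete acoustic symbols (assumed input)."""
    return Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))


def cosine_similarity(h1, h2):
    """Cosine similarity between two sparse histograms (an assumed measure)."""
    dot = sum(count * h2[gram] for gram, count in h1.items())
    norm1 = sqrt(sum(c * c for c in h1.values()))
    norm2 = sqrt(sum(c * c for c in h2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def cluster(histograms, threshold=0.5):
    """Average-linkage agglomerative clustering (illustrative choice).

    Repeatedly merges the two most similar clusters until the best
    average similarity falls below the hypothetical `threshold`.
    Returns a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(histograms))]

    def linkage(a, b):
        sims = [cosine_similarity(histograms[i], histograms[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, (x, y) = max(
            (linkage(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best_sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy symbol sequences standing in for decoded spoken documents.
docs = [list("abababab"), list("babababa"), list("cdcdcdcd"), list("dcdcdcdc")]
histograms = [ngram_histogram(d, n=2) for d in docs]
print(cluster(histograms))  # [[0, 1], [2, 3]]
```

In this toy example the first two sequences share the same bigrams, as do the last two, so they fall into two clusters without any topic labels being supplied. A real system would obtain the symbol sequences from an acoustic front end, which this sketch does not attempt to model.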