ISCA Archive ICSLP 2002

Exploring sub-word features and linear support vector machines for German spoken document classification

Martha Larson, Stefan Eickeler, Gerhard Paaß, Edda Leopold, Jörg Kindermann

Using sub-word features for spoken document classification has two potential drawbacks. First, if the speech recognizer recognizes sub-word units directly, there is a risk that word-level discriminative features are irretrievably lost. This effect is aggravated by low recognition accuracy, such as that associated with speaker- and domain-independent systems. Second, if input documents are expanded by combining sub-word units into higher-level features to compensate for the missing word-level discriminators, the classifier input space grows rapidly, inviting the danger of overfitting. This paper reports results of experiments with a simple, but real-world, binary topic classification task on a corpus of unedited German-language radio documents. We compare a Naive Bayes classifier to a Linear Support Vector Machine (LSVM) and determine that the benefits of sub-word features indeed outweigh the potential drawbacks. The LSVM in particular profits from sub-word features supplemented by higher-order combinations, reflecting its ability to control input space complexity independently of dimension.
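As an illustration of the second drawback, the following minimal sketch shows how combining sub-word units into higher-order features enlarges the feature set. It assumes syllable-like units joined into n-grams; the function name `ngram_features`, the `max_order` parameter, and the example syllable sequence are hypothetical and are not taken from the paper's system or corpus.

```python
from collections import Counter
from typing import List


def ngram_features(units: List[str], max_order: int = 3) -> Counter:
    """Count all n-grams of sub-word units for n = 1..max_order.

    Hypothetical helper for illustration; the paper does not specify
    how its higher-order combinations were built.
    """
    feats: Counter = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(units) - n + 1):
            # Join adjacent units into one higher-order feature string.
            feats["_".join(units[i:i + n])] += 1
    return feats


# Made-up syllable sequence, standing in for recognizer output.
syllables = ["bun", "des", "tag", "wahl", "bun", "des", "rat"]

# Number of distinct features when n-grams up to each order are included.
for order in (1, 2, 3):
    print(order, len(ngram_features(syllables, order)))
```

Even on this short sequence the number of distinct features grows with each added order, and on a real corpus the growth is far steeper, which is why a classifier whose capacity does not depend directly on input dimension is attractive here.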