In this paper, we explore the hypothesis that accurately associating semantics with scenes requires processing sequences of such scenes rather than individual snapshots in time. We build on work that represents audio as a sequence of descriptors, each spanning multiple frames, by exploring and comparing different ways of obtaining such a lexicon of descriptors. We then present an extension of this unsupervised learning scheme to video, and report results of experiments on the Multimedia Event Detection 2011 dataset. We find that learning the set of descriptors automatically from data significantly outperforms vector quantization-based systems and systems using library-based descriptors.
Index Terms: multimedia analysis, semantic labels, unsupervised lexicon learning, audiovisual data retrieval