Local audio-visual descriptors are often stored compactly using representations such as the soft-quantization histogram [1]. Classification performance with histogram representations is typically improved through the use of large codeword sets. Unfortunately, this approach runs into overfitting and scalability challenges when applied to richly diverse real-world collections. A novel "i-vector" approach was recently proposed for the speaker-verification task [2]. In this work, we study the relative effectiveness of the i-vector as a compact representation of local audio descriptors (e.g., MFCCs) within a multimedia event detection system. Specifically, we model the local audio descriptors using a Gaussian Mixture Model (GMM). Following [2], we constrain the GMM parameters to a low-dimensional subspace while preserving most of the variability (i.e., information) in the descriptors. The GMM parameters in the subspace constitute a compact representation that is robust in the face of sparse data. We evaluate the method by performing the multimedia event detection (MED) task using only the audio information within consumer (e.g., YouTube) videos. Experiments with the 2011 TRECVID MED data show that the i-vector provides superior performance at lower dimensionality than the bag-of-words soft-quantization histograms used in the state-of-the-art BBN VISER system in the 2011 TRECVID MED evaluation.
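The pipeline described above can be sketched in simplified form: fit a GMM (universal background model) on pooled frame-level descriptors, derive a per-video mean supervector from Baum-Welch statistics, and project it onto a low-dimensional subspace. This is an illustrative sketch only: the synthetic data, component counts, and dimensions are arbitrary, and PCA is used here as a crude stand-in for the total-variability factor analysis of [2], which is estimated quite differently.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for MFCC frames: each "video" is (n_frames, n_ceps).
videos = [rng.normal(size=(200, 13)) + rng.normal(scale=2.0, size=13)
          for _ in range(20)]

# 1) Train a small UBM (GMM) on frames pooled across all videos.
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(np.vstack(videos))

def supervector(frames, ubm):
    """Zeroth/first-order sufficient statistics -> centered mean supervector."""
    post = ubm.predict_proba(frames)        # (T, C) component responsibilities
    N = post.sum(axis=0)                    # (C,) zeroth-order stats
    F = post.T @ frames                     # (C, D) first-order stats
    means = F / (N[:, None] + 1e-8)         # per-component adapted means
    return (means - ubm.means_).ravel()     # centered supervector, length C*D

X = np.array([supervector(v, ubm) for v in videos])   # (20, 8*13)

# 2) Constrain to a low-dimensional subspace. PCA stands in here for the
#    total-variability matrix learned by factor analysis in the true i-vector.
pca = PCA(n_components=5)
compact = pca.fit_transform(X)              # compact i-vector-like representation
print(compact.shape)                        # (20, 5)
```

The key point the sketch illustrates is dimensionality: each video collapses from thousands of frame-level descriptors to a single short vector, which is the property the abstract contrasts with large-codeword histograms.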
[1] J. C. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J. M. Geusebroek, "Visual word ambiguity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1271–1283, 2010.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
Index Terms: multimedia event detection, factor analysis