In this paper, we propose an integrated approach for leveraging available spoken content when detecting events in consumer-generated multimedia data (i.e., YouTube videos). Spoken content in consumer videos presents several challenges. For example, unlike Broadcast News, the spoken audio is typically not labeled. In addition, the audio track in consumer videos tends to be noisy, and the spoken content is often sporadic. Here, we describe three recent improvements specifically targeted at overcoming these challenges: robust data-driven keyword selection, automatic discovery of word classes for keyword expansion, and a keyword spotting approach for improving recall in noisy conditions. These improvements were integrated into the audio analysis component of the BBN VISER system, which demonstrated top performance in the 2011 TRECVID Multimedia Event Detection (MED) task. Experimental results on the 2011 TRECVID MED task clearly demonstrate the effectiveness of all three improvements.
Index Terms: multimedia event detection, keyword selection, keyword expansion, keyword spotting.