Laughter can not only convey the affective state of the speaker but also be perceived differently based on the context in which it is used. In this paper, we focus on detecting laughter in adults’ speech using the MAHNOB laughter database. The paper explores the use of novel long-term acoustic features to capture the periodic nature of laughter and the use of computer vision-based smile features to analyze laughter. Using leave-one-speaker-out cross-validation with a cost-sensitive learning approach and a random forest classifier with 100 trees, the accuracy for detecting laughter in adults’ speech was 93.06% with acoustic features alone. With only the visual features, the accuracy was 89.48%. Early fusion of the audio and visual features improved the accuracy by an absolute 3.79% over acoustic features alone, to 96.85%. The results indicate that the novel acoustic features capture the repetitive characteristics of laughter, and that the vision-based smile features provide complementary visual cues for discriminating between speech and laughter. A significant finding of the study is that audio-visual fusion not only improves accuracy but also reduces false positives.
Index Terms: paralinguistics, laughter, syllable-level features, smile, multi-modal
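
As a rough illustration of the evaluation protocol summarized above, and not the paper's exact implementation, the sketch below shows leave-one-speaker-out cross-validation with a cost-sensitive 100-tree random forest over early-fused (concatenated) audio and visual features. The feature arrays, speaker IDs, labels, and class-weighting scheme are all placeholders.

```python
# Illustrative sketch (assumed setup, not the authors' pipeline):
# leave-one-speaker-out CV with a cost-sensitive random forest
# over early-fused (concatenated) acoustic and smile features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder data: per-segment features, binary labels
# (1 = laughter, 0 = speech), and a speaker ID per segment.
rng = np.random.default_rng(0)
X_audio = rng.normal(size=(300, 40))      # hypothetical long-term acoustic features
X_visual = rng.normal(size=(300, 10))     # hypothetical smile features
y = rng.integers(0, 2, size=300)          # laughter vs. speech labels
speakers = rng.integers(0, 12, size=300)  # speaker ID per segment

# Early fusion: concatenate feature vectors before classification.
X = np.hstack([X_audio, X_visual])

logo = LeaveOneGroupOut()
accuracies = []
for train_idx, test_idx in logo.split(X, y, groups=speakers):
    # Cost-sensitive learning approximated here via class weights.
    clf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    )
    clf.fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print(f"Leave-one-speaker-out accuracy: {np.mean(accuracies):.2%}")
```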