ISCA Archive Interspeech 2020

Paralinguistic Classification of Mask Wearing by Image Classifiers and Fusion

Jeno Szep, Salim Hariri

In this study, we address the ComParE 2020 Paralinguistics Mask sub-challenge, where the task is to detect whether a speaker is wearing a surgical mask from short speech segments. We propose a computer-vision-based pipeline that applies deep convolutional neural network image classifiers developed in recent years to a specific class of spectrograms. Several linear- and logarithmic-scale spectrograms were tested, and the best performance was achieved on linear-scale, 3-channel spectrograms created from the audio segments. A single-model image classifier provided a 6.1% better result than the best single-dataset baseline model. The ensemble of our models further improves accuracy: it achieves 73.0% UAR when trained on the ‘train’ dataset alone, and reaches 80.1% UAR on the test set when training also includes the ‘devel’ dataset, which is 8.3% higher than the baseline. We also provide an activation-mapping analysis to identify the frequency ranges that are critical in the ‘mask’ versus ‘clear’ classification.
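The results above are reported as UAR (unweighted average recall), which is the mean of the per-class recalls, so the ‘mask’ and ‘clear’ classes count equally regardless of how many segments each contains. The following is a minimal sketch of that metric only; the function name and the toy labels are illustrative assumptions and are not taken from the challenge data or from the authors' code.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (UAR).

    Hypothetical helper for illustration; classes are taken from y_true.
    """
    total = defaultdict(int)    # number of reference samples per class
    correct = defaultdict(int)  # number of correctly predicted samples per class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy labels (made up for this example): 4 'mask' and 2 'clear' segments.
y_true = ["mask", "mask", "mask", "mask", "clear", "clear"]
y_pred = ["mask", "mask", "mask", "clear", "clear", "mask"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the recall is 3/4 for ‘mask’ and 1/2 for ‘clear’, giving a UAR of 0.625, while plain accuracy is 4/6; the two measures diverge whenever the classes are imbalanced.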