ISCA Archive Interspeech 2025

Training Onset-and-Offset-Aware Sound Event Detection on a Heterogeneous Dataset via Probabilistic Sequential Modeling

Tomoya Yoshinaga, Yoshiaki Bando, Keitaro Tanaka, Keisuke Imoto, Masaki Onishi, Shigeo Morishima

This paper presents a neural method for training onset-and-offset-aware sound event detection (SED) on heterogeneously labeled data. SED models are typically trained to predict frame-wise event activities, which fluctuate over time and yield unstable event boundaries. An end-to-end (E2E) method based on a hidden semi-Markov model (HSMM) has been proposed to improve performance by converting frame-wise predictions into event boundaries. This method, however, relies on temporal (strong) labels, which are costly to annotate. To overcome this limitation, we propose an E2E method that trains an HSMM-based model from clip-level labels and unlabeled data. Whereas the strongly supervised objective maximizes event-wise posterior probabilities, we derive probabilistic objectives for such incompletely labeled data. Experimental results on the DESED dataset show that our method outperforms standard frame-wise methods.
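
As a minimal illustration of the boundary instability mentioned above, the sketch below thresholds a short sequence of frame-wise activity probabilities and reads off the resulting event segments. It is not the paper's method: the function name `frames_to_events`, the threshold of 0.5, and the probability values are all assumptions chosen for illustration. It only shows how a single fluctuating frame can split one event into two under plain frame-wise thresholding.

```python
def frames_to_events(probs, threshold=0.5):
    """Convert frame-wise activity probabilities into (onset, offset) frame pairs.

    Hypothetical helper for illustration; offsets are exclusive frame indices.
    """
    events = []
    onset = None
    for t, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            # Event starts at the first frame at or above the threshold.
            onset = t
        elif not active and onset is not None:
            # Event ends at the first frame that falls below the threshold.
            events.append((onset, t))
            onset = None
    if onset is not None:
        # Close an event that is still active at the end of the clip.
        events.append((onset, len(probs)))
    return events


# Made-up probabilities: one dip (0.45 at frame 3) inside an otherwise active region.
probs = [0.1, 0.7, 0.8, 0.45, 0.9, 0.85, 0.2, 0.1]
events = frames_to_events(probs)
print(events)
```

With these values, the dip at frame 3 causes the active region spanning frames 1 to 5 to be reported as two separate events, which is the kind of fragmentation that modeling event durations and boundaries directly is meant to avoid.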