ISCA Archive Interspeech 2022
ISCA Archive Interspeech 2022

Temporal coding with magnitude-phase regularization for sound event detection

Sangwook Park, Sandeep Reddy Kothinti, Mounya Elhilali

Sound Event Detection (SED) is the challenge of identifying sound events into their temporal boundaries as well as sound category. With recent advances in deep learning, more effective SED techniques are investigated through the annual challenge of Detection and Classification of Acoustic Scenes and Events (DCASE). Most SED systems rely on data-driven learning where a deep neural network is trained to minimize the error between model prediction and the truth. While this framework is generally effective at identifying sound classes present in an audio recording, it results in unreliable estimates of temporal information for identifying sound boundaries. In order to heighten the temporal precision, this paper proposes a novel temporal coding of magnitude and phase for embedding vectors in an intermediate layer. This coding is reflected as a regularization term in the objective function for training the model. The regularization allows magnitude of embedding vectors to increase near event boundaries, which represent the onset and offset points. Simultaneously, each of the boundaries are distinguishable from others using phase difference between two neighboring vectors. This approach results in notable improvement in timing sensitivity compared to a baseline system tested on SED task in the context of DCASE2021 challenge.