ISCA Archive Interspeech 2024

Neural Compression Augmentation for Contrastive Audio Representation Learning

Zhaoyu Wang, Haohe Liu, Harry Coppock, Björn Schuller, Mark D. Plumbley

The choice of data augmentation is pivotal in contrastive self-supervised learning. Current augmentation techniques for audio data, such as the widely used Random Resize Crop (RRC), underperform on pitch-sensitive music tasks and generalise poorly across different types of audio. This study addresses these limitations by introducing Neural Compression Augmentation (NCA), an approach based on lossy neural compression. We use Audio Barlow Twins (ABT), a contrastive self-supervised framework for audio, as our backbone. We experiment with both NCA and several baseline augmentation methods in the augmentation block of ABT and train the models on AudioSet. Experimental results show that models integrated with NCA considerably outperform the original ABT, especially on the music tasks of the HEAR benchmark, demonstrating the effectiveness of compression-based augmentation for audio contrastive self-supervised learning.
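
To make the idea concrete, below is a minimal sketch of compression-as-augmentation: a waveform is passed through a lossy neural codec (encode then decode), and the reconstruction serves as an augmented view for the contrastive framework. This sketch assumes Meta's EnCodec as the codec and a 3 kbps target bandwidth purely for illustration; the codec, bitrate, and exact integration into ABT's augmentation block used in the paper may differ.

```python
# Sketch of neural-compression augmentation; assumes the `encodec` package.
import torch
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load a pretrained 24 kHz codec and set a low target bandwidth (kbps).
# Heavier compression introduces stronger codec artefacts, i.e. a stronger augmentation.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(3.0)  # hypothetical setting for illustration


def neural_compression_augment(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Encode and decode a waveform through the lossy codec to create an augmented view.

    wav: tensor of shape [channels, samples] at sample rate `sr`.
    Returns the lossy reconstruction at the codec's sample rate.
    """
    x = convert_audio(wav, sr, codec.sample_rate, codec.channels)
    x = x.unsqueeze(0)  # add batch dimension -> [1, channels, samples]
    with torch.no_grad():
        encoded = codec.encode(x)      # list of (codes, scale) frames
        recon = codec.decode(encoded)  # lossy reconstruction
    return recon.squeeze(0)


# In a Barlow-Twins-style pipeline, the two views of a clip could then be formed,
# for example, from the original waveform and its compressed reconstruction.
```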