Universal audio representation learning aims to obtain foundational models that are useful for diverse tasks involving speech, music, and environmental sounds. To achieve this, methods inspired by self-supervised models from NLP (e.g., BERT) and vision, such as masked autoencoders (MAE), are often adapted to the audio domain. In this work, we explore the use of EnCodec, a neural audio codec, to generate discrete targets for a MAE-based universal audio model. We evaluate our approach, EnCodecMAE, across various tasks and find that, on average, it outperforms state-of-the-art audio representation models. Moreover, we analyze the impact of several factors on downstream performance, concluding that increasing model size improves performance, that the optimal input representation depends on the type of task, that self-training is beneficial, and that diversity in the training dataset is essential to achieve good performance across different audio tasks.
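To make the core idea concrete, below is a minimal sketch of using EnCodec's quantizer outputs as discrete prediction targets for a masked-modeling objective. It uses the public `encodec` package's documented API; the transformer encoder, masking ratio, stand-in input features, and per-codebook prediction heads are illustrative placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

from encodec import EncodecModel

# 1) Extract discrete targets with a pretrained EnCodec model.
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks per frame

wav = torch.randn(1, 1, 24000)  # 1 s of dummy mono audio at 24 kHz
with torch.no_grad():
    frames = codec.encode(wav)  # list of (codes, scale) per segment
codes = torch.cat([c for c, _ in frames], dim=-1)  # [B, n_q, T] token ids

# 2) Mask a fraction of the frames and predict their discrete codes.
B, n_q, T = codes.shape
d_model, vocab = 256, 1024  # each EnCodec codebook has 1024 entries
mask = torch.rand(B, T) < 0.5  # boolean mask over frames

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
inputs = torch.randn(B, T, d_model)  # stand-in for masked input features
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_q))

hidden = encoder(inputs)  # [B, T, d_model]
# Cross-entropy at masked positions, summed over the n_q codebooks.
loss = sum(
    nn.functional.cross_entropy(heads[q](hidden)[mask], codes[:, q][mask])
    for q in range(n_q)
)
loss.backward()
```

Because the targets are frozen codec tokens rather than reconstructed waveforms, the model learns to predict compact discrete summaries of the masked audio, which is the property EnCodecMAE exploits for universal representations.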