Contrastive self-supervised learning has seen great success in computer vision but has been less investigated in audio processing, in particular for depression detection, a socially critical challenge. Detecting depression from a person's speech has been examined with various audio representations, including combinations of acoustic features and model-based representations. This paper proposes to obtain depression-related audio representations by contrasting speech with reference features from an emotion recognition model. Furthermore, we propose reference-enhanced contrastive learning (ReCLR) to select fine-grained positive instances and to weight negative instances. The depression detection results indicate that contrastive learning is effective for this audio task, and our ReCLR strategy outperforms contrastive training without references.
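The abstract describes ReCLR only at a high level. As an illustration, the minimal PyTorch sketch below shows one way reference features could drive fine-grained positive selection and negative weighting inside an InfoNCE-style loss. The function name `reclr_loss`, the threshold `pos_threshold`, and the specific weighting scheme are assumptions made for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def reclr_loss(anchors, candidates, references,
               temperature=0.1, pos_threshold=0.8):
    """Hypothetical sketch of a reference-weighted contrastive loss.

    anchors:    (B, D) embeddings of speech segments
    candidates: (B, D) embeddings of the paired (augmented) views
    references: (B, D) emotion-model features; their similarities are
                assumed here to select positives and weight negatives.
    """
    anchors = F.normalize(anchors, dim=-1)
    candidates = F.normalize(candidates, dim=-1)
    references = F.normalize(references, dim=-1)

    # Pairwise similarities between anchors and all candidate views.
    sim = anchors @ candidates.t() / temperature      # (B, B)

    # Reference similarity decides which in-batch pairs count as
    # fine-grained positives; the paired view is always a positive.
    ref_sim = references @ references.t()             # (B, B)
    pos_mask = (ref_sim >= pos_threshold).float()
    pos_mask.fill_diagonal_(1.0)

    # Negatives whose reference features disagree more with the anchor
    # receive larger weight (an assumed weighting choice).
    neg_weight = (1.0 - ref_sim).clamp(min=0.0) * (1.0 - pos_mask)

    exp_sim = sim.exp()
    pos_term = (exp_sim * pos_mask).sum(dim=1)
    neg_term = (exp_sim * neg_weight).sum(dim=1)
    loss = -torch.log(pos_term / (pos_term + neg_term + 1e-8))
    return loss.mean()
```

Under these assumptions, the loss reduces to standard InfoNCE when `pos_threshold` is set above 1 (only the diagonal positives survive) and all negative weights are 1, which makes the reference-free baseline mentioned in the abstract a special case of the sketch.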