Cover song identification~(CSI) is a challenging task and an important topic in the music information retrieval~(MIR) community. In recent years, CSI has been studied extensively with deep learning methods. In this paper, we propose a novel CSI framework based on a joint representation learning method inspired by multi-task learning. Specifically, we propose a joint learning strategy that combines classification and metric learning to optimize a WideResNet-based cover song model, called LyraC-Net. The classification objective learns embeddings that are separable across classes, while metric learning optimizes embedding similarity by decreasing the intra-class distance and increasing the inter-class separability. This joint optimization strategy is expected to learn a more robust cover song representation than methods with a single training objective. For metric learning, a prototypical network is introduced together with triplet loss to stabilize and accelerate the training process. Furthermore, we introduce SpecAugment, a popular augmentation method in speech recognition, to further improve performance. Experimental results show that our proposed method achieves promising results and outperforms other recent CSI methods in our evaluations.
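As an illustration of how a classification objective can be combined with triplet and prototypical metric-learning objectives, the following is a minimal sketch in plain Python. It is not the LyraC-Net implementation: the function names, the squared Euclidean distance, the margin value, and the loss weights `w_cls`, `w_tri`, and `w_proto` are all assumptions made for this example, and the embeddings are plain lists of floats rather than network outputs.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example from raw class logits (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is a hypothetical default, not taken from the paper.
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def prototypical_loss(query, label, support):
    """Classify a query by its distance to per-class mean embeddings.

    support maps a class id to a list of embeddings of that class.
    """
    classes = sorted(support)
    prototypes = [
        [sum(col) / len(support[c]) for col in zip(*support[c])]
        for c in classes
    ]
    # Closer prototypes receive larger logits.
    logits = [-sq_dist(query, p) for p in prototypes]
    return softmax_cross_entropy(logits, classes.index(label))


def joint_loss(logits, label, anchor, positive, negative, support,
               w_cls=1.0, w_tri=1.0, w_proto=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return (
        w_cls * softmax_cross_entropy(logits, label)
        + w_tri * triplet_loss(anchor, positive, negative)
        + w_proto * prototypical_loss(anchor, label, support)
    )
```

In this sketch the triplet term pulls versions of the same song together and pushes other songs away, while the prototypical term compares each embedding against a class mean instead of individual samples, which is one plausible reading of why it could stabilize training.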