Singing melody extraction (SME) is an important task in music information retrieval (MIR). In this paper, we propose a spectrum-based SME model and a joint network that combines it with a pre-trained model. In the joint network, we design an attention aggregation module (AAM), consisting of cross-attention (CA) and adaptive decision fusion (ADF), to effectively fuse the intermediate features of the two models. Furthermore, we introduce a self-consistency training strategy that uses hard and soft labels to supervise the two models separately, so that each better captures SME task-relevant information. Experimental results show that our proposed Joint Network outperforms six state-of-the-art methods, achieving overall accuracy (OA) scores of 91.6%, 92.5%, and 78.9% on the ADC2004, MIREX 05, and MedleyDB datasets, respectively. Visualizations further show that the Joint Network reduces octave errors and melody detection errors.
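To make the described fusion concrete, the following is a minimal PyTorch sketch of an attention aggregation module that pairs cross-attention between two feature streams with a gated, adaptive fusion of the resulting decisions. The class name, feature dimensions, and gating design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class AttentionAggregationModule(nn.Module):
    """Hypothetical sketch of an AAM: cross-attention (CA) between the two
    branches' intermediate features, followed by an adaptive decision fusion
    (ADF) step realized here as a learned per-element gate."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # CA: each branch attends to the other branch's features.
        self.ca_pre = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ca_spec = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # ADF (assumed form): a sigmoid gate weighting the two attended streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_pre: torch.Tensor, feat_spec: torch.Tensor) -> torch.Tensor:
        # feat_pre / feat_spec: (batch, time, dim) features from the
        # pre-trained and spectrum-based branches, respectively.
        pre_attn, _ = self.ca_pre(feat_pre, feat_spec, feat_spec)    # pre-trained queries spectrum
        spec_attn, _ = self.ca_spec(feat_spec, feat_pre, feat_pre)   # spectrum queries pre-trained
        g = self.gate(torch.cat([pre_attn, spec_attn], dim=-1))      # adaptive fusion weight
        return g * pre_attn + (1.0 - g) * spec_attn                  # gated combination


if __name__ == "__main__":
    aam = AttentionAggregationModule(dim=128)
    a = torch.randn(2, 100, 128)  # pre-trained branch features (batch, time, dim)
    b = torch.randn(2, 100, 128)  # spectrum branch features
    fused = aam(a, b)             # fused features, shape (2, 100, 128)
```

The gate lets the network weight each branch differently per time frame and channel, which is one plausible reading of "adaptive decision fusion"; the actual module may fuse at the decision (pitch-salience) level instead.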