This paper explores the wav2vec 2.0 model for dialect identification, focusing on the impact of the back-end network during fine-tuning. Prior research has typically used wav2vec 2.0 as a frame-level feature extractor, followed by a simple back-end consisting of a pooling layer and a fully connected layer. In contrast, we employ multi-scale aggregation and a graph neural network to design a more sophisticated back-end that implicitly exploits phoneme sequence information and significantly improves system performance. We evaluate our system on the dialect identification task of the Oriental Language Recognition Challenge 2020 (AP20-OLR). Experimental results demonstrate that our system outperforms the state-of-the-art baseline by a relative reduction of 50% in average cost (Cavg). We also verify the effectiveness of our proposed back-end network, which yields a relative reduction of 54% in Cavg. Our findings highlight the importance of incorporating a more effective back-end network for improved dialect identification performance when fine-tuning the wav2vec 2.0 model.
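The paper provides no code; the following PyTorch-style sketch only illustrates the contrast the abstract draws between the conventional simple back-end (temporal pooling plus one fully connected layer) and a multi-scale-aggregation back-end that pools features from several wav2vec 2.0 layers. The class names, the 768-dimensional feature size, and the number of dialect classes are illustrative assumptions, not details taken from the paper, and the graph neural network component of the proposed back-end is not sketched because the abstract does not describe its structure.

import torch
import torch.nn as nn

class SimpleBackEnd(nn.Module):
    # Conventional back-end: average pooling over frames + one fully connected layer.
    def __init__(self, feat_dim=768, num_dialects=6):   # placeholder dimensions, not from the paper
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_dialects)

    def forward(self, frame_feats):                      # (batch, time, feat_dim) wav2vec 2.0 frames
        utt_embedding = frame_feats.mean(dim=1)          # temporal average pooling -> utterance vector
        return self.classifier(utt_embedding)            # dialect logits

class MultiScaleBackEnd(nn.Module):
    # Sketch of multi-scale aggregation: pool frame features taken from several
    # transformer layers and concatenate the pooled vectors before classification.
    def __init__(self, feat_dim=768, num_layers=3, num_dialects=6):
        super().__init__()
        self.classifier = nn.Linear(feat_dim * num_layers, num_dialects)

    def forward(self, layer_feats):                      # list of (batch, time, feat_dim) tensors
        pooled = [f.mean(dim=1) for f in layer_feats]    # one pooled vector per selected layer
        return self.classifier(torch.cat(pooled, dim=-1))

# Example usage with random tensors standing in for wav2vec 2.0 frame features.
frames = torch.randn(4, 200, 768)
print(SimpleBackEnd()(frames).shape)                     # torch.Size([4, 6])
print(MultiScaleBackEnd()([frames, frames, frames]).shape)  # torch.Size([4, 6])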