This paper introduces a cross-attention-guided WaveNet with a coarse-to-fine granularity strategy for reconstructing detailed Mel spectrograms from time-domain EEG signals. The proposed model uses WaveNet to reconstruct, in sequence, the envelope, 10-band Mel, 80-band Mel, and magnitude at progressively finer granularity levels. A cross-attention mechanism explores correlations across modalities to bridge the modality gap. A combined loss function and Mixup augmentation further improve reconstruction performance. Our approach achieves Pearson correlations of 0.0651 ± 0.0153 on the validation set and 0.0413 ± 0.0169 on the held-out-subjects test set, placing second in the 2024 Auditory EEG Challenge. Ablation experiments validate the contribution of each module. The source code is available online.
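For readers unfamiliar with two ingredients named above, the following is a minimal sketch of the Pearson correlation used as the evaluation metric and of Mixup augmentation applied to paired EEG and Mel sequences. It is an illustration under stated assumptions, not the released implementation: the function names `pearson` and `mixup`, the flat list representation of the signals, and the default `alpha=0.2` are hypothetical choices, and the abstract does not say how the correlation is aggregated over Mel bands or how the mixing weight is sampled.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A constant sequence has no defined correlation; return 0.0 by convention.
    return cov / denom if denom > 0 else 0.0


def mixup(eeg_a, mel_a, eeg_b, mel_b, alpha=0.2, rng=random):
    """Blend two (EEG, Mel) pairs with one shared weight lam ~ Beta(alpha, alpha).

    Assumption: the same weight is applied to the input and to the target,
    as in standard Mixup; alpha=0.2 is an illustrative default.
    """
    lam = rng.betavariate(alpha, alpha)
    eeg = [lam * a + (1 - lam) * b for a, b in zip(eeg_a, eeg_b)]
    mel = [lam * a + (1 - lam) * b for a, b in zip(mel_a, mel_b)]
    return eeg, mel, lam
```

In practice the correlation would be computed between a reconstructed and a reference Mel band over time, and the per-band values then averaged; that aggregation is likewise an assumption here.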