Spoken language understanding (SLU) is a critical task in task-oriented dialogue systems. However, automatic speech recognition (ASR) errors often degrade understanding performance. Although many previous models have achieved promising results in improving the ASR robustness of SLU, most of them treat clean manual transcripts and ASR transcripts equally during the fine-tuning stage. To tackle this issue, in this paper we propose a novel method termed C²A-SLU. Specifically, we compute cross attention and add it to the original hidden states, and we apply contrastive attention that compares the input transcript with clean manual transcripts to distill contrastive information, which better captures the distinctive features of ASR transcripts. Experiments on three datasets show that C²A-SLU surpasses existing models and achieves new state-of-the-art performance, with a relative accuracy improvement of 3.4% over the previous best model on the SLURP dataset.
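To make the two attention operations concrete, the following is a minimal sketch of one plausible reading of the abstract, assuming standard scaled dot-product attention; the shapes, variable names (`h_asr`, `h_clean`), and the residual/subtractive combinations are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    # Standard scaled dot-product attention: each query position
    # attends over the key/value sequence.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)   # (L_q, L_k)
    return softmax(scores) @ value        # (L_q, d)

# Hypothetical shapes: sequence length L, hidden size d.
L_asr, L_clean, d = 8, 8, 16
rng = np.random.default_rng(0)
h_asr = rng.normal(size=(L_asr, d))      # hidden states of the ASR transcript
h_clean = rng.normal(size=(L_clean, d))  # hidden states of the clean manual transcript

# Cross attention: enrich the ASR hidden states with clean-transcript
# context, added residually to the original hidden states.
h_cross = h_asr + cross_attention(h_asr, h_clean, h_clean)

# Contrastive attention (one plausible interpretation): remove the part
# of each ASR state that the clean transcript explains, leaving what is
# distinctive to the ASR transcript, e.g. recognition errors.
h_contrast = h_asr - cross_attention(h_asr, h_clean, h_clean)
```

In this sketch the residual branch injects clean-transcript information into the ASR representation, while the subtractive branch isolates ASR-specific features; how the two branches are actually fused in C²A-SLU is not specified by the abstract.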