The accuracy and reliability of many speech processing systems may deteriorate under noisy conditions. This paper studies robust audio anti-spoofing countermeasures for audio in noisy environments. First, we use a pre-trained speech enhancement model as a front-end module to build a cascaded system. However, because the enhancement model denoises independently of the anti-spoofing task, it may distort the synthesis artifacts or other anti-spoofing-related information contained in utterances, leading to performance degradation. Therefore, we propose a new framework for robust audio anti-spoofing that jointly trains an integrated speech enhancement front-end and anti-spoofing back-end. The final results demonstrate that the joint training framework is more effective than the cascaded framework. Additionally, we propose a cross-joint training scheme, which allows single-model performance to exceed that of score-level fusion, making the joint framework more effective and efficient.