Joint training of speech enhancement and ASR enables robust recognition in noisy environments. However, most such models simply cascade the two stages, and the information in the noisy speech is not reused before the ASR input, so the features fed to the ASR module suffer severe distortion. To address this distortion at its source, we propose a CSE network that denoises the noisy speech by combining masking and mapping in the complex domain. We further propose CAF, which re-extracts the original speech features from the noisy speech with a coarse-grained attention mechanism and deeply fuses them with the enhanced speech features. In addition, to bring the output space of CAF closer to the input space expected by the ASR module, we compute a loss for CAF against the multi-layer outputs of a pretrained model. Our models are trained and tested on a dataset generated from AISHELL-1 and DNS3. Experimental results show that our model achieves a CER of 13.425 at an SNR of 0 dB and a CER of 20.671 at an SNR of -5 dB, with a robustness of 93.869% on a dataset generated from AISHELL-2 and MUSAN.