By incorporating lip movements, audio-visual speech recognition (AVSR) can effectively improve recognition accuracy in noisy environments and slightly improve it in quiet environments. We use a frequency-domain attention based residual network (FcaNet) as the visual front-end module, which extracts features more useful to the AVSR and visual speech recognition (VSR) systems at a small additional cost, and we use the powerful pre-trained speech model HuBERT as the audio front-end for ASR. We compare the impact of different models as the visual back-end and fusion modules on the AVSR system. Our experiments show that the choice of model for the fusion module is critical to the performance of the AVSR system. Ultimately, our proposed model achieves state-of-the-art results on the audio-visual speech recognition task on the LRS2 dataset.
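Because the fusion module proved decisive in our experiments, a minimal sketch of one candidate fusion design may help fix ideas: concatenating frame-aligned HuBERT audio features with the visual front-end's features and projecting them with a small MLP. The class name, feature dimensions, and the concat-plus-MLP design below are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class FusionModule(nn.Module):
    """Illustrative audio-visual fusion: concatenate per-frame features, then project.

    This is a sketch of one simple fusion variant; the paper compares
    several fusion models, and this is not claimed to be the best one.
    """

    def __init__(self, audio_dim: int, visual_dim: int, model_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, model_dim),
            nn.ReLU(),
            nn.Linear(model_dim, model_dim),
        )

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim).
        # Assumes the two streams have already been aligned to a common frame rate.
        fused = torch.cat([audio_feats, visual_feats], dim=-1)
        return self.proj(fused)


# Example: fuse 768-d HuBERT frames with 512-d visual frames into a 512-d stream
# that a recognition back-end could consume (dimensions are assumptions).
fusion = FusionModule(audio_dim=768, visual_dim=512, model_dim=512)
out = fusion(torch.randn(2, 100, 768), torch.randn(2, 100, 512))
print(out.shape)  # torch.Size([2, 100, 512])
```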