The CHiME-6 dataset presents a difficult task, with extreme speech overlap, severe noise, and a natural conversational speaking style. A distinct word error rate (WER) gap exists between the audio recorded by the distant microphone arrays and by the individual headset microphones. The official baseline exhibits a WER gap of approximately 10%, even though guided source separation (GSS) achieves a considerable WER reduction. In this paper, we integrate an improved GSS with a strong automatic speech recognition (ASR) back-end, which narrows the WER gap and substantially improves ASR performance. Specifically, the proposed GSS is initialized with masks from data-driven deep learning models, exploits spectral information, and performs a selection of the input channels. For back-end modeling, we propose a data augmentation technique based on random channel selection, together with deep convolutional neural network-based multi-channel acoustic models. In our experiments, the framework reduces the WER to 34.78%/36.85% on the CHiME-6 development/evaluation sets. Moreover, the gap between the distant and headset audio narrows to 0.89%/4.67%. This framework also forms the foundation of the IOA's submission to the CHiME-6 challenge, which ranked among the top systems.
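To make the random-channel-selection augmentation mentioned above concrete, the following is a minimal sketch: each time a training utterance is drawn, one array channel is picked at random, so the acoustic model sees more spatial and room variation across epochs. The function name, the uniform sampling, and the (channels, samples) array layout are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def random_channel_select(multi_ch_audio: np.ndarray,
                          rng: np.random.Generator) -> np.ndarray:
    """Pick one microphone channel at random from a (channels, samples) array.

    A hypothetical sketch of random-channel-selection augmentation: drawing a
    different channel on each epoch exposes the model to varied microphone
    positions without adding new recordings.
    """
    n_channels = multi_ch_audio.shape[0]
    ch = rng.integers(n_channels)  # uniform over channels (an assumption)
    return multi_ch_audio[ch]

# Dummy usage on synthetic data standing in for a 4-channel array recording.
rng = np.random.default_rng(0)
utterance = rng.standard_normal((4, 16000))  # 4 channels, 1 s at 16 kHz
single_channel = random_channel_select(utterance, rng)
```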