End-to-end (E2E) models have significantly advanced automatic speech recognition (ASR), with hybrid architectures that combine Connectionist Temporal Classification (CTC) and attention-based encoder-decoder (AED) models demonstrating superior performance. However, AED architectures, particularly the Conformer, face notable challenges with long-form speech: performance degrades markedly for audio exceeding 25 seconds. In this study, we propose improving the Conformer's robustness for long-form ASR by applying Gaussian masking to the cross-attention mechanism of the Transformer decoder during inference, using the aligned positions obtained from the CTC prefix score. The proposed method achieves an error reduction rate (ERR) of 88.41% (from 26.41% to 3.06%) for audio longer than 20 seconds on a LibriSpeech evaluation set constructed by concatenating three utterances. Moreover, the method remains effective with either a Transformer or an E-Branchformer encoder.
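The core idea, biasing decoder cross-attention with a Gaussian window centered at the CTC-aligned encoder frame, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the `sigma` value, and the uniform-logit example are assumptions.

```python
import numpy as np

def gaussian_masked_attention(scores, center, sigma=3.0):
    """Apply a log-domain Gaussian mask to cross-attention logits,
    centered at the CTC-aligned encoder frame, then softmax.

    scores : (T,) raw cross-attention logits for one decoder step
    center : encoder frame aligned to the current token (e.g., from
             the CTC prefix score); value here is illustrative
    sigma  : window width, a hypothetical tunable hyperparameter
    """
    positions = np.arange(len(scores), dtype=np.float64)
    # Adding the log of a Gaussian to the logits is equivalent to
    # multiplying the post-softmax weights by the Gaussian window
    # and renormalizing, so distant frames are smoothly suppressed.
    log_mask = -((positions - center) ** 2) / (2.0 * sigma ** 2)
    biased = scores + log_mask
    weights = np.exp(biased - biased.max())  # stable softmax
    return weights / weights.sum()

# Example: uniform logits over 100 encoder frames; the mask
# concentrates attention around the aligned frame (frame 60 here).
attn = gaussian_masked_attention(np.zeros(100), center=60)
```

Because the mask is applied only at inference, the trained model is unchanged; the Gaussian simply restricts where the decoder may attend, which is what keeps attention from drifting on long inputs.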