To achieve efficient feature fusion, existing research tends to employ cross-attention to control the contributions of different modalities during fusion. However, this inevitably incurs high computational cost and introduces noisy weights through redundant computation. Therefore, this paper proposes sliding window attention (SliWa) to control the feature perception range and dynamically model modality fusion at different granularities. In addition, we present a novel feature map classifier (FMC) based on high-response feature reuse (HRFR), which explicitly preserves the structure of deep emotional features, preventing crucial classification information from being submerged after average flattening and mitigating the negative impact of parameter flooding. We unify these modules in the SWRR framework, and experimental results on the widely used IEMOCAP and CMU-MOSEI datasets demonstrate that SWRR improves emotion-recognition performance.
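To make the core idea of restricting the feature perception range concrete, the following is a minimal single-head sketch of sliding-window attention. It is not the paper's SliWa implementation: the window size, shapes, and the banded mask are illustrative assumptions, showing only how limiting each position to a local window reduces the attended range compared with full cross-attention.

```python
import numpy as np

def sliding_window_attention(q, k, v, window=1):
    """Single-head attention where each query position attends only to
    key positions within +/- `window` steps (a banded attention mask).
    q, k, v: arrays of shape (seq_len, dim)."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    # Mask out positions outside the sliding window before softmax.
    idx = np.arange(seq_len)
    outside = np.abs(idx[:, None] - idx[None, :]) > window
    scores[outside] = -np.inf
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
out, w = sliding_window_attention(x, x, x, window=1)
```

With `window=1`, each of the 6 positions attends to at most 3 neighbors, so the attention matrix is banded rather than dense; varying the window size is one way to model fusion at different granularities.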