Convolutional recurrent networks (CRNs) that combine a convolutional encoder-decoder (CED) structure with a recurrent structure have shown promising results in monaural speech enhancement. However, the commonly used short-time Fourier transform fails to balance the competing demands of frequency and time resolution, which is crucial for accurate speech estimation. To address this issue, we propose MFT-CRN, a multi-scale short-time Fourier transform fusion model. The input speech signal is processed by short-time Fourier transforms with different window functions, and the resulting features are added layer by layer in the encoder and decoder of the network, fusing representations across window scales and effectively balancing frequency and time resolution. Comprehensive experiments on the WSJ0 dataset show that MFT-CRN significantly outperforms methods using only a single window function in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
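To make the multi-scale idea concrete, the following is a minimal sketch (not the authors' code) of extracting STFT features at several window lengths and fusing them by element-wise addition, assuming PyTorch; the window lengths, FFT size, and hop size are illustrative assumptions chosen so that all scales share the same time-frequency grid.

```python
# Hedged sketch of multi-scale STFT feature extraction and additive fusion.
# All parameter values here are illustrative, not the paper's configuration.
import torch

def multi_scale_stft(signal, win_lengths=(256, 512, 1024), n_fft=1024, hop=256):
    """Compute STFT magnitudes of one signal at several window lengths.

    Shorter windows favor time resolution; longer windows favor frequency
    resolution. A shared n_fft and hop keep the bin and frame grids aligned
    across scales so the resulting feature maps can be added element-wise.
    """
    specs = []
    for win_len in win_lengths:
        window = torch.hann_window(win_len)
        spec = torch.stft(
            signal,
            n_fft=n_fft,         # common FFT size -> identical frequency bins
            hop_length=hop,      # common hop -> identical frame positions
            win_length=win_len,  # varies the time/frequency trade-off
            window=window,       # zero-padded to n_fft internally
            return_complex=True,
        )
        specs.append(spec.abs())  # magnitude features for the encoder
    return specs                  # one (freq, time) map per window scale

if __name__ == "__main__":
    x = torch.randn(16000)                  # 1 s of audio at 16 kHz
    scales = multi_scale_stft(x)
    fused = torch.stack(scales).sum(dim=0)  # element-wise additive fusion
    print([s.shape for s in scales], fused.shape)
```

In the full model, this kind of additive fusion would be applied layer by layer inside the encoder and decoder rather than once at the input, as the abstract describes.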