In this study, we propose MH-SENet, a speech enhancement model that extracts the temporal and spectral features of speech signals in parallel. MH-SENet is based on the U-Net architecture; its encoder and decoder are built from bi-directional Mamba blocks, which process the input more precisely by considering the full context of the input sequence. Furthermore, a cross-domain Mamba-Transformer block is placed between the encoder and decoder to effectively fuse information between the time and frequency domains. We evaluated the proposed MH-SENet on the VCTK + DEMAND dataset, where it outperformed existing methods, achieving the highest PESQ score. Despite being a hybrid model, MH-SENet has fewer parameters than conventional models.
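To make the bi-directional idea concrete, the following toy sketch (not the paper's actual Mamba implementation; the simple linear recurrence stands in for a selective state-space scan) shows how a forward and a backward scan over a feature sequence can be combined so that every time step sees the full context of the input:

```python
def scan(seq, decay=0.5):
    """Linear recurrence h_t = decay * h_{t-1} + x_t (stand-in for an SSM scan)."""
    h, out = 0.0, []
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out

def bidirectional_scan(seq, decay=0.5):
    """Fuse left-to-right and right-to-left scans, as in a bi-directional block."""
    fwd = scan(seq, decay)              # forward pass: left-to-right context
    bwd = scan(seq[::-1], decay)[::-1]  # backward pass: right-to-left context
    return [f + b for f, b in zip(fwd, bwd)]  # each step now sees both directions

feats = [1.0, 0.0, 0.0, 2.0]
print(bidirectional_scan(feats))  # → [2.25, 1.0, 1.25, 4.125]
```

Unlike a causal (uni-directional) scan, the fused output at each position depends on both past and future frames, which is the motivation for using bi-directional processing in the encoder and decoder.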