This paper presents the system designed by FOSAFER for the CHiME-8 MMCSG challenge. Our system generates text transcriptions with speaker attributes from natural two-party conversations in a streaming manner. To meet the challenge requirements, we developed a directed automatic speech recognition (ASR) system based on a multi-channel microphone array. The system follows a two-stage training approach and incorporates the SpecAugment dynamic data augmentation technique to improve model performance. Its architecture comprises a front-end for speaker label detection and crosstalk suppression using the Non-Linearly Constrained Minimum Variance (NLCMV) beamformer, and a back-end with a streaming hybrid Transducer ASR model that integrates CTC and RNNT decoders. Additionally, the system handles overlapping speech and speaker switching through Serialized Output Training (SOT). Experimental results demonstrate that our system significantly outperforms the official baseline across various latency conditions, underscoring its effectiveness in complex, real-world environments and its potential for practical applications.
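To make the augmentation step concrete, the following is a minimal SpecAugment-style sketch: random frequency and time bands of a log-mel spectrogram are zeroed during training. The function name, mask counts, and widths here are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_width=10,
                 num_time_masks=2, time_width=20, rng=None):
    """Apply SpecAugment-style masking to a (freq_bins, time_frames) array.

    Hypothetical parameter values; the paper does not specify its settings.
    Returns a masked copy; the input is left unchanged.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Zero out `num_freq_masks` random horizontal (frequency) bands.
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Zero out `num_time_masks` random vertical (time) bands.
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out

# Example: mask a fake 80-bin, 300-frame log-mel spectrogram.
spec = np.random.default_rng(1).standard_normal((80, 300))
aug = spec_augment(spec, rng=np.random.default_rng(0))
```

Because masks are redrawn every time a batch is formed, this acts as "dynamic" augmentation: the model never sees the same corrupted view of an utterance twice.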