Conversational Emotion Recognition (CER) is an important problem in building intelligent human-machine interaction systems. Emotion in a conversation is shaped mainly by the conversational context and by the speakers; in addition, fully exploiting the relevant features of both the speech and text modalities is crucial to CER performance. Motivated by these considerations, we propose a novel Speaker-aware Cross-modal Fusion Architecture (SCFA). Within each modality, we design a conversation encoder, consisting of a context encoder and a speaker-aware encoder, to model the conversational content and the intra- and inter-speaker influences, respectively. On top of these encoders, a cross-modal fusion attention mechanism extracts cross-modal characteristics of the conversation to better detect emotions. We conduct experiments on the IEMOCAP and MELD datasets; compared with state-of-the-art baselines, SCFA achieves better performance on average.
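To make the described architecture concrete, the following is a minimal PyTorch sketch of the two components the abstract names: a per-modality conversation encoder (context encoder plus speaker-aware encoder) and cross-modal fusion attention between the text and speech streams. It is an illustration only, not the paper's implementation: the module names (`ConversationEncoder`, `CrossModalFusion`), the bi-GRU context encoder, the mask-based speaker attention, and all dimensions and class counts are assumptions.

```python
import torch
import torch.nn as nn

class ConversationEncoder(nn.Module):
    """Per-modality conversation encoder (assumed design):
    a bi-GRU context encoder over the utterance sequence, plus a
    speaker-aware encoder realized as self-attention restricted by a
    caller-supplied mask encoding intra-/inter-speaker structure."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.context = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.speaker_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, utt: torch.Tensor, speaker_mask: torch.Tensor) -> torch.Tensor:
        # utt: (batch, seq, dim) utterance features for one modality.
        # speaker_mask: (batch * heads, seq, seq) bool; True blocks attention.
        ctx, _ = self.context(utt)                      # conversational context
        spk, _ = self.speaker_attn(ctx, ctx, ctx,
                                   attn_mask=speaker_mask)  # speaker influence
        return ctx + spk                                # residual combination

class CrossModalFusion(nn.Module):
    """Cross-modal fusion attention (assumed design): each modality
    queries the other, and the fused streams feed a classifier."""
    def __init__(self, dim: int = 256, heads: int = 4, n_classes: int = 6):
        super().__init__()
        self.text2speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech2text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, text: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        t, _ = self.text2speech(text, speech, speech)   # text attends to speech
        s, _ = self.speech2text(speech, text, text)     # speech attends to text
        fused = torch.cat([text + t, speech + s], dim=-1)
        return self.classifier(fused)                   # per-utterance logits
```

A hypothetical usage: with batch size 2, 10 utterances, and 256-dim features per modality (and an all-`False` mask, i.e. unrestricted speaker attention, for brevity),

```python
B, T, D, H = 2, 10, 256, 4
text, speech = torch.randn(B, T, D), torch.randn(B, T, D)
mask = torch.zeros(B * H, T, T, dtype=torch.bool)
enc_t, enc_s, fuse = ConversationEncoder(), ConversationEncoder(), CrossModalFusion()
logits = fuse(enc_t(text, mask), enc_s(speech, mask))  # shape (B, T, 6)
```

In a real setting the mask would be built from speaker identities so that intra- and inter-speaker positions are weighted differently, which is the detail the speaker-aware encoder is responsible for.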