In recent years, much research has been devoted to speech emotion recognition (SER) using multimodal data. Selective fusion of features from different modalities is critical for multimodal SER. In this paper, we propose a cross-modal feature interaction-and-aggregation network (CFIA-Net) with self-consistency training for SER. Specifically, we design a cross-modal feature interaction-and-aggregation (CFIA) module that adaptively exchanges and integrates the features of the audio and text modalities. Moreover, we introduce a self-consistency training strategy, which exploits the features from deeper layers to supervise those from shallower ones so that they capture SER task-related information. The experimental results show that, compared with other bimodal SER methods, CFIA-Net achieves state-of-the-art performance on the IEMOCAP dataset, with a weighted accuracy (WA) of 83.37% and an unweighted accuracy (UA) of 83.67%.
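To make the self-consistency idea concrete, the sketch below illustrates one plausible form of such a loss in PyTorch: features from a deeper layer act as a (stop-gradient) target that supervises the projected features from a shallower layer. The class name `SelfConsistencyLoss`, the linear projection, the use of MSE, and the detach on the deep features are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfConsistencyLoss(nn.Module):
    """Sketch of a deep-to-shallow consistency loss (hypothetical)."""

    def __init__(self, shallow_dim: int, deep_dim: int):
        super().__init__()
        # Project shallow features into the deep feature space.
        self.proj = nn.Linear(shallow_dim, deep_dim)

    def forward(self, shallow_feat: torch.Tensor, deep_feat: torch.Tensor) -> torch.Tensor:
        # Stop gradients through the deep features so they serve as a
        # fixed target supervising the shallower layer.
        target = deep_feat.detach()
        pred = self.proj(shallow_feat)
        return F.mse_loss(pred, target)


if __name__ == "__main__":
    # Toy usage: a batch of 8 utterances with 128-d shallow and 256-d deep features.
    loss_fn = SelfConsistencyLoss(shallow_dim=128, deep_dim=256)
    shallow = torch.randn(8, 128)
    deep = torch.randn(8, 256)
    print(loss_fn(shallow, deep).item())
```

In practice, a term like this would be added to the main SER classification loss with a weighting coefficient; the exact layers compared and the weighting scheme are design choices specified in the full paper rather than here.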