Textual emotion recognition in conversations has attracted increasing attention in recent years owing to the growing range of applications it can serve, e.g., human-robot interaction and recommender systems. However, most existing approaches are either based on BERT-based models, which fail to exploit crucial information about the long-text context, or resort to complex entanglements of neural network architectures, resulting in less stable training procedures and slower inference. To bridge this gap, we propose a fast, compact, and parameter-efficient framework that combines a fine-tuned pre-trained RoBERTa model with a CNN-LSTM network for textual emotion recognition in conversations. First, we fine-tune the pre-trained RoBERTa model to effectively learn long-term emotion-relevant context information. Second, a convolutional neural network coupled with a bidirectional long short-term memory network and joint reinforced blocks is used to recognize emotions in conversations. Extensive experiments on the benchmark MELD emotion dataset show that our model outperforms a wide range of strong baselines and achieves results competitive with state-of-the-art approaches.