Emotion modelling in speech using deep reinforcement learning (RL) has gained attention within the speech emotion recognition (SER) community. However, prior studies have centred primarily on recurrent neural networks (RNNs) to capture emotional context, with limited exploration of more recent transformer architectures. This paper presents a comprehensive evaluation of training a transformer-based model with deep RL and benchmarks its efficacy in SER. Specifically, we explore the effectiveness of a pre-trained Wav2vec2 (w2v2)-based classifier within the deep RL setting. We evaluate the proposed deep RL framework on five publicly available datasets and benchmark the results against three recent SER studies using two deep RL methods. The results show that the transformer-based RL agent not only improves SER accuracy but also reduces the time required to begin emotion classification, outpacing the RNNs commonly used to date. Moreover, by leveraging pre-trained transformers, we observe a reduced need for the extensive pre-training that has been the norm in prior research.
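The abstract does not specify the RL formulation used in the paper. As a minimal sketch only, a common way to cast SER as deep RL is to treat each utterance as a one-step episode, use a pre-trained w2v2 classifier as the policy network, and optimise it with a REINFORCE-style policy gradient. The checkpoint name, label set, +1/-1 reward, and hyperparameters below are illustrative assumptions, not the paper's configuration.

```python
import torch
from torch.distributions import Categorical
from transformers import Wav2Vec2ForSequenceClassification

NUM_EMOTIONS = 4  # hypothetical label set, e.g. angry/happy/neutral/sad

# Pre-trained w2v2 backbone with a (randomly initialised) classification
# head serving as the policy network.
policy = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=NUM_EMOTIONS
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

def reinforce_step(waveform: torch.Tensor, labels: torch.Tensor) -> float:
    """One policy-gradient update on a batch of raw 16 kHz audio."""
    logits = policy(input_values=waveform).logits      # (batch, NUM_EMOTIONS)
    dist = Categorical(logits=logits)
    actions = dist.sample()                            # sampled emotion classes
    rewards = (actions == labels).float() * 2 - 1      # assumed +1/-1 reward
    loss = -(dist.log_prob(actions) * rewards).mean()  # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: one update on two dummy one-second utterances.
dummy_audio = torch.randn(2, 16000)
dummy_labels = torch.tensor([0, 2])
print(reinforce_step(dummy_audio, dummy_labels))
```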