Accurately recognizing emotion from speech is an essential yet challenging task due to its complexity. A common problem in most previous studies is that certain emotions are severely misclassified. In this paper, we propose a novel framework integrating cascaded attention and an adversarial joint loss for speech emotion recognition, aiming to resolve these confusions by placing greater emphasis on the emotions that are difficult to classify correctly. Specifically, we propose a cascaded attention network to extract effective emotional features, in which spatiotemporal attention selectively locates the targeted emotional regions within the input features. Within these regions, self-attention with head fusion captures the long-range dependencies of the temporal features. Furthermore, an adversarial joint loss strategy is proposed to distinguish emotional embeddings with high similarity using hard triplets generated in an adversarial fashion. Experimental results on the benchmark IEMOCAP dataset demonstrate that our method achieves absolute improvements of 3.17% and 0.39% over state-of-the-art strategies in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively.
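To make the joint-loss idea concrete, the following is a minimal sketch in PyTorch of a loss that combines cross-entropy with a triplet term mined from hard examples within a batch. It is illustrative only: the paper's adversarial generation of hard triplets is not reproduced here, a standard batch-hard mining scheme stands in for it, and all names and hyperparameters (`JointTripletLoss`, `margin`, `lambda_tri`) are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointTripletLoss(nn.Module):
    """Cross-entropy plus a batch-hard triplet term on the emotion embeddings.

    Illustrative stand-in for an adversarial joint loss: hard positives and
    negatives are mined within the batch rather than generated adversarially.
    """

    def __init__(self, margin: float = 0.3, lambda_tri: float = 1.0):
        super().__init__()
        self.margin = margin
        self.lambda_tri = lambda_tri
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, embeddings, labels):
        # Pairwise Euclidean distances between L2-normalized embeddings.
        emb = F.normalize(embeddings, dim=1)
        dist = torch.cdist(emb, emb, p=2)                      # (B, B)

        same = labels.unsqueeze(0) == labels.unsqueeze(1)      # (B, B) bool
        eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

        # Hardest positive: farthest same-label sample (excluding self).
        pos_dist = dist.masked_fill(~same | eye, float("-inf")).max(dim=1).values
        # Hardest negative: closest different-label sample.
        neg_dist = dist.masked_fill(same, float("inf")).min(dim=1).values

        triplet = F.relu(pos_dist - neg_dist + self.margin).mean()
        return self.ce(logits, labels) + self.lambda_tri * triplet


if __name__ == "__main__":
    # Toy usage: 8 utterances, 64-dim embeddings, 4 emotion classes.
    logits = torch.randn(8, 4)
    embeddings = torch.randn(8, 64)
    labels = torch.randint(0, 4, (8,))
    loss = JointTripletLoss()(logits, embeddings, labels)
    print(loss.item())
```

The intuition mirrors the abstract: the cross-entropy term drives overall classification, while the triplet term explicitly pulls apart embeddings of different emotions that are currently close together, i.e. the pairs most likely to be confused.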