In addition to linguistic content, speech carries non-lexical information such as emotion, gender, and speaker identity. Recent self-supervised learning methods for speech representation provide powerful initial feature spaces. However, the small number of training samples typically available for speech emotion recognition makes it difficult to fully exploit this vast pretrained feature space. Herein, we propose an effective way to utilize the feature space. First, to obtain complementary information, diverse features are extracted by mapping the same utterance to different clusters via multitask learning. Thereafter, fusion methods are investigated according to the correlation among the diversely mapped features. The proposed methods are evaluated on two emotional speech corpora. The experimental results show that they effectively utilize the vast pretrained feature space and achieve state-of-the-art performance, with an unweighted average recall of 78.45% on the benchmark IEMOCAP corpus.
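The abstract does not fix an architecture, but the core idea (one pretrained embedding mapped through several task heads, each supervised by a different clustering, with the resulting features fused for emotion classification) can be sketched. The following is a minimal illustration, not the paper's implementation: the encoder dimension, the auxiliary tasks (emotion, gender, speaker), the cluster counts, and the concatenation-based fusion are all assumptions for the example.

```python
import torch
import torch.nn as nn


class MultitaskFeatureExtractor(nn.Module):
    """Maps one pooled utterance embedding into several feature spaces.

    Each head is trained against a different clustering of the data
    (hypothetical tasks here: emotion / gender / speaker clusters), so the
    same pretrained representation yields diverse, complementary features.
    The fusion step is concatenation for simplicity; the paper instead
    studies fusion choices based on the correlation among these features.
    """

    def __init__(self, ssl_dim=768, feat_dim=128, n_clusters=(4, 2, 10)):
        super().__init__()
        # One projection head per auxiliary clustering task.
        self.heads = nn.ModuleList(
            [nn.Linear(ssl_dim, feat_dim) for _ in n_clusters]
        )
        # One classifier per task, predicting that task's cluster labels.
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, k) for k in n_clusters]
        )
        # Fuse the diversely mapped features for 4-way emotion prediction.
        self.fusion = nn.Linear(feat_dim * len(n_clusters), 4)

    def forward(self, ssl_embedding):
        # ssl_embedding: (batch, ssl_dim) pooled output of a frozen
        # self-supervised speech encoder (e.g., mean over time frames).
        feats = [head(ssl_embedding) for head in self.heads]
        task_logits = [clf(f) for clf, f in zip(self.classifiers, feats)]
        emotion_logits = self.fusion(torch.cat(feats, dim=-1))
        return emotion_logits, task_logits


model = MultitaskFeatureExtractor()
x = torch.randn(8, 768)            # stand-in for pooled SSL features
emotion_logits, task_logits = model(x)
print(emotion_logits.shape)        # torch.Size([8, 4])
```

Training such a model would sum a cross-entropy loss per auxiliary head with the emotion loss, so that gradients from the different clusterings push the shared pretrained space toward diverse, complementary views of the same utterance.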