Speech emotion recognition (SER) aims to identify the emotion of the speaker in a given utterance. Most existing SER methods focus on local speech features by stacking convolutional layers and training every segment of an utterance with a single utterance-level label. These methods suffer from two deficiencies: i) learning only local speech features may be insufficient for SER due to the ambiguity of emotions; ii) supervising every segment with the same label may propagate label errors, since the true emotions of some segments may not match the utterance-level label. To address these issues, we first devise a global-local fusion network that models both long- and short-range relations in speech. Second, we tailor a novel head-k-pooling loss for SER, which dynamically assigns a label to each segment and selectively computes the loss across segments. We evaluate our method on IEMOCAP and the newly collected ST-EMO dataset, and the results demonstrate its superiority and stability.
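
The following is a minimal sketch, in PyTorch, of the general idea of selective per-segment supervision described above. It is not the paper's exact head-k-pooling formulation: the tensor shapes, the value of k, and the selection criterion (keeping the k segments whose per-segment cross-entropy against the utterance label is smallest) are illustrative assumptions, and the dynamic label-assignment step is omitted.

```python
# Illustrative top-k segment-selection loss in the spirit of head-k-pooling.
# Shapes, k, and the selection rule are assumptions for this sketch.
import torch
import torch.nn.functional as F


def head_k_pooling_loss(segment_logits: torch.Tensor,
                        utterance_labels: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """segment_logits: (batch, num_segments, num_classes)
    utterance_labels: (batch,) utterance-level emotion labels."""
    batch, num_segments, num_classes = segment_logits.shape

    # Per-segment cross-entropy against the utterance-level label.
    flat_logits = segment_logits.reshape(-1, num_classes)
    flat_labels = utterance_labels.repeat_interleave(num_segments)
    seg_loss = F.cross_entropy(flat_logits, flat_labels, reduction="none")
    seg_loss = seg_loss.view(batch, num_segments)

    # Keep only the k segments per utterance that agree best with the
    # utterance label (smallest loss); remaining segments receive no
    # supervision, limiting label-error propagation from mismatched segments.
    k = min(k, num_segments)
    head_k, _ = torch.topk(seg_loss, k, dim=1, largest=False)
    return head_k.mean()


if __name__ == "__main__":
    logits = torch.randn(2, 10, 4)   # 2 utterances, 10 segments, 4 emotion classes
    labels = torch.tensor([1, 3])    # utterance-level labels
    print(head_k_pooling_loss(logits, labels, k=4))
```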