Speech emotion recognition (SER) systems are typically optimized with utterance-level supervision, yet emotion is complex and often varies within an utterance. This paper proposes a local prototypical mapping network (LPMN) that models frame-level emotional variation and better exploits these within-utterance dynamics to improve performance. Specifically, a codebook of prototypes is first constructed to characterize the complex frame-level features produced by a pre-trained backbone network. An utterance-level embedding is then obtained by selecting the most emotion-relevant mappings via a similarity measure between features and prototypes, motivated by multiple instance learning. The prototypes are jointly optimized with a quantization loss and a cross-entropy (CE) loss. A prototype selection scheme is further proposed to retain emotion-aware prototypes and reduce the bias caused by emotion-irrelevant factors. Evaluations on the IEMOCAP and MER2023 benchmarks demonstrate the effectiveness of LPMN.
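Since the abstract compresses the full method, a minimal PyTorch sketch may help fix ideas: a learnable prototype codebook, cosine-similarity mappings between frame-level features and prototypes, a MIL-style max selection that pools the most emotion-relevant mappings into an utterance embedding, and a joint CE plus quantization objective. The class name, dimensions, softmax pooling, and the 0.25 loss weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalPrototypicalMapping(nn.Module):
    """Sketch: map frame-level features onto a learnable prototype codebook
    and pool the most emotion-relevant mappings into an utterance embedding."""

    def __init__(self, feat_dim: int, num_prototypes: int, num_classes: int):
        super().__init__()
        # Codebook of prototypes characterizing frame-level features (assumed learnable).
        self.codebook = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, D) features from a pre-trained backbone (e.g. wav2vec 2.0).
        f = F.normalize(frames, dim=-1)                   # (B, T, D)
        p = F.normalize(self.codebook, dim=-1)            # (K, D)
        sim = torch.einsum('btd,kd->btk', f, p)           # frame-prototype similarity

        # MIL-style selection: each prototype keeps only its best-matching frame,
        # so the utterance is summarized by its most emotion-relevant mappings.
        best_sim, _ = sim.max(dim=1)                      # (B, K)
        weights = F.softmax(best_sim, dim=-1)             # (B, K)
        utt_emb = weights @ self.codebook                 # (B, D)
        return self.classifier(utt_emb)

def quantization_loss(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # VQ-style term: pull each frame feature toward its nearest prototype.
    d2 = (frames.unsqueeze(2) - codebook).pow(2).sum(-1)  # (B, T, K)
    return d2.min(dim=-1).values.mean()

# Toy usage: joint CE + quantization objective (weight 0.25 is an assumption).
model = LocalPrototypicalMapping(feat_dim=768, num_prototypes=64, num_classes=4)
frames = torch.randn(2, 200, 768)
labels = torch.randint(0, 4, (2,))
logits = model(frames)
loss = F.cross_entropy(logits, labels) + 0.25 * quantization_loss(frames, model.codebook)
loss.backward()
```

In this reading, the max over frames plays the role of the instance selector in multiple instance learning, while the quantization term keeps prototypes close to the frame-feature manifold; the paper's prototype selection scheme for discarding emotion-irrelevant prototypes is not reproduced here.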