Selecting training data efficiently is crucial for improving automatic speech recognition (ASR) models while keeping annotation costs low. This work extends TypiClust, a data sampling method originally validated on images, to ASR by jointly considering acoustic diversity and transcription uncertainty. Our method clusters Wav2Vec2 speech embeddings and, within each cluster, prioritizes samples that are both typical and transcribed with low confidence, so that the selected data are representative yet difficult for the model. We evaluate the method on two Japanese speech datasets, CSJ and ReazonSpeech, and show that it achieves lower recognition error rates than random selection or selection based on a single criterion. These results indicate that selecting data on the basis of both diversity and uncertainty improves ASR performance while reducing annotation costs.