Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of automatic speech recognition models based on connectionist temporal classification (CTC). We propose an approach that accounts for the fact that not every frame-level change leads to a token-level change after CTC decoding. The approach shows promising performance for predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, across two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80% of a wav2vec2.0 model's errors on MCV by selecting only 10% of the tokens. We further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator by re-annotating 500 utterances of kidsTALC.
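To make the core observation concrete, the following minimal sketch shows standard CTC greedy decoding, which collapses repeated frame labels and removes blanks, so that a flipped frame (e.g., a blank replacing one frame of a repeated label run) may leave the decoded token sequence unchanged. The frame-to-token grouping it records is one natural basis for a token-level uncertainty score; the aggregation shown (worst supporting frame per token) is an illustrative assumption, not necessarily the paper's exact method, and the blank index and function names are hypothetical.

```python
import numpy as np

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_greedy_decode(log_probs: np.ndarray):
    """Greedy CTC decoding: take the argmax label per frame, then
    collapse consecutive repeats and drop blanks. Returns the token
    sequence and, for each token, the frames that produced it."""
    frame_ids = log_probs.argmax(axis=-1)
    tokens, frame_groups = [], []
    prev = BLANK
    for t, idx in enumerate(frame_ids):
        if idx != BLANK and idx != prev:
            tokens.append(int(idx))      # new token starts here
            frame_groups.append([t])
        elif idx != BLANK and idx == prev:
            frame_groups[-1].append(t)   # repeated frame, same token
        prev = idx
    return tokens, frame_groups

def token_uncertainty(log_probs: np.ndarray, frame_groups):
    """Illustrative token-level score: one minus the confidence of the
    least confident frame supporting each decoded token."""
    probs = np.exp(log_probs)
    return [1.0 - min(probs[t].max() for t in group)
            for group in frame_groups]

# Toy example: 5 frames over a vocabulary {blank, 'a', 'b'}.
lp = np.log(np.array([
    [0.1, 0.8, 0.1],   # 'a'
    [0.2, 0.7, 0.1],   # 'a' (repeat, merged into the same token)
    [0.8, 0.1, 0.1],   # blank
    [0.1, 0.2, 0.7],   # 'b'
    [0.6, 0.2, 0.2],   # blank
]))
tokens, groups = ctc_greedy_decode(lp)       # tokens -> "ab"
scores = token_uncertainty(lp, groups)       # one score per token
```

In this toy example, changing frame 1 from 'a' to blank still decodes to "ab", illustrating why frame-level uncertainty can overstate the uncertainty of the token-level output after CTC decoding.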