With the widespread use of technologies directed towards children, child-machine interaction has become a topic of great interest. Computers must interpret relevant contextual user cues to provide a more natural interactive environment. In this paper, we analyze audio-visual cues of user uncertainty in spontaneous conversations between a child and a computer in a problem-solving setting. We hypothesize that a combination of acoustic, lexical, and visual gestural cues can predict when a child is uncertain in a given turn. First, we carefully annotated the audio-visual uncertainty cues. Next, we trained decision trees using leave-one-speaker-out cross-validation to identify uncertainty cues that generalize across children, attaining 0.494 kappa agreement with ground-truth uncertainty labels. Lastly, we trained decision trees using leave-one-turn-out cross-validation for each child to determine which cues carried the most intra-child predictive power, attaining 0.555 kappa agreement. Both results were significantly higher than a voting baseline but lower than the average human kappa agreement of 0.744. We explain which annotated features produced the best results, so that future research can concentrate on automatically recognizing these uncertainty cues from the audio and video signals.
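As a rough illustration of the evaluation protocol described above (not the authors' code), the sketch below trains a decision tree under leave-one-speaker-out cross-validation and scores its predictions against ground-truth labels with Cohen's kappa, using scikit-learn. The arrays `features` (annotated acoustic, lexical, and visual cues per turn), `uncertain` (binary uncertainty labels), and `speaker_ids` (one child ID per turn) are hypothetical placeholders.

```python
# Minimal sketch, assuming per-turn feature vectors, binary uncertainty labels,
# and a speaker ID for each turn; scikit-learn provides the tree, the
# leave-one-group-out splitter, and the kappa metric.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import cohen_kappa_score


def loso_kappa(features: np.ndarray, uncertain: np.ndarray, speaker_ids: np.ndarray) -> float:
    """Leave-one-speaker-out evaluation: train on all children but one,
    predict the held-out child's turns, then score overall kappa agreement."""
    preds = np.empty_like(uncertain)
    for train_idx, test_idx in LeaveOneGroupOut().split(features, uncertain, speaker_ids):
        tree = DecisionTreeClassifier(max_depth=5, random_state=0)  # depth is an arbitrary choice
        tree.fit(features[train_idx], uncertain[train_idx])
        preds[test_idx] = tree.predict(features[test_idx])
    return cohen_kappa_score(uncertain, preds)
```

The per-child (leave-one-turn-out) variant would follow the same pattern, but restrict `features` and `uncertain` to a single child's turns and use `LeaveOneOut` as the splitter.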