Bilingualism is rising worldwide, yet the assessment of bilingual children faces major challenges. A shortage of bilingual clinicians and the labor-intensive nature of speech data annotation often cause misdiagnoses, delaying care and research. Using a Mandarin-English adult-child speech dataset (53 telehealth sessions), we explore how speech models can automate the annotation of clinical data involving multiple languages, multiple speakers, children's speech, and code-switched utterances. Our findings indicate that simple pre-processing improves automatic speech recognition (ASR) accuracy. Specifically, integrating speaker diarization with OpenAI’s Whisper medium model reduces word error rates to 35% for child speech and 30% for code-switching, rivaling fine-tuned transformer models. As the first ASR pipeline evaluation for a Mandarin-English clinical dataset, our study highlights model limitations, establishes a benchmark for bilingual speech technology, and supports improved clinical services.
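
For readers unfamiliar with the metric, word error rate is conventionally the number of substitutions, deletions, and insertions needed to turn the ASR hypothesis into the reference transcript, divided by the number of reference tokens. The sketch below illustrates that standard definition only; it is not the scoring code used in this study. The function names `tokenize` and `word_error_rate` are hypothetical, and the tokenization rule (one token per Chinese character, one token per English word) is an assumption made so that mixed Mandarin-English utterances can be scored at all.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption (not from the study): each CJK character is one token,
    # and each run of Latin letters, digits, or apostrophes is one token.
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped character out of five reference tokens.
print(word_error_rate("我想要 apple juice", "我要 apple juice"))
```

Under this tokenization, the code-switched example has five reference tokens and one deletion, giving an error rate of 0.2. Because insertions are counted, the rate can exceed 1.0 when the hypothesis is much longer than the reference.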