Autism Spectrum Disorder (ASD) is a neurodevelopmental condition characterized by deficits in social communication that manifest in both language use and speech patterns. Because ASD assessment relies on behavioral observation rather than standardized medical tests, an objective, automated evaluation method is needed. Since ASD affects both language and speech production, this study proposes a cascaded multimodal framework for ASD severity assessment. The framework takes raw audio as input, generates transcriptions via automatic speech recognition, and extracts linguistic and acoustic features using speech and language foundation models. Because ASD involves atypical speech at both the suprasegmental and segmental levels, two complementary speech foundation models are employed to capture these characteristics. A co-attention mechanism then integrates the linguistic and acoustic representations to estimate severity. The proposed approach achieves a Spearman’s correlation of 0.5629 with human ratings, offering a scalable, fully automated ASD assessment tool.
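To make the fusion step concrete, the following is a minimal PyTorch sketch of a co-attention module of the kind described above: linguistic embeddings (e.g., derived from ASR transcripts) and acoustic embeddings cross-attend to each other, are pooled, and feed a regression head that outputs a severity score. All module names (`CoAttentionFusion`), dimensions, and the pooling and regression choices are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of co-attention fusion for severity regression.
# Names, dimensions, pooling, and the regression head are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    """Fuses linguistic and acoustic sequences with bidirectional
    cross-attention, then regresses a scalar severity score."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Text attends to speech, and speech attends to text.
        self.text_to_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.speech_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.regressor = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, text_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, text_len, dim) from a language foundation model
        # speech_emb: (batch, speech_len, dim) from speech foundation models
        text_ctx, _ = self.text_to_speech(text_emb, speech_emb, speech_emb)
        speech_ctx, _ = self.speech_to_text(speech_emb, text_emb, text_emb)
        # Mean-pool each attended sequence and concatenate for regression.
        pooled = torch.cat([text_ctx.mean(dim=1), speech_ctx.mean(dim=1)], dim=-1)
        return self.regressor(pooled).squeeze(-1)  # (batch,) severity estimates


if __name__ == "__main__":
    model = CoAttentionFusion()
    text = torch.randn(2, 32, 768)     # e.g., transcript embeddings
    speech = torch.randn(2, 128, 768)  # e.g., frame-level acoustic features
    print(model(text, speech).shape)   # torch.Size([2])
```

In this sketch the two attention directions let each modality condition on the other before pooling, which is one common way to realize a co-attention mechanism; the paper's exact attention configuration may differ.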