Deep learning (DL)-based approaches, such as LSTMs and Transformers, have shown remarkable advancements in automated speaking assessment (ASA). Nevertheless, two challenges persist: faithful modeling of hierarchical context, such as portraying word-to-paragraph relationships, and seamless integration of hand-crafted knowledge into DL-based models. In this work, we propose utilizing heterogeneous graph neural networks (HGNNs) as the backbone model to handle hierarchical context effectively. Furthermore, to enhance node embeddings in the HGNN, we integrate external knowledge from spoken content, including text-based features (vocabulary profile) and speech-based features (filled pauses). Experimental results on the NICT JLE corpus validate the efficacy of our approach, which achieves superior performance over existing Transformer-based language models. Our findings also highlight the utility of our method for accurately evaluating speaking proficiency, showcasing its practical promise.