This study presents a novel multimodal, multitask learning model for predicting five proficiency scores of L2 English speech. The proposed approach integrates speech and text embeddings through multimodal transformer blocks with cross-modal attention, which dynamically refines features between modalities to capture complementary information. A joint loss function combining mean squared error (MSE) and a Trait-Aware (TA) loss further improves the model by exploiting relationships among proficiency traits. Experiments with combinations of four embeddings (MFCCs, GloVe, wav2vec 2.0, and BERT) showed that the proposed model with wav2vec 2.0 and BERT embeddings achieved the best performance, with a mean Pearson correlation coefficient (PCC) of 0.734 and a standard deviation of 0.0129 across the five criteria. The approach significantly outperforms unimodal and baseline multimodal models, demonstrating the potential of advanced multimodal architectures and task-aware optimization for automated speech assessment.
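The abstract does not spell out how the TA term is computed. As a rough illustration only, the PyTorch-style sketch below assumes a TA loss that encourages the pairwise differences among the five predicted trait scores to match those of the reference scores; the weighting `lambda_ta` and the function names are hypothetical, not the paper's specification.

```python
import torch
import torch.nn as nn


def trait_aware_loss(pred, target):
    """Assumed trait-aware (TA) term: penalizes mismatch between the
    pairwise trait-score differences of predictions and references.

    pred, target: (batch, 5) tensors of the five proficiency scores.
    """
    # Per-sample pairwise differences across the five traits: (B, 5, 5).
    pred_diff = pred.unsqueeze(2) - pred.unsqueeze(1)
    target_diff = target.unsqueeze(2) - target.unsqueeze(1)
    return (pred_diff - target_diff).pow(2).mean()


class JointLoss(nn.Module):
    """Joint objective: MSE on the five trait scores plus the assumed
    TA term, weighted by lambda_ta (a hypothetical hyperparameter)."""

    def __init__(self, lambda_ta=0.5):
        super().__init__()
        self.mse = nn.MSELoss()
        self.lambda_ta = lambda_ta

    def forward(self, pred, target):
        return self.mse(pred, target) + self.lambda_ta * trait_aware_loss(pred, target)


# Example usage on dummy data: 8 utterances, 5 trait scores each.
criterion = JointLoss(lambda_ta=0.5)
pred = torch.randn(8, 5)   # model outputs
target = torch.rand(8, 5)  # reference scores
loss = criterion(pred, target)
```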