Identifying the language in which a song is performed is important for applications such as personalized recommendations, discovery, and search. In this paper, we present an automated multimodal approach to identifying the singing language of songs that scales to millions of songs. The proposed model uses a variety of song-level features: a consumption embedding derived from listening-session data from a music streaming service, a segment-level vocal embedding computed from the vocal track of a song, and generic timbral features. Our experimental results show that our approach outperforms benchmark models on the singing-language identification task, and an ablation study demonstrates the benefit of the multimodal approach. In addition, we present a data augmentation technique that increases the robustness of the model to missing data modalities.
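
The abstract does not specify the fusion architecture or the augmentation scheme; the sketch below is only a minimal illustration, assuming late fusion of the three song-level feature vectors by concatenation and a modality-dropout-style augmentation for missing modalities. All class names, dimensions, and the number of languages are hypothetical placeholders, not values from the paper.

```python
import torch
import torch.nn as nn


class MultimodalLanguageClassifier(nn.Module):
    """Illustrative fusion model: concatenates song-level modality features
    and predicts the singing language. Dimensions are placeholders."""

    def __init__(self, consumption_dim=128, vocal_dim=256, timbre_dim=64,
                 hidden_dim=256, num_languages=46):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(consumption_dim + vocal_dim + timbre_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_languages),
        )

    def forward(self, consumption, vocal, timbre):
        # Late fusion by concatenating the three song-level feature vectors.
        return self.net(torch.cat([consumption, vocal, timbre], dim=-1))


def drop_random_modality(features, p=0.3):
    """Hypothetical augmentation: with probability p, zero out one randomly
    chosen modality per song so the model learns to cope with missing inputs."""
    consumption, vocal, timbre = (f.clone() for f in features)
    for i in range(consumption.shape[0]):  # iterate over songs in the batch
        if torch.rand(()) < p:
            choice = int(torch.randint(0, 3, ()))
            (consumption, vocal, timbre)[choice][i].zero_()
    return consumption, vocal, timbre


if __name__ == "__main__":
    model = MultimodalLanguageClassifier()
    batch = (torch.randn(8, 128), torch.randn(8, 256), torch.randn(8, 64))
    logits = model(*drop_random_modality(batch))
    print(logits.shape)  # torch.Size([8, 46])
```

At inference time, a missing modality would simply be replaced by the same zero vector used during augmentation, which is one common way to make a concatenation-based model tolerant to absent inputs.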