Early diagnosis of depression is essential to reduce costs and improve treatment efficiency. Recently, automatic depression detection (ADD) based on audio and textual features from participant interviews has emerged as a promising approach and attracted significant attention. However, existing models are constrained to monolingual depression datasets, with limited exploration of multi-lingual scenarios. To investigate the effectiveness of multi-lingual data for the ADD task and its transferability in low-resource scenarios, in this paper we propose a Multi-Modal Multi-Lingual (M3L) depression detection framework and an effective language adaptive fine-tuning (LAFT) strategy to further boost performance on the target language. M3L leverages the pretrained speech model Whisper and the text model XLM-RoBERTa to enhance the encoding of multi-lingual information. Evaluations on the DAIC-WOZ (English) and EATD (Chinese) datasets demonstrate that M3L effectively integrates multi-lingual and multi-modal information, while the proposed LAFT consistently boosts performance across both datasets.
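As a rough illustration of the multi-modal fusion described above, the following is a minimal numpy sketch of late fusion between a Whisper-style audio representation and an XLM-RoBERTa-style text representation. All dimensions, weight shapes, and the mean-pool-then-concatenate design are assumptions for illustration, not details taken from the paper; stand-in random features replace the pretrained encoders so the sketch runs without model downloads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumed, not from the paper): Whisper-like audio
# frames of size 512, XLM-RoBERTa-like token embeddings of size 768.
audio_dim, text_dim, hidden = 512, 768, 256
W_a = rng.standard_normal((audio_dim, hidden)) * 0.01    # audio projection
W_t = rng.standard_normal((text_dim, hidden)) * 0.01     # text projection
W_cls = rng.standard_normal((2 * hidden, 2)) * 0.01      # binary classifier head

def fuse_and_classify(audio_feats, text_feats):
    # Mean-pool each modality over its time/token axis, project into a
    # shared hidden size, concatenate, and map to 2 class logits
    # (depressed vs. not depressed).
    a = audio_feats.mean(axis=1) @ W_a
    t = text_feats.mean(axis=1) @ W_t
    return np.concatenate([a, t], axis=-1) @ W_cls

# Stand-ins for encoder outputs: 4 interviews, 100 audio frames, 50 tokens.
audio = rng.standard_normal((4, 100, audio_dim))
text = rng.standard_normal((4, 50, text_dim))
logits = fuse_and_classify(audio, text)
print(logits.shape)  # (4, 2)
```

In practice the frozen or fine-tuned pretrained encoders would produce the feature tensors, and the projection and classifier weights would be learned; the sketch only shows how the two modalities could be combined into a single prediction per interview.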