With the widespread adoption of the Internet, social networks have become an indispensable part of people's lives. Because social networks record users' daily moods and states, they provide a new avenue for detecting depression. However, most current approaches focus on the fusion of multimodal features and overlook the importance of fine-grained behavioral information. In this paper, we propose the Joint Attention Multi-Scale Fusion Network (JAMFN), a model that captures multi-scale behavioral cues of depression and leverages the proposed Joint Attention Fusion (JAF) module to extract the temporal importance of multiple modalities and guide the fusion of multi-scale modality pairs. Experiments conducted on the D-vlog dataset show that JAMFN outperforms all benchmark models, indicating that it can effectively mine latent depressive behavior.
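The abstract describes the JAF module only at a high level. As an illustration, the minimal PyTorch sketch below shows one way a joint temporal-attention fusion of two modality streams could be organized: per-time-step attention weights are computed jointly from both modalities and used to pool and combine their features. The module name, dimensions, and concatenation-based fusion are assumptions for exposition, not the paper's actual design.

```python
# Illustrative sketch only: the abstract does not specify the JAF internals,
# so the names, dimensions, and attention scheme below are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class JointAttentionFusion(nn.Module):
    """Toy cross-modal fusion: temporal attention weights computed jointly
    from two modality streams are used to pool and fuse their features."""

    def __init__(self, dim_a: int, dim_v: int, dim_h: int = 128):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_h)   # e.g. acoustic features
        self.proj_v = nn.Linear(dim_v, dim_h)   # e.g. visual features
        # Joint temporal scorer: looks at both modalities per time step.
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim_h, dim_h), nn.Tanh(), nn.Linear(dim_h, 1)
        )

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: (batch, time, dim_a), x_v: (batch, time, dim_v)
        h_a, h_v = self.proj_a(x_a), self.proj_v(x_v)
        joint = torch.cat([h_a, h_v], dim=-1)             # (B, T, 2*dim_h)
        alpha = torch.softmax(self.scorer(joint), dim=1)  # (B, T, 1) temporal weights
        # Weighted temporal pooling of each modality, then fusion by concatenation.
        z_a = (alpha * h_a).sum(dim=1)                    # (B, dim_h)
        z_v = (alpha * h_v).sum(dim=1)
        return torch.cat([z_a, z_v], dim=-1)              # fused representation


# Usage: fuse 25 time steps of 64-d acoustic and 136-d visual features.
fusion = JointAttentionFusion(dim_a=64, dim_v=136)
fused = fusion(torch.randn(4, 25, 64), torch.randn(4, 25, 136))
print(fused.shape)  # torch.Size([4, 256])
```

In a multi-scale setting, a module of this kind could be applied to each scale's modality pair, with the resulting fused vectors combined downstream; the abstract leaves those details to the main text.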