Multimodal prediction of depression severity is a widely researched problem. We present a multimodal depression severity score prediction system that uses articulatory coordination features (ACFs) derived from vocal tract variables (TVs) together with text transcriptions obtained from an automatic speech recognition tool, and that reduces root mean squared error relative to unimodal classifiers (by 14.8% and 11% compared to the audio and text models, respectively). A multi-stage convolutional recurrent neural network is trained on the TV-based ACFs using a staircase regression (ST-R) approach, which better captures the quasi-numerical nature of the depression severity scores. A text model is trained using the Hierarchical Attention Network (HAN) architecture. The multimodal system combines embeddings from the session-level audio model and the HAN text model with a session-level auxiliary feature vector containing timing measures of the speech signal. We also show that this model tracks depression severity reasonably well for most subjects, and we analyze the underlying reasons for the cases in which the predictions deviate significantly from the ground-truth scores.
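
As a minimal sketch of the fusion step described above (not the authors' implementation; all layer sizes, dropout rate, and module names are illustrative assumptions), the session-level audio embedding, the HAN text embedding, and the auxiliary timing-feature vector can be concatenated and passed to a small regression head that outputs the severity score:

```python
# Hypothetical late-fusion sketch: concatenate session-level audio (ACF) and
# text (HAN) embeddings with auxiliary timing features, then regress a scalar
# depression severity score. Dimensions are placeholder assumptions.
import torch
import torch.nn as nn

class MultimodalSeverityRegressor(nn.Module):
    def __init__(self, audio_dim=128, text_dim=100, aux_dim=8, hidden_dim=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim + aux_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),  # scalar severity score
        )

    def forward(self, audio_emb, text_emb, aux_feats):
        # audio_emb: embedding from the session-level ACF audio model
        # text_emb:  document-level embedding from the HAN text model
        # aux_feats: session-level timing measures of the speech signal
        fused = torch.cat([audio_emb, text_emb, aux_feats], dim=-1)
        return self.head(fused).squeeze(-1)

# Example with random tensors standing in for real session-level inputs.
model = MultimodalSeverityRegressor()
scores = model(torch.randn(4, 128), torch.randn(4, 100), torch.randn(4, 8))
```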