In multimodal sentiment analysis (MSA), text-centric approaches have been shown to achieve superior performance: they adopt a powerful text model (e.g., BERT) as the backbone and study how to effectively incorporate the non-verbal modalities (i.e., audio and visual) to obtain more refined and expressive word representations. In previous methods, the non-verbal information injected into a word representation comes only from the non-verbal segment corresponding to the time span of that word, ignoring long-range dependencies across modalities. Moreover, these methods rely on attention mechanisms normalized with the Softmax function, which makes it difficult to highlight the important information in non-verbal sequences. To address these issues, this paper proposes a non-verbal information injection method, Word-wise Sparse Attention (WSA), to capture cross-modal long-range dependencies. When injecting non-verbal information into a word, the word is used as a semantic anchor to retrieve the most relevant non-verbal information from the holistic non-verbal sequences. Furthermore, an advanced Multimodal Adaptive Gating (MAG) mechanism is introduced to determine the amount of information injected from the non-verbal modalities. We evaluate our method on two publicly available multimodal sentiment analysis datasets. Experimental results show that the proposed approach consistently improves over the baseline model on all metrics.
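To make the injection idea concrete, the following is a minimal PyTorch sketch of word-wise sparse injection under stated assumptions: sparsity is approximated here by top-k selection over attention scores followed by renormalization, and the gate is a simple sigmoid over the concatenated word and context vectors. The class name, tensor shapes, and the `top_k` parameter are illustrative assumptions, not the paper's exact WSA or MAG formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordWiseSparseInjection(nn.Module):
    """Illustrative sketch: each word queries the full non-verbal sequence,
    keeps only the top-k most relevant positions (sparse attention), and a
    gate controls how much non-verbal information is injected."""

    def __init__(self, d_text: int, d_nonverbal: int, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.q_proj = nn.Linear(d_text, d_text)        # word as semantic anchor (query)
        self.k_proj = nn.Linear(d_nonverbal, d_text)   # non-verbal keys
        self.v_proj = nn.Linear(d_nonverbal, d_text)   # non-verbal values
        self.gate = nn.Linear(2 * d_text, d_text)      # adaptive gate over [word; context]

    def forward(self, words: torch.Tensor, nonverbal: torch.Tensor) -> torch.Tensor:
        # words:     (batch, n_words, d_text)
        # nonverbal: (batch, n_frames, d_nonverbal) -- the holistic audio or visual sequence
        q = self.q_proj(words)
        k = self.k_proj(nonverbal)
        v = self.v_proj(nonverbal)

        # Scaled dot-product scores of every word against every non-verbal frame.
        scores = torch.matmul(q, k.transpose(-1, -2)) / q.size(-1) ** 0.5  # (B, Nw, Nf)

        # Sparsify: keep only the top-k scores per word and mask out the rest,
        # so attention concentrates on the most relevant non-verbal frames.
        k_eff = min(self.top_k, scores.size(-1))
        topk_vals, _ = scores.topk(k_eff, dim=-1)
        threshold = topk_vals[..., -1:].expand_as(scores)
        scores = scores.masked_fill(scores < threshold, float("-inf"))

        attn = F.softmax(scores, dim=-1)               # renormalize the surviving scores
        context = torch.matmul(attn, v)                # (B, Nw, d_text) non-verbal context per word

        # Gate how much non-verbal context is injected into each word representation.
        g = torch.sigmoid(self.gate(torch.cat([words, context], dim=-1)))
        return words + g * context                     # refined word representations
```

In this sketch the sparsity comes from hard top-k selection; alternative sparse normalizers (e.g., sparsemax-style functions) would serve the same purpose of suppressing irrelevant non-verbal frames while letting each word attend beyond its own time span.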