In this study, we propose a multimodal model for predicting the end-of-utterance probability in spoken dialogue systems, highlighting the unique role of visual cues in addition to acoustic and linguistic information. Although the effectiveness of visual cues such as gaze, mouth, and head movements has been suggested, few studies have fully incorporated them into turn-taking models, and the relative importance of these cues remains under-researched. To address these issues, we first conduct an ablation study on visual features, showing that eye movements contribute more than mouth and head movements. We then employ an end-to-end visual feature extraction model based on a 3D CNN to capture these visual cues comprehensively. Combining the visual features with acoustic and linguistic information improves the AUC for end-of-utterance prediction from 0.896 to 0.920, demonstrating the effectiveness of incorporating visual cues into turn-taking models.
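To make the overall architecture concrete, the sketch below shows one way such a model could be wired together in PyTorch: a 3D-CNN branch that encodes a clip of face frames end to end, concatenated with acoustic and linguistic feature vectors and passed through a small classifier that outputs an end-of-utterance probability. All layer sizes, feature dimensions, and the concatenation-based fusion are illustrative assumptions for this sketch, not the configuration reported in the paper.

```python
# Minimal sketch of a multimodal end-of-utterance predictor (assumed PyTorch).
# Layer sizes, input dimensions, and late fusion by concatenation are placeholders.
import torch
import torch.nn as nn

class VisualEncoder3D(nn.Module):
    """End-to-end visual feature extractor over a short clip of face frames (3D CNN)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # collapse time and space to one vector per clip
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, clip):           # clip: (batch, 3, frames, height, width)
        feats = self.conv(clip).flatten(1)
        return self.proj(feats)

class EndOfUtterancePredictor(nn.Module):
    """Fuses visual, acoustic, and linguistic features into an end-of-utterance probability."""
    def __init__(self, acoustic_dim=40, linguistic_dim=300, visual_dim=128):
        super().__init__()
        self.visual = VisualEncoder3D(visual_dim)
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + acoustic_dim + linguistic_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, clip, acoustic, linguistic):
        v = self.visual(clip)
        x = torch.cat([v, acoustic, linguistic], dim=-1)
        return torch.sigmoid(self.fusion(x)).squeeze(-1)   # P(end of utterance)

# Usage example with random tensors standing in for a 16-frame face clip,
# frame-level acoustic features, and a linguistic embedding (dimensions are placeholders).
model = EndOfUtterancePredictor()
p = model(torch.randn(2, 3, 16, 64, 64), torch.randn(2, 40), torch.randn(2, 300))
print(p.shape)  # torch.Size([2])
```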