Analysis of vocal fold vibration in high-speed videoendoscopy can aid in the assessment of voice disorders, and glottis segmentation is a preliminary step of this analysis. Previous deep learning approaches to glottis segmentation have relied on fully supervised learning, which requires pixel-level annotation. Collecting pixel-level annotations is time-consuming and tedious. To overcome this challenge, we explore the use of bounding box labels, which are easier to annotate than pixel-level masks, for weakly supervised glottis segmentation. The proposed method uses multiple instance learning to leverage the bounding box labels in the form of bag labels. It outperforms the baseline (trained with the bounding box as the mask) by 0.20 in Dice score and, after fine-tuning, matches the performance of the fully supervised version.
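To make the idea of bounding boxes as bag labels concrete, the following is a minimal sketch of one common box-supervised multiple instance learning loss. It is an illustration under stated assumptions, not necessarily the formulation used in this work: the bag construction (each row and column of the box is a positive bag, pixels outside the box are negatives), the max-pooling of pixel scores, and the names `mil_box_loss`, `prob`, and `box` are all hypothetical choices made for this example.

```python
import numpy as np


def mil_box_loss(prob, box, eps=1e-7):
    """Illustrative MIL loss for one image with one bounding box.

    prob: (H, W) array of predicted foreground probabilities in [0, 1].
    box:  (top, left, bottom, right) with exclusive bottom/right indices.
          Assumed to be non-empty and to lie inside the image.
    """
    top, left, bottom, right = box
    inside = prob[top:bottom, left:right]

    # Assumption: every row and every column of the box is a positive bag,
    # i.e. it must contain at least one foreground pixel. The bag score is
    # the maximum pixel probability in that row or column.
    pos_scores = np.concatenate([inside.max(axis=1), inside.max(axis=0)])

    # Assumption: every pixel outside the box is a known negative instance.
    outside = np.ones(prob.shape, dtype=bool)
    outside[top:bottom, left:right] = False
    neg_scores = prob[outside]

    # Binary cross-entropy on the bag scores (positives) and on the
    # outside pixels (negatives).
    pos_loss = -np.log(np.clip(pos_scores, eps, 1.0)).mean()
    if neg_scores.size > 0:
        neg_loss = -np.log(np.clip(1.0 - neg_scores, eps, 1.0)).mean()
    else:
        neg_loss = 0.0
    return float(pos_loss + neg_loss)


# A thin diagonal structure inside the box satisfies every positive bag,
# so the loss does not force the whole box to be filled.
H = W = 8
box = (2, 2, 6, 6)
thin = np.zeros((H, W))
for i in range(2, 6):
    thin[i, i] = 1.0
empty = np.zeros((H, W))

loss_thin = mil_box_loss(thin, box)
loss_empty = mil_box_loss(empty, box)
```

In this sketch, a prediction that leaves the box empty is penalised because no positive bag contains a confident foreground pixel, whereas a prediction that places foreground outside the box is penalised through the negative term. By contrast, a baseline that treats the whole box as the mask would push every pixel inside the box towards foreground.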