This paper presents a neural method for training an onset-and-offset-aware sound event detection (SED) model using heterogeneously labeled data. SED models are typically trained to predict frame-wise event activities, which fluctuate over time and thus yield unstable event boundaries. An end-to-end (E2E) method based on a hidden semi-Markov model (HSMM) has been proposed to improve performance by converting frame-wise predictions into event boundaries. This method, however, relies on temporal (strong) labels, which are costly to annotate. To overcome this limitation, we propose an E2E method for training an HSMM-based model from clip-level labels and unlabeled data. Whereas strong supervision was formulated as the maximization of event-wise posterior probabilities, we derive probabilistic objectives for such incompletely labeled data. Experimental results on the DESED dataset show that our method outperforms standard frame-wise methods.