Semi-supervised Sound Event Detection (SSED) aims to recognize the categories of sound events and to mark their onset and offset times using a small amount of weakly labeled data and a large amount of unlabeled data. To exploit unlabeled data effectively and reduce over-fitting, regularization techniques play a critical role in SSED. In this paper, we propose a novel jointly regularized and locally down-sampled Conformer (Joint-Former) model for SSED. Joint-Former first down-samples the spectrogram locally, learning token representations with high temporal resolution at low computational cost. It then exploits unlabeled data effectively by integrating Mean-Teacher and Masked Spectrogram Modeling as joint regularization within a multitask learning framework. Extensive experiments on the DCASE 2019, DCASE 2020, and DCASE 2021 Task 4 SSED datasets show that Joint-Former significantly outperforms existing methods.
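To make the joint-regularization idea concrete, the following is a minimal numpy sketch of how such a multitask objective could be assembled: a supervised BCE term on weakly labeled clips, a Mean-Teacher consistency term between student and (EMA) teacher predictions on unlabeled clips, and a Masked Spectrogram Modeling reconstruction term on masked time-frequency positions. All tensor shapes, the loss weights `w_cons` and `w_msm`, and the random toy data are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy over clip-level multi-label predictions."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred) + (1 - target) * np.log(1 - pred))))

def mse(a, b):
    """Mean squared error, used for both consistency and reconstruction."""
    return float(np.mean((a - b) ** 2))

# Toy batch: clip-level predictions for 4 weakly labeled clips, 10 event classes.
student_weak = rng.random((4, 10))
weak_labels = (rng.random((4, 10)) > 0.8).astype(float)

# Frame-level predictions on unlabeled clips: student vs. EMA teacher
# (the teacher's outputs serve as the Mean-Teacher consistency target).
student_frames = rng.random((4, 100, 10))
teacher_frames = rng.random((4, 100, 10))

# Masked Spectrogram Modeling: reconstruct masked time-frequency positions
# of the input spectrogram; only masked positions contribute to the loss.
spec = rng.random((4, 100, 64))     # 100 frames x 64 mel bins
recon = rng.random((4, 100, 64))    # model's reconstruction
mask = rng.random((4, 100, 64)) > 0.5

w_cons, w_msm = 2.0, 1.0  # hypothetical (e.g. ramped-up) loss weights
loss = (bce(student_weak, weak_labels)
        + w_cons * mse(student_frames, teacher_frames)
        + w_msm * mse(spec[mask], recon[mask]))
print(f"joint loss: {loss:.4f}")
```

In a real training loop, the teacher's weights would be an exponential moving average of the student's, and the three terms would be backpropagated jointly through the shared Conformer encoder, which is what makes this a multitask regularization scheme rather than three separate objectives.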