In this paper, we propose a Conformer-based dual-branch framework to fully exploit the multi-layer features of a pre-trained model (PTM) for sound event detection (SED). The proposed model follows the mainstream framework, consisting of a front-end encoder built upon the pre-trained Audio Teacher-Student Transformer (ATST) model and a back-end context network. For the front-end, we evenly divide the Transformer layers of ATST into shallow and deep parts, each of which is fused via weighted integration to effectively incorporate multi-layer features for SED. In the back-end, a dual-branch Conformer is used to extract both high-level and low-level clues from the aggregated features. Furthermore, we adopt adapter modules instead of fully fine-tuning the ATST model in the training stage, achieving comparable performance with far fewer parameters. We carry out experiments on the validation set of Task 4 of the DCASE 2024 Challenge, and the results demonstrate the efficacy of the proposed method.
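The weighted integration described above can be sketched as a learnable softmax-weighted sum over each group of layer outputs. The following is a minimal, dependency-free illustration, not the authors' implementation; the layer count (12), feature dimension, and the uniform weights are assumptions for the toy example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_layers(layer_feats, weights):
    """Softmax-weighted sum of per-layer feature vectors.

    layer_feats: list of L feature vectors (lists of floats, same dim);
    weights: L raw (learnable, in a real model) scores, one per layer.
    """
    w = softmax(weights)
    dim = len(layer_feats[0])
    return [sum(w[l] * layer_feats[l][d] for l in range(len(layer_feats)))
            for d in range(dim)]

# Hypothetical setup: 12 Transformer layers split evenly into shallow (0-5)
# and deep (6-11) groups, each fused with its own set of weights. Here the
# per-layer features are toy constants and the weights are fixed at zero
# (i.e. uniform after softmax) purely for illustration.
feats = [[float(l)] * 4 for l in range(12)]   # toy features, dim = 4
shallow = fuse_layers(feats[:6], [0.0] * 6)   # mean of layers 0-5 -> 2.5 per dim
deep = fuse_layers(feats[6:], [0.0] * 6)      # mean of layers 6-11 -> 8.5 per dim
```

In training, the two weight vectors would be learned jointly with the back-end, letting the shallow branch emphasize low-level features and the deep branch emphasize high-level ones.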