Speech foundation models, pre-trained on large amounts of unsupervised or supervised audio data, have demonstrated an impressive ability to transfer their learning to specific domains for speech recognition. Parameter-efficient fine-tuning methods offer an efficient paradigm in which a small set of parameters is updated to adapt the foundation model to new tasks. However, it remains unclear how the intermediate features of the foundation model behave and how to utilize them more efficiently. In this paper, we compare the performance of three speech foundation models for speech recognition. We re-investigate how features from different layers behave and propose a simple and effective feature fusion method for efficient transfer learning. Experimental results demonstrate that the proposed method uses 31.7% fewer trainable encoder parameters and 13.4% less computational memory than the compared method, without compromising quality on the target task.