Recent fake audio detection methods often leverage large speech models to obtain robust speech representations. These models are typically very deep and provide multiple layer-wise representations. However, current works often rely solely on a single-layer representation, or on feature fusion that collapses all layers into one utterance-level representation, for decision making. Such approaches risk underutilizing the rich information across layers and may induce feature collapse. We propose a novel layer-wise decision fusion method that applies fusion after per-layer decision making; it achieves the best cross-dataset performance on the In-the-Wild dataset (EER 6.90%) compared with other strong baselines. This design also makes the model more transparent, allowing us to conduct detailed analysis that reveals the underlying decision-making mechanism.
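To make the core idea concrete, here is a minimal sketch of decision-level fusion over layer-wise representations. It assumes a frozen speech encoder that exposes one hidden-state tensor per layer; the head design, mean pooling, and learned softmax fusion weights are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class LayerwiseDecisionFusion(nn.Module):
    """Per-layer classifiers followed by decision-level fusion.

    Illustrative sketch: each encoder layer gets its own lightweight head
    that emits a real/fake logit, and the per-layer decisions are combined
    with learned softmax weights. Dimensions and pooling are assumptions.
    """

    def __init__(self, num_layers: int, hidden_dim: int):
        super().__init__()
        # One small decision head per encoder layer.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 1) for _ in range(num_layers)]
        )
        # Learnable fusion weights over per-layer decisions.
        self.fusion_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_reps: list) -> torch.Tensor:
        # layer_reps: list of (batch, time, hidden_dim) tensors,
        # one per encoder layer.
        per_layer_logits = []
        for head, rep in zip(self.heads, layer_reps):
            pooled = rep.mean(dim=1)               # utterance-level pooling
            per_layer_logits.append(head(pooled))  # (batch, 1) decision
        logits = torch.cat(per_layer_logits, dim=-1)  # (batch, num_layers)
        weights = torch.softmax(self.fusion_weights, dim=0)
        return (logits * weights).sum(dim=-1)      # fused utterance logit
```

Fusing decisions rather than features keeps every layer's judgment visible: inspecting the per-layer logits and the softmaxed fusion weights shows which layers drive the final decision, which is the kind of transparency the analysis above refers to.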