Advances in deep learning-based speech synthesis have made audio deepfakes increasingly realistic. Despite advanced techniques such as self-supervised learning (SSL) and larger training datasets, current state-of-the-art (SOTA) detection systems fail in out-of-domain scenarios because they do not generalize. This work explores the generalization problem through comprehensive cross-dataset evaluation. We examine how training data affects model generalization and show that even SOTA systems struggle to perform consistently across different evaluation settings. This indicates limited generalization ability, particularly in SSL-based approaches. To address this problem, we propose a multi-stage training framework combined with an ensemble of diverse systems to improve robustness and enable reliable detection in both known and unknown out-of-domain scenarios. Experimental results underscore the value of the ensemble approach in mitigating the limitations of individual systems.
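
To make the ensemble idea concrete, below is a minimal sketch of score-level fusion, one common way to combine detection systems. It assumes each detector outputs a per-utterance countermeasure score and fuses them by weighted averaging; the function name, the weighting scheme, and the example values are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def fuse_scores(scores, weights=None):
    """Fuse per-system countermeasure scores by (weighted) averaging.

    scores:  array of shape (n_systems, n_utterances); each row holds one
             detector's scores (higher = more likely bona fide).
    weights: optional per-system weights, e.g. derived from each system's
             in-domain performance; uniform weighting if omitted.
    """
    scores = np.asarray(scores, dtype=float)
    if weights is None:
        weights = np.ones(scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize to a convex combination
    return weights @ scores            # weighted mean score per utterance

# Example: three hypothetical detectors scoring four utterances.
scores = [
    [0.9, 0.2, 0.7, 0.1],  # e.g. an SSL front-end system
    [0.8, 0.4, 0.6, 0.3],  # e.g. a spectrogram-based CNN
    [0.7, 0.1, 0.9, 0.2],  # e.g. a raw-waveform model
]
print(fuse_scores(scores))  # per-utterance fused scores
```

Because the individual detectors fail on different out-of-domain conditions, averaging their scores can smooth out system-specific weaknesses, which is the intuition behind the ensemble approach described above.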