In recent years, Audio Deepfake Detection (ADD) models have shown promising results in intra-domain scenarios. However, they perform poorly in cross-domain scenarios, mainly because of the limited variety of domain types and attack methods in training data, as well as insufficient research on hidden feature representations. To address these issues, we present W2V-ASDG, a generalized ADD system comprising a self-supervised representation front-end and a domain generalization back-end. Furthermore, we aim to learn an ideal feature space that aggregates real speech and separates fake speech: fake speech varies significantly across forgery methods, whereas real speech varies far less. In light of this, we further propose the aggregation and separation domain generalization (ASDG) method as the back-end to learn a domain-invariant feature representation. Experiments show that W2V-ASDG outperforms baseline models in cross-domain settings, achieving the lowest average equal error rate (EER) of 4.60%.
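
The abstract does not spell out the exact ASDG objective. The following is a minimal sketch, assuming a centroid-based formulation of the aggregation-and-separation idea: real-speech embeddings from all domains are pulled toward a single centroid, while per-forgery-method fake-speech centroids are pushed apart by a margin. The function name `asdg_loss` and the `margin` parameter are illustrative assumptions, not the authors' definitions.

```python
# Illustrative sketch of an aggregation-and-separation style loss
# (assumed formulation; not the paper's exact objective).
import torch
import torch.nn.functional as F

def asdg_loss(feats, labels, domains, margin=1.0):
    """feats: (N, D) embeddings; labels: (N,) with 0 = real, 1 = fake;
    domains: (N,) integer forgery-method / domain ids; margin: assumed
    minimum distance enforced between cluster centroids."""
    real = feats[labels == 0]
    # Aggregation: real speech from every domain collapses to one cluster.
    real_center = real.mean(dim=0)
    agg = ((real - real_center) ** 2).sum(dim=1).mean()

    # Separation: one centroid per forgery method; keep all centroids
    # (fake clusters and the real cluster) at least `margin` apart.
    fake_centers = []
    for d in domains[labels == 1].unique():
        fake_centers.append(feats[(labels == 1) & (domains == d)].mean(dim=0))
    centers = [real_center] + fake_centers

    sep = feats.new_zeros(())
    pairs = 0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dist = torch.norm(centers[i] - centers[j])
            sep = sep + F.relu(margin - dist)  # hinge: penalize close centroids
            pairs += 1
    return agg + sep / max(pairs, 1)
```

Under this reading, minimizing the aggregation term makes the real class domain-invariant (a single tight cluster regardless of source domain), while the separation term prevents distinct forgery methods from collapsing together, which is consistent with the observation that fake speech varies far more across methods than real speech does.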