In this paper, we aim to bridge the performance gap between TDNN and 2D CNN based speaker verification systems. Specifically, three types of architectural enhancements to ECAPA-TDNN are proposed: 1) follow depth-first design to significantly increase network depth while maintaining its complexity. 2) introduce recursive convolution to better capture fine-grained speaker information. 3) propose pyramid-based multi-path feature enhancement module to yield more discriminative speaker representation. Experiments on Voxceleb show that our final model, named ECAPA++, achieves 25%, 23% and 24% relative improvements on Vox1-O, E and H respectively, while with 2.4x fewer parameters and 2.3x fewer FLOPs over the previous best TDNN-based system. Meanwhile, it is comparable to the state-of-the-art ResNet-based systems with higher computational efficiency. In addition, further performance gains can be achieved by fusing ECAPA++ and ResNet-based systems.