Embeddings extracted by deep neural networks have become the state-of-the-art utterance representation in speaker verification (SV). Despite the various network architectures that have been investigated in previous works, how to design and scale up networks to achieve a better trade-off on performance and complexity in a principled manner has been rarely discussed in the SV field. In this paper, we first systematically study model scaling from the perspective of the depth and width of networks and empirically discover that depth is more important than the width of networks for speaker verification task. Based on this observation, we design a new backbone constructed entirely from standard convolutional network modules by significantly increasing the number of layers while maintaining the network complexity following the depth-first rule and scale it up to obtain a family of much deeper models dubbed DF-ResNets. Comprehensive comparisons with other state-of-the-art systems on the Voxceleb dataset demonstrate that DF-ResNets achieve a much better trade-off than previous SV systems in terms of performance and complexity.