Attention mechanism provides an effective and plug-and-play feature enhancement module for speaker embedding extractors. Attention-based pooling layers have been widely used to aggregate a sequence of frame-level feature vectors into an utterance-level speaker embedding. Besides, convolution attention mechanisms are introduced into convolution blocks to improve the sensibility of speaker embedding extractors to those features with more discriminative speaker characteristics. However, it is still a challenging problem to make a good trade off between performance and model complexity for convolution attention models, especially for speaker recognition systems on low-resource edge computing nodes (smartphone, embedded devices, etc.). In this paper, we propose a lightweight convolution attention model named as CTFALite, which learns channel-specific temporal attention and frequency attention by leveraging both of the global context information and the local cross-channel dependencies. Experiment results demonstrate the effectiveness of CTFALite for improving performance. The further analysis about computational resource consumption shows that CTFALite achieves a better trade-off between performance and computational complexity, compared to other competing lightweight convolution attention mechanisms.