This study proposes a novel speaker embedding extractor architecture that effectively combines convolutional neural networks (CNNs) and Transformers. Building on the recently proposed CNNs-meet-vision-Transformers (CMT) architecture, we introduce two strategies for efficient speaker embedding extraction. First, we apply broadcast residual learning to the CMT building blocks, which extracts frequency-aware temporal features shared across the frequency dimension with fewer parameters. Second, we propose frequency-statistics-dependent attentive statistics pooling, which aggregates temporal statistics computed from the means and standard deviations of the input feature maps, weighted along the frequency axis by an attention mechanism. Experimental results on the VoxCeleb-1 dataset show that the proposed model outperforms several CNN- and Transformer-based models with a similar number of parameters. Moreover, the effectiveness of the proposed modifications to the CMT architecture is validated through ablation studies.
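To make the pooling step concrete, the following is a minimal PyTorch sketch of one way frequency-statistics-dependent attentive statistics pooling could be realized: frequency-axis attention yields weighted per-frame means and standard deviations, and a second attention over time aggregates them into a fixed-length vector. The module name, bottleneck width, and exact attention parameterization are illustrative assumptions, not the paper's precise formulation.

```python
import torch
import torch.nn as nn


class FreqAttentiveStatsPooling(nn.Module):
    """Sketch of frequency-statistics-dependent attentive statistics pooling.

    Input feature maps are assumed to have shape (batch, channels, freq, time).
    """

    def __init__(self, channels: int, bottleneck: int = 128, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Scores each (frequency, time) position, shared across channels.
        self.freq_attention = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv2d(bottleneck, 1, kernel_size=1),
        )
        # Scores each frame given its frequency-weighted mean/std statistics.
        self.time_attention = nn.Sequential(
            nn.Conv1d(2 * channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, 2 * channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, F, T)
        w_f = torch.softmax(self.freq_attention(x), dim=2)       # (B, 1, F, T)
        mu_f = (w_f * x).sum(dim=2)                               # (B, C, T)
        var_f = (w_f * x.pow(2)).sum(dim=2) - mu_f.pow(2)
        sd_f = var_f.clamp(min=self.eps).sqrt()                   # (B, C, T)

        # Attentive statistics pooling over time on the stacked statistics.
        stats = torch.cat([mu_f, sd_f], dim=1)                    # (B, 2C, T)
        w_t = torch.softmax(self.time_attention(stats), dim=2)    # (B, 2C, T)
        mu_t = (w_t * stats).sum(dim=2)                            # (B, 2C)
        var_t = (w_t * stats.pow(2)).sum(dim=2) - mu_t.pow(2)
        sd_t = var_t.clamp(min=self.eps).sqrt()                    # (B, 2C)
        return torch.cat([mu_t, sd_t], dim=1)                      # (B, 4C)


if __name__ == "__main__":
    # Dummy feature map standing in for a CMT backbone output.
    feats = torch.randn(8, 64, 10, 200)          # (batch, channels, freq, time)
    pooled = FreqAttentiveStatsPooling(64)(feats)
    print(pooled.shape)                           # torch.Size([8, 256])
```

In this sketch the utterance-level vector concatenates the attentive mean and standard deviation of the frequency-weighted statistics, yielding a 4C-dimensional output that a subsequent linear layer would project to the speaker embedding.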