Powered by self-supervised learning (SSL) on vast amounts of unlabeled data, computationally intensive audiovisual encoders (hybrid architectures combining a ResNet and a transformer in series) achieve state-of-the-art performance in audiovisual speech recognition (AV-ASR). In this work, we are the first to apply joint distillation and pruning (DP) with a teacher-student model to obtain an efficient and noise-robust audiovisual encoder. First, we compress the transformer of the AV encoder. Second, we extend joint DP to both the ResNet and the transformer of the hybrid AV encoder. In addition, we provide analyses of the teacher and the final student, respectively. With a similar number of parameters, our proposed student outperforms the previous state-of-the-art in the clean condition (word error rate of 3.1% vs. 4.6%) and across all noisy conditions, while reducing computational complexity by 31.8%. Our code is available on GitHub.
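To make the two ingredients of joint DP concrete, the following is a minimal NumPy sketch, not the paper's actual method: a knowledge-distillation loss (KL divergence between softened teacher and student output distributions) combined with one step of magnitude-based weight pruning. The temperature `T` and the 50% sparsity level are illustrative assumptions, not values taken from the work.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened teacher and student distributions,
    # averaged over the batch (the "distillation" half of joint DP).
    p = softmax(teacher_logits, T)  # teacher soft targets
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def magnitude_prune(weights, sparsity=0.5):
    # Zero out the smallest-magnitude fraction of weights
    # (the "pruning" half of joint DP).
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return weights * (np.abs(weights) > thresh)

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))   # toy batch of 4, 10 classes
student_logits = rng.normal(size=(4, 10))
w = rng.normal(size=(8, 8))                 # toy student weight matrix

loss = distillation_loss(student_logits, teacher_logits)
w_pruned = magnitude_prune(w, sparsity=0.5)
```

In training, the distillation term would be minimized while the pruning mask is applied to the student's weights, so compression and knowledge transfer happen jointly rather than as separate post-hoc steps.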