Although deep learning-based audio-visual speech recognition (AVSR) systems recognize base closed-set categories well, extending their discriminative ability to novel categories with limited labeled training data is challenging because the model easily overfits. In this paper, we propose Prototype-based Co-Adaptation with Transformer (Proto-CAT), a multi-modal generalized few-shot learning (GFSL) method for AVSR systems. Proto-CAT learns to recognize multi-modal objects from novel classes using only few-shot training data, while maintaining its performance on the base closed-set categories. The main idea is to transform the prototypes (i.e., class centers) by incorporating cross-modality complementary information and calibrating cross-category semantic differences. In particular, Proto-CAT co-adapts the embeddings at both the audio-visual and category levels, so that its predictions generalize dynamically across all categories. Proto-CAT achieves state-of-the-art performance on various AVSR-GFSL benchmarks. The code is available at https://github.com/ZhangYikaii/Proto-CAT.
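
To make the prototype idea concrete, the following is a minimal NumPy sketch, not the released Proto-CAT implementation. It assumes that audio and visual embeddings are already extracted, and it replaces the learned Transformer with a single parameter-free self-attention step over all audio and visual prototypes. The function names (`class_prototypes`, `co_adapt`, `predict`), the mixing weight `alpha`, and the synthetic data are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_prototypes(embeddings, labels, num_classes):
    """Class centers: mean embedding of each class's labeled samples."""
    return np.stack([embeddings[labels == c].mean(axis=0)
                     for c in range(num_classes)])


def co_adapt(audio_protos, video_protos, alpha=0.5):
    """Assumed stand-in for the Transformer: one parameter-free
    self-attention step over all audio and visual prototypes jointly,
    so each prototype attends across modalities and across categories."""
    num_classes, dim = audio_protos.shape
    tokens = np.concatenate([audio_protos, video_protos], axis=0)  # (2C, d)
    attn = softmax(tokens @ tokens.T / np.sqrt(dim), axis=-1)      # (2C, 2C)
    # Convex mix of original and attended prototypes (alpha is an assumption).
    adapted = (1.0 - alpha) * tokens + alpha * (attn @ tokens)
    # Fuse the two adapted modality prototypes of each class.
    return 0.5 * (adapted[:num_classes] + adapted[num_classes:])


def predict(audio_queries, video_queries, protos):
    """Nearest-prototype prediction over the joint base + novel label space."""
    queries = 0.5 * (audio_queries + video_queries)
    sq_dist = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)


# Synthetic toy data: 3 base classes (20 samples each), 2 novel classes (5 shots each).
rng = np.random.default_rng(0)
num_base, num_novel, dim = 3, 2, 8
num_classes = num_base + num_novel
centers = 5.0 * np.eye(num_classes, dim)
labels = np.concatenate([np.repeat(np.arange(num_base), 20),
                         np.repeat(np.arange(num_base, num_classes), 5)])
audio = centers[labels] + 0.1 * rng.standard_normal((labels.size, dim))
video = centers[labels] + 0.1 * rng.standard_normal((labels.size, dim))

protos = co_adapt(class_prototypes(audio, labels, num_classes),
                  class_prototypes(video, labels, num_classes))
preds = predict(audio, video, protos)  # labels in the joint base + novel space
```

The point of the sketch is that base and novel prototypes live in one joint label space and are adapted together, so a query is scored against all categories at once rather than by separate base and novel classifiers.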