ISCA Archive Interspeech 2023

Quantization-aware and Tensor-compressed Training of Transformers for Natural Language Understanding

Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang

Fine-tuned transformer models have shown superior performance on many natural language tasks. However, their large model size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, layer-by-layer distillation is applied to distill a quantized, tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing up to 63× compression with little accuracy loss and remarkable inference and training speedup.
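The following PyTorch sketch illustrates the two ingredients described above: a linear layer whose weight is stored as small low-rank tensor-train cores, and symmetric fake quantization of those cores with a learnable scale factor. The class names (FakeQuant, QuantTTLinear), the factorization 768 = 8 × 12 × 8, the TT ranks, and the straight-through estimator are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuant(nn.Module):
    """Symmetric fake quantization with a learnable scale factor (illustrative)."""

    def __init__(self, bits: int = 8):
        super().__init__()
        self.qmax = 2 ** (bits - 1) - 1
        self.scale = nn.Parameter(torch.tensor(0.1))  # learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.scale.abs() + 1e-8
        x_int = torch.clamp(x / s, -self.qmax, self.qmax)
        # Straight-through estimator: round in the forward pass, identity gradient
        # in the backward pass; the scale stays differentiable through / s and * s.
        x_int = x_int + (torch.round(x_int) - x_int).detach()
        return x_int * s


class QuantTTLinear(nn.Module):
    """Linear layer whose weight matrix is stored as low-rank tensor-train cores.

    The (d_out x d_in) weight is viewed as a higher-order tensor with
    d_out = prod(out_factors) and d_in = prod(in_factors); core k has shape
    (ranks[k], out_factors[k] * in_factors[k], ranks[k + 1]).
    """

    def __init__(self, in_factors, out_factors, ranks, bits: int = 8):
        super().__init__()
        assert len(in_factors) == len(out_factors) == len(ranks) - 1
        self.in_factors, self.out_factors = in_factors, out_factors
        self.cores = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(ranks[k], out_factors[k] * in_factors[k], ranks[k + 1]))
             for k in range(len(in_factors))]
        )
        # One shared quantizer for simplicity; per-core scales are a natural extension.
        self.quant = FakeQuant(bits)
        self.bias = nn.Parameter(torch.zeros(int(torch.prod(torch.tensor(out_factors)))))

    def weight(self) -> torch.Tensor:
        # Contract the fake-quantized TT cores back into a full weight matrix.
        w = self.quant(self.cores[0])
        for core in self.cores[1:]:
            q = self.quant(core)
            w = torch.einsum("ar,rbs->abs", w.reshape(-1, w.shape[-1]), q)
        d_out = int(torch.prod(torch.tensor(self.out_factors)))
        d_in = int(torch.prod(torch.tensor(self.in_factors)))
        # Reorder the interleaved (m_k, n_k) modes into a (d_out, d_in) matrix.
        shape = [f for pair in zip(self.out_factors, self.in_factors) for f in pair]
        w = w.reshape(shape)
        perm = list(range(0, len(shape), 2)) + list(range(1, len(shape), 2))
        return w.permute(perm).reshape(d_out, d_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight(), self.bias)


# Example: compress a 768 x 768 projection (768 = 8 * 12 * 8) with TT-rank 16.
layer = QuantTTLinear(in_factors=[8, 12, 8], out_factors=[8, 12, 8], ranks=[1, 16, 16, 1])
y = layer(torch.randn(4, 768))
print(y.shape)  # torch.Size([4, 768])
```

In a full setup, layers of this kind would replace the embedding and linear layers of the transformer, and the resulting quantized, tensor-compressed student would be trained either end to end or layer by layer against a pre-trained teacher, as the abstract describes.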