ISCA Archive Interspeech 2025

Ultra-Low Bit Post-Training Quantization of Large Speech Models via K-Means Clustering and Mixed Precision Allocation

Tianteng Gu, Bei Liu, Haoyu Wang, Yanmin Qian

Large speech foundation models like Whisper face significant deployment challenges due to their massive storage requirements. While post-training quantization (PTQ) offers a practical compression solution, existing methods suffer severe performance degradation below 8 bits, particularly for transformer-based architectures with prevalent weight outliers. We propose an ultra-low bit PTQ framework combining three key innovations: 1) K-means clustering for distribution-aware nonlinear quantization, 2) mixed-precision allocation based on column-wise outlier density, and 3) selective retention of critical outliers in sparse FP32 format. Evaluated on Whisper-Large-V3 (1.5B parameters), our method achieves 2.12-bit quantization with only a 0.17% absolute WER increase on LibriSpeech test-clean. The approach also preserves Whisper's robustness, showing less than 1% WER degradation across multiple datasets.
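To make the three ingredients concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assigns more bits to weight columns with a higher outlier fraction, fits a per-column K-means codebook for non-uniform quantization, and keeps extreme values in sparse FP32. The function names, the z-score outlier criterion, and the thresholds are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans


def allocate_bits(W, base_bits=2, high_bits=4, density_thresh=0.01, outlier_z=4.0):
    """Assign more bits to columns with a larger fraction of outlier weights.
    (Illustrative column-wise outlier-density rule; thresholds are assumptions.)"""
    mu = W.mean(axis=0)
    sigma = W.std(axis=0) + 1e-12
    density = (np.abs(W - mu) / sigma > outlier_z).mean(axis=0)
    return np.where(density > density_thresh, high_bits, base_bits)


def kmeans_quantize_column(w, n_bits=2, outlier_z=4.0):
    """Quantize one weight column with a K-means codebook of 2**n_bits centroids,
    retaining extreme outliers in full precision (sketch only)."""
    mu, sigma = w.mean(), w.std() + 1e-12
    outlier_mask = np.abs(w - mu) / sigma > outlier_z

    # Fit centroids on inlier weights so outliers do not distort the codebook.
    inliers = w[~outlier_mask].reshape(-1, 1)
    k = min(2 ** n_bits, max(len(inliers), 1))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(inliers)

    # Each weight is replaced by its nearest centroid (non-uniform levels).
    codes = km.predict(w.reshape(-1, 1))
    dequant = km.cluster_centers_[codes].ravel()

    # Critical outliers are stored sparsely in FP32 and restored on dequantization.
    dequant[outlier_mask] = w[outlier_mask]
    return codes, km.cluster_centers_.ravel(), outlier_mask, dequant


# Usage sketch: quantize a random weight matrix column by column.
W = np.random.randn(512, 64).astype(np.float32)
bits_per_col = allocate_bits(W)
W_hat = np.column_stack(
    [kmeans_quantize_column(W[:, j], n_bits=bits_per_col[j])[3] for j in range(W.shape[1])]
)
```

The average bit-width in such a scheme is a mixture of the base and high precisions weighted by how many columns exceed the outlier-density threshold, which is how a fractional figure like 2.12 bits can arise from integer per-column allocations.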