Large speech foundation models such as Whisper face significant deployment challenges due to their massive storage requirements. While post-training quantization (PTQ) offers a practical compression solution, existing methods suffer severe performance degradation below 8 bits, particularly for transformer-based architectures with prevalent weight outliers. We propose an ultra-low-bit PTQ framework combining three key innovations: 1) K-means clustering for distribution-aware nonlinear quantization, 2) mixed-precision allocation based on column-wise outlier density, and 3) selective retention of critical outliers in sparse FP32 format. Evaluated on Whisper-Large-V3 (1.5B parameters), our method achieves 2.12-bit quantization with only a 0.17% absolute WER increase on LibriSpeech test-clean. The approach also preserves Whisper's robustness, showing less than 1% WER degradation across multiple datasets.
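The first component, distribution-aware nonlinear quantization via K-means, can be illustrated with a minimal sketch (this is an illustrative reconstruction, not the paper's implementation; the function name and initialization scheme are assumptions). Each weight is mapped to the nearest of 2^b centroids learned with 1-D Lloyd's K-means, so the codebook levels track the empirical weight distribution rather than a uniform grid:

```python
# Minimal sketch of K-means codebook quantization (illustrative only,
# not the paper's implementation).
import numpy as np

def kmeans_quantize(W, bits=2, iters=25):
    """Quantize a weight matrix to 2**bits centroids via 1-D K-means."""
    w = W.ravel().astype(np.float64)
    k = 2 ** bits
    # Initialize centroids at evenly spaced quantiles of the weight
    # distribution (an assumed initialization choice).
    centroids = np.quantile(w, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign each weight to its nearest centroid.
        idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = w[idx == j].mean()
    idx = np.abs(w[:, None] - centroids[None, :]).argmin(axis=1)
    # Return the dequantized weights, the codebook, and the integer codes.
    return centroids[idx].reshape(W.shape), centroids, idx.reshape(W.shape)

# Toy usage: quantize a Gaussian weight matrix to 2 bits (4 levels).
W = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
W_q, codebook, codes = kmeans_quantize(W, bits=2)
```

In a mixed-precision scheme such as the one described, columns with high outlier density would instead receive more bits (a larger codebook), and the most extreme outliers would bypass the codebook entirely and be stored in sparse FP32.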