ISCA Archive Interspeech 2022

Squashed Weight Distribution for Low Bit Quantization of Deep Models

Nikko Strom, Haidar Khan, Wael Hamza

Inference with large deep learning models in resource-constrained settings is increasingly a bottleneck in real-world applications of state-of-the-art AI. Here we address this with low-precision weight quantization. We achieve very low accuracy degradation by re-parametrizing the weights in a way that makes the weight distribution approximately uniform. We show lower bit-width quantization and less accuracy degradation than previously reported in experiments on GLUE benchmarks (3-bit, 0.2% rel. degradation) and on internal intent/slot-filling datasets (2-bit, 0.4% rel. degradation).
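The abstract does not spell out the re-parametrization. A minimal sketch of one way to obtain an approximately uniform weight distribution, assuming roughly Gaussian weights and using the Gaussian CDF as the squashing function (the function names `squash`, `unsquash`, and `fake_quantize`, and the bisection-based inverse, are illustrative choices, not the authors' code):

```python
import math
import random

def squash(w, sigma):
    """Gaussian CDF: maps weights drawn from N(0, sigma^2)
    to values approximately uniform on (0, 1)."""
    return 0.5 * (1.0 + math.erf(w / (sigma * math.sqrt(2.0))))

def unsquash(u, sigma, lo=-20.0, hi=20.0):
    """Invert squash() by bisection (adequate for a sketch;
    a closed-form inverse erf would be used in practice)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if squash(mid, sigma) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def quantize_uniform(u, bits):
    """Snap a value in (0, 1) to the midpoint of one of
    2**bits equal-width bins (uniform quantization)."""
    n = 1 << bits
    k = min(n - 1, int(u * n))
    return (k + 0.5) / n

def fake_quantize(w, sigma, bits):
    """Squash -> uniform low-bit quantize -> unsquash."""
    return unsquash(quantize_uniform(squash(w, sigma), bits), sigma)

# Toy demo on synthetic Gaussian "weights".
random.seed(0)
sigma = 0.05
weights = [random.gauss(0.0, sigma) for _ in range(10000)]
for bits in (2, 3):
    q = [fake_quantize(w, sigma, bits) for w in weights]
    mse = sum((a - b) ** 2 for a, b in zip(weights, q)) / len(weights)
    print(f"{bits}-bit quantization MSE: {mse:.6e}")
```

Because the squashed values are near-uniform, the uniform quantizer's levels are used roughly equally often, which is what makes very low bit widths (2-3 bits) viable.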