Last updated: April 5, 2026 · Model Architecture · by Daniel Ashford
What is Quantization?
Compressing an LLM to use less memory by reducing numerical precision.
Definition
Quantization is a compression technique that reduces the numerical precision of model parameters from 16-bit floating point to lower-precision formats (8-bit, 4-bit, or 2-bit). This dramatically reduces memory requirements and can speed up inference.
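The core idea can be sketched with simple symmetric int8 quantization — a minimal illustration of the technique, not any particular library's implementation (the function names here are hypothetical):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: map floats into [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # one scale factor per tensor
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for computation."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.53, 0.98, -0.07], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Each int8 value uses 1 byte vs 2 bytes for float16, and the
# round-trip error is bounded by half a quantization step.
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)  # → True
```

Real formats such as GPTQ and AWQ are more sophisticated (per-group scales, calibration data, error compensation), but all rest on this map-to-integers-and-rescale idea.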
How It Works
A 70B-parameter model in 16-bit precision requires approximately 140GB of memory for its weights alone. Quantized to 4-bit, it fits in approximately 35GB and can run on far fewer GPUs. 8-bit quantization typically preserves 99%+ of model quality, while aggressive 2-bit quantization can cause noticeable degradation. Popular quantization formats include GGUF, GPTQ, and AWQ.
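The memory figures above follow directly from parameter count times bits per parameter (real quantization formats add a small overhead for scale factors, ignored in this sketch):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight memory: parameters * (bits / 8) bytes each."""
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4, 2):
    print(f"70B model at {bits}-bit: ~{model_memory_gb(70e9, bits):.0f} GB")
# 16-bit -> ~140 GB, 8-bit -> ~70 GB, 4-bit -> ~35 GB, 2-bit -> ~18 GB
```

Actual serving also needs memory for activations and the KV cache, so these weight-only numbers are a lower bound on hardware requirements.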
Example
Running Llama 3.1 405B at 16-bit precision requires roughly 810GB for weights alone — more than the 640GB an 8x A100 (80GB) node provides. Quantized to 4-bit, the weights shrink to roughly 203GB, so the model can run on 3x A100 80GB GPUs with minimal quality loss.
See How Models Compare
Understanding quantization is important when choosing the right AI model. See how 12 models compare on our leaderboard.