AI & machine learning

Quantisation

Compressing an AI model by storing each parameter with fewer bits of precision. Quantisation cuts model size and inference cost by 2–4× with only a small loss in output quality, making big models practical to run on consumer hardware.

Also known as: quantization, model compression

Quantisation is the most important practical optimisation for running AI models on hardware you can afford. By default, model parameters are stored as 16-bit or 32-bit floating-point numbers. A 70-billion-parameter model at FP16 takes 140GB of memory, which won’t fit on any single consumer GPU. Quantising the same model to 8-bit cuts memory to 70GB. To 4-bit, 35GB. To 2-bit, ~17GB. The same 70B model can go from “needs a data centre” to “runs on a single RTX 4090” through quantisation alone.
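
The arithmetic behind those figures is straightforward: weight memory is just parameter count times bits per parameter. A quick back-of-the-envelope sketch in Python (weights only; it ignores the KV cache, activations, and the small per-block overhead real quantised formats add for scales):

```python
# Back-of-the-envelope weight memory for a 70B-parameter model.
# Weights only: ignores KV cache, activations, and the per-block
# scale/zero-point overhead that real formats (GGUF, GPTQ, ...) add.
PARAMS = 70e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2)]:
    gigabytes = PARAMS * bits / 8 / 1e9
    print(f"{label:>5}: {gigabytes:6.1f} GB")

# FP16 : 140.0 GB
# 8-bit:  70.0 GB
# 4-bit:  35.0 GB
# 2-bit:  17.5 GB  (small enough for a 24GB consumer GPU)
```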

The tradeoff is precision. Each parameter carries slightly less detail, so the model’s outputs are slightly less accurate. The quality loss is usually small at 8-bit (often imperceptible), noticeable but acceptable at 4-bit (the most popular GGUF quantisations used by llama.cpp sit around 4 bits per weight), and significant at 2-bit and below (only worth it if you really need the size reduction). Modern quantisation techniques (GPTQ, AWQ, bitsandbytes, EXL2) use tricks such as calibration data, activation-aware scaling, and error-compensating rounding to minimise the quality loss while still hitting aggressive compression ratios.
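
As an illustration of how this looks in practice, here is a minimal sketch of loading a model in 4-bit with the Hugging Face transformers + bitsandbytes stack (the model name is a placeholder; GPTQ, AWQ, and EXL2 weights are loaded through their own configs and loaders instead):

```python
# Minimal sketch: load a model with on-the-fly 4-bit (NF4) quantisation
# via bitsandbytes. Assumes transformers, bitsandbytes and a CUDA GPU;
# the model repo is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any causal LM repo

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,               # store weights in 4-bit
    bnb_4bit_quant_type="nf4",       # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,  # also quantise the quantisation constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to bf16 for matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",               # spread layers across available devices
)

inputs = tokenizer("Quantisation is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```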

Different parts of a model can be quantised differently. Some layers (typically attention layers) are more sensitive to precision loss than others (typically feedforward layers). Mixed-precision quantisation keeps the sensitive layers at higher precision while aggressively quantising the rest. This is most of what makes 4-bit quantisation practical: not all 4-bit numbers are created equal, and the smart placement of higher-precision values matters more than the average bit width.
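
A toy illustration of the idea, not any particular library’s algorithm: plain round-to-nearest quantisation where a hypothetical “sensitive” layer is kept at 8-bit and the rest drops to 4-bit, with the reconstruction error showing the difference.

```python
# Toy mixed-precision sketch: symmetric round-to-nearest quantisation,
# keeping a "sensitive" layer at 8-bit and the rest at 4-bit.
# Illustrative only -- real schemes (GPTQ, AWQ, GGUF k-quants) work
# block-wise and choose bit widths far more carefully.
import numpy as np

def quantise_dequantise(weights: np.ndarray, bits: int) -> np.ndarray:
    """Quantise to signed integers with `bits` bits, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
layers = {  # small random stand-ins for real weight matrices
    "attn.q_proj": rng.normal(0, 0.02, (1024, 1024)),
    "mlp.up_proj": rng.normal(0, 0.02, (1024, 4096)),
}
bit_plan = {"attn.q_proj": 8, "mlp.up_proj": 4}  # hypothetical sensitivity plan

for name, w in layers.items():
    w_hat = quantise_dequantise(w, bit_plan[name])
    err = np.abs(w - w_hat).mean()
    print(f"{name}: {bit_plan[name]}-bit, mean abs error {err:.2e}")
```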

In DeAI, quantisation is what makes self-hosted inference plausible for non-experts. A 70B model is out of reach for a $1500 RTX 4090 (24GB of VRAM) at full precision; with 4-bit quantisation the weights shrink to ~35GB, so llama.cpp can keep most layers in VRAM and offload the remainder to system RAM at usable speeds, and the most aggressive ~2-bit quants fit on the card outright. Open-weight models on HuggingFace are usually published in both full-precision (for fine-tuning) and quantised (for inference) versions. The OYM “Why Self-Host Your AI” article walks through the practical hardware-vs-quantisation tradeoffs for different model sizes.
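
For the self-hosting path, a quantised GGUF file can be driven from Python too; a minimal sketch with the llama-cpp-python bindings, where the file path and layer count are placeholders and n_gpu_layers controls how much of the model is offloaded to the GPU versus kept in system RAM:

```python
# Minimal sketch: run a 4-bit GGUF quant with llama-cpp-python.
# The model path and n_gpu_layers value are placeholders for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # hypothetical 4-bit GGUF file
    n_gpu_layers=40,  # offload as many layers as fit in VRAM; the rest run on CPU
    n_ctx=4096,       # context window
)

out = llm("Explain quantisation in one sentence:", max_tokens=64)
print(out["choices"][0]["text"])
```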

Related terms