AI & machine learning

Gradient Compression

A family of techniques for shrinking the gradient updates exchanged between GPUs during training. Gradient compression makes decentralised AI training feasible by cutting the volume of data exchanged by 100-1000x with only small accuracy losses.

Also known as: distributed training optimisation, communication compression

Training large neural networks across many GPUs requires those GPUs to constantly exchange gradient updates. After each batch of training data, every GPU has computed its own gradient (its own estimate of which way the parameters should move), and those gradients must be combined into a single update before the next batch. In a centralised cluster with high-speed interconnects (NVLink, InfiniBand), this exchange takes milliseconds. Across the public internet, the same exchange would take seconds or minutes per step, which makes training thousands of times slower.
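A rough back-of-envelope calculation makes the gap concrete. The model size, link speeds, and compression ratio below are illustrative assumptions, not figures from any particular system:

```python
# Back-of-envelope: how long one dense gradient exchange takes at different link
# speeds. Model size, link speeds, and the 200x ratio are illustrative assumptions.
params = 1_000_000_000                 # a hypothetical 1B-parameter model
dense_bytes = params * 4               # fp32 gradients: ~4 GB per exchange

datacentre_bw = 400e9 / 8              # 400 Gbit/s InfiniBand-class link, in bytes/s
internet_bw = 100e6 / 8                # 100 Mbit/s consumer connection, in bytes/s

print(f"dense, datacentre link   : {dense_bytes / datacentre_bw:7.2f} s per step")
print(f"dense, public internet   : {dense_bytes / internet_bw:7.0f} s per step")
print(f"200x compressed, internet: {dense_bytes / 200 / internet_bw:7.1f} s per step")
```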

Gradient compression solves this by shrinking the gradients before sending them. The main techniques are top-k sparsification (only send the largest 1-10% of gradient values, treating the rest as zero), quantisation (compress each value from 32 bits down to 4, 2, or even 1 bit), and chunked communication (split the gradient into chunks and overlap their exchange with computation, so GPUs aren’t sitting idle waiting for the full gradient exchange before starting the next batch). Combined, these techniques cut the data exchanged between GPUs by 100-1000x with surprisingly small accuracy losses on the final trained model.
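A minimal sketch of the first two techniques in PyTorch. This is an illustrative toy compressor, not SparseLoCo or DisTrO; the function names, the 1% keep ratio, and the single shared scale are assumptions chosen for clarity:

```python
import torch

def compress(grad: torch.Tensor, k_frac: float = 0.01):
    """Top-k sparsification followed by 1-bit quantisation of the kept values."""
    flat = grad.flatten()
    k = max(1, int(k_frac * flat.numel()))
    # Top-k sparsification: keep only the k largest-magnitude entries.
    _, indices = torch.topk(flat.abs(), k)
    kept = flat[indices]
    # 1-bit quantisation: send one sign per kept value plus a single shared scale.
    scale = kept.abs().mean()
    signs = torch.sign(kept)               # +1 / -1, packable into 1 bit each
    return indices, signs, scale

def decompress(indices, signs, scale, numel):
    """Rebuild a dense (mostly zero) gradient from the compressed representation."""
    grad = torch.zeros(numel)
    grad[indices] = signs * scale
    return grad

# Example: a 10-million-value toy "gradient" shrinks to ~1% of its entries,
# each carried as an index plus a single bit of sign information.
g = torch.randn(10_000_000)
indices, signs, scale = compress(g, k_frac=0.01)
g_hat = decompress(indices, signs, scale, g.numel())
```

Production schemes typically also keep an error-feedback residual: the values dropped on one step are added back into the next step's gradient, so they are delayed rather than lost.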

The mathematical insight is that most of the information in a gradient is concentrated in a small fraction of its values. Most parameters need only a tiny update (close to zero) on any given step, so dropping them costs almost nothing. The few parameters with large updates carry most of the signal, and those are exactly the values top-k sparsification preserves. Quantisation works because training quality depends more on the direction of the gradient than on its precise magnitude.
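A quick toy check of that concentration claim. Real gradients tend to be heavy-tailed; the synthetic heavy-tailed tensor below is a hypothetical stand-in for one layer's gradient, not data from a real training run:

```python
import torch

# What fraction of the squared gradient norm do the top 1% of entries hold?
# A heavy-tailed Student-t sample stands in for a real gradient's value distribution.
g = torch.distributions.StudentT(df=2.0).sample((1_000_000,))

k = int(0.01 * g.numel())                     # keep the top 1% by magnitude
top_vals, _ = torch.topk(g.abs(), k)

energy_kept = ((top_vals ** 2).sum() / (g ** 2).sum()).item()
print(f"top 1% of entries carry {energy_kept:.0%} of the squared gradient norm")
```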

In DeAI, gradient compression is the technical foundation that makes decentralised training plausible. Templar’s Covenant-72B paper (March 2026) describes SparseLoCo, a specific gradient compression scheme that combines top-k sparsification, 2-bit quantisation, and chunked communication to achieve over 146x compression versus dense gradients. Nous Research’s DisTrO (used in their Psyche network) uses a different but related set of techniques. Without these compression methods, training a 72B model across a few dozen GPUs spread across the public internet would take orders of magnitude longer than centralised training. With them, it’s slow but possible.

Related terms