Gradient
In machine learning, a measure of how a model's prediction error changes as each of its parameters changes, which tells you how to adjust the parameters to reduce that error. Training is a long process of computing gradients and nudging the parameters downhill, in the direction the gradient suggests.
See also: gradient descent (the algorithm that uses gradients to train models)
Gradients are the mathematical machinery that makes training neural networks possible. After the model produces an output for a training example, an error function (the “loss”) measures how wrong the output was. The gradient is the vector of partial derivatives of that loss with respect to every parameter in the model: it tells you, for each of the model’s billions of internal numbers, which way and how steeply the error changes if you nudge that number. Step each parameter a little against its gradient and the model gets slightly less wrong on the next example.
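In symbols, this is the textbook gradient-descent update (a standard formulation, not specific to any system named here): with parameters θ, loss L, and a small learning rate η,

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)
```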
The actual training loop is just this process repeated billions of times. Show the model an input. Compute the output. Compare to the correct answer. Compute the gradient with respect to every parameter. Nudge each parameter in the direction the gradient suggests. Move on to the next example. After enough iterations, the parameters settle into a configuration that produces accurate outputs on most inputs. The whole field of deep learning is built on this loop.
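As a concrete illustration, here is a minimal sketch of that loop in PyTorch. The toy model, the fake data, and the learning rate are all placeholder choices for illustration, not details from any real training run:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                        # toy model with a handful of parameters
loss_fn = nn.MSELoss()                          # the "loss": how wrong was the output?
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    x = torch.randn(32, 10)                     # show the model an input (fake data here)
    target = x.sum(dim=1, keepdim=True)         # the "correct answer" for this toy task
    output = model(x)                           # compute the output
    loss = loss_fn(output, target)              # compare to the correct answer
    optimizer.zero_grad()
    loss.backward()                             # compute the gradient w.r.t. every parameter
    optimizer.step()                            # step each parameter down its gradient (θ ← θ − η·∇L)
```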
Gradients are also where decentralised training gets hard. In a centralised training cluster, the GPUs exchange gradients over high-speed interconnects (NVLink, InfiniBand) and can synchronise hundreds of times per second with negligible overhead. In a decentralised setup where the GPUs are spread across the public internet, that synchronisation is the bottleneck: sending raw gradients between distant nodes for every update would take so long that training would be orders of magnitude slower than a centralised run.
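Some back-of-envelope arithmetic makes the gap concrete. The model size, gradient precision, and link speeds below are assumptions chosen for a round example, not figures from any paper:

```python
# Assumptions: a 72B-parameter model, 2-byte (bf16) gradients, an
# H100-class ~900 GB/s NVLink link inside a cluster, and a 1 Gbit/s
# internet link between decentralised nodes.
params = 72e9
payload_bytes = params * 2                        # ~144 GB of raw gradients per sync

nvlink_bytes_per_s = 900e9                        # in-cluster interconnect
internet_bytes_per_s = 1e9 / 8                    # 1 Gbit/s link

print(payload_bytes / nvlink_bytes_per_s)         # ~0.16 seconds per sync
print(payload_bytes / internet_bytes_per_s / 60)  # ~19 minutes per sync
```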
The fix is gradient compression: techniques like top-k sparsification (only send the most significant gradient values), 2-bit quantisation (compress each value to 2 bits), and chunked communication (overlap computation and communication). Together these techniques cut bandwidth by 100-1000x. Templar’s Covenant-72B paper claimed 146x compression versus dense gradients using a method called SparseLoCo. Without these tricks, decentralised training of frontier models would be infeasible. With them, it’s slow but possible.
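To make the first of those techniques concrete, here is a minimal sketch of top-k sparsification, assuming PyTorch tensors. This illustrates the general idea only; it is not Templar's SparseLoCo implementation, and real systems layer quantisation, chunking, and typically error feedback on top:

```python
import math
import torch

def topk_sparsify(grad: torch.Tensor, k: int):
    """Keep only the k largest-magnitude values; ship (values, indices)
    instead of the full dense gradient. Hypothetical helper for illustration."""
    flat = grad.flatten()
    _, indices = torch.topk(flat.abs(), k)        # positions of the k biggest entries
    return flat[indices], indices                 # signed values at those positions

def densify(values, indices, shape):
    """Receiver side: rebuild a dense gradient, with zeros everywhere else."""
    flat = torch.zeros(math.prod(shape))
    flat[indices] = values
    return flat.reshape(shape)

grad = torch.randn(1000, 1000)                    # one million gradient values
values, indices = topk_sparsify(grad, k=10_000)   # keep only the top 1%
restored = densify(values, indices, grad.shape)
# Values plus indices here are ~50x fewer bytes than the dense tensor,
# before quantisation shrinks the surviving values further.
```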