AI & machine learning

RL

Reinforcement Learning. A training paradigm where an AI agent takes actions in an environment, receives rewards or penalties for the outcomes, and learns a policy that maximises long-term reward. Used heavily for aligning modern LLMs.

Also known as: Reinforcement Learning, RLHF

Reinforcement learning is one of the three main ML paradigms (along with supervised and unsupervised learning), and it’s the one most associated with classic AI breakthroughs: AlphaGo beating the world’s best Go players, OpenAI Five beating professional Dota 2 teams, robotic arms learning to manipulate objects through trial and error. The defining characteristic is that the model learns from outcomes rather than from labelled examples. There’s no “correct answer” given upfront. Instead the model tries actions, sees what happens, and gradually learns which sequences of actions lead to good outcomes.
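
To make the trial-and-error loop concrete, here is a minimal tabular Q-learning sketch (a toy illustration, not code from any of the systems above). An agent in a hypothetical five-cell corridor starts knowing nothing, receives a reward only for reaching the rightmost cell, and learns a value table that makes “move right” the greedy policy. The environment and hyperparameters are invented for the example.

```python
import random

# Toy environment (an assumption for this sketch): a 5-cell corridor where
# the only reward is +1 for reaching the rightmost cell.
N_STATES, ACTIONS = 5, [0, 1]           # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action] value table

def step(state, action):
    """Move one cell left or right; the episode ends at the rightmost cell."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action else -1)))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

for _ in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward the reward plus the
        # discounted value of the best action available in the next state.
        target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print(Q)  # after training, action 1 ("right") dominates in every state
```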

For language models, RL became central with RLHF (Reinforcement Learning from Human Feedback). The setup: take a pre-trained LLM, ask humans to rank multiple outputs the model produces for the same prompt, train a “reward model” on those human preferences, then fine-tune the LLM with RL to maximise the reward model’s score. The result is a model that isn’t just predicting the next token but actively trying to produce outputs humans prefer. This is a large part of what makes ChatGPT feel different from base GPT-3 even though they share the same underlying architecture.
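
The reward-model step reduces to a simple preference-ranking loss. Below is a hedged sketch in PyTorch: the 768-dimensional embeddings and the tiny two-layer scoring head are placeholders (a real RLHF reward model is typically a full LLM with a scalar head scoring prompt-plus-response text), but the Bradley-Terry loss at the centre is the standard one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: maps an embedding to a single scalar score.
# (A real RLHF reward model is usually a full transformer with a scalar head.)
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder batch: in practice these would be representations of
# (prompt + human-preferred response) and (prompt + rejected response).
chosen_emb = torch.randn(32, 768)
rejected_emb = torch.randn(32, 768)

for _ in range(100):
    r_chosen = reward_model(chosen_emb)       # score for preferred outputs
    r_rejected = reward_model(rejected_emb)   # score for rejected outputs
    # Bradley-Terry preference loss: push P(chosen beats rejected) toward 1.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```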

DPO (Direct Preference Optimisation) and Constitutional AI are alternative methods that achieve similar goals without the full RL machinery: DPO fine-tunes directly on preference pairs with no separate reward model, while Constitutional AI substitutes AI feedback, guided by a written set of principles, for human feedback. These methods are usually cheaper and more stable than RLHF, which is why most modern open-weight models use them in some combination. Llama 4, Qwen 3, Hermes 4, and most other recent models are post-trained with a mix of SFT, RLHF, and DPO.
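
To show how DPO skips the reward model, here is a minimal sketch of its loss function. The numbers are toy values; in practice each log-probability would be the summed token log-probs of a whole response under the policy being trained and under a frozen reference copy of it.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO objective: reward the policy for shifting probability mass toward
    the preferred response, relative to a frozen reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy values standing in for summed per-response token log-probabilities.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)  # lower when the policy prefers the chosen responses more
```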

In DeAI, RL training is one of the workloads decentralised compute networks have actively explored. Prime Intellect’s INTELLECT-2 was a 32B RL fine-tune of QwQ-32B, trained across a global, permissionless pool of workers. Templar’s Covenant, which runs on Bittensor, used RL-style competitive training in its subnet incentive mechanism. RL is computationally cheaper than full pre-training (you start from an existing model), which makes it a more achievable target for distributed networks, but the synchronisation requirements are still significant. The OYM “Decentralised AI Training” article covers the practical landscape.

Related terms