Inference
Running a trained AI model to produce an answer. Inference is what happens when you type a prompt into ChatGPT and get a response. The model takes your input, computes a best guess, and returns it.
Also known as: model serving, prediction
Inference is the part of AI that users actually experience. Every time you type into a chat interface, generate an image, ask a voice assistant a question, or paste a document into a summariser, you’re running inference against a model someone else trained. Training built the model. Inference uses it.
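To make that concrete, here is a toy sketch of the loop a language model runs during inference: predict one token, append it, repeat until the model signals it is done. The bigram lookup table is a stand-in for a real neural network; only the shape of the loop carries over.

```python
# Toy sketch of the inference loop. The "model" here is just a bigram
# lookup table, but the loop has the same shape as real LLM serving:
# one forward pass per generated token, stopping at an end-of-text marker.

TOY_MODEL = {          # placeholder for a neural network's next-token function
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<eos>",
}

def generate(prompt: str, max_new_tokens: int = 16) -> str:
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        next_token = TOY_MODEL.get(tokens[-1], "<eos>")  # the "forward pass"
        if next_token == "<eos>":                        # model signals it's done
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("the"))  # -> "the cat sat down"
```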
The two phases have completely different economics. Training is a massive one-off investment: tens of millions of dollars of GPU time, huge datasets, months of engineering. Inference is a recurring cost: every query burns compute and memory, and the bill scales linearly with usage. Training a frontier model might cost $50 million once; inference on that model might cost $50 million per month if enough people use it. That’s why the inference layer is where commercial AI is won or lost.
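A back-of-envelope calculation shows how quickly the recurring cost compounds. Every figure below is an illustrative assumption, not any provider’s real pricing:

```python
# Back-of-envelope inference economics. Every figure here is an
# illustrative assumption, not a real provider's price list.

cost_per_1k_tokens = 0.01        # assumed blended serving cost, $/1K tokens
tokens_per_query = 1_500         # assumed prompt + response length
queries_per_day = 100_000_000    # assumed traffic at mass-market scale

daily_cost = queries_per_day * (tokens_per_query / 1_000) * cost_per_1k_tokens
monthly_cost = daily_cost * 30

print(f"Daily:   ${daily_cost:,.0f}")    # $1,500,000
print(f"Monthly: ${monthly_cost:,.0f}")  # $45,000,000: the ~$50M/month ballpark
```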
Inference cost per query depends on three things: how big the model is (bigger = more compute per token), how long the input and output are (context window matters), and how efficient the hardware is (H100 vs A100 vs consumer GPU). Quantisation (running the model in 4-bit or 8-bit precision instead of 16-bit) can cut inference cost by 2-4x with small quality losses. This is why open-weight models in smaller quantised versions are so important for sovereignty: they make self-hosted inference actually affordable.
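The arithmetic behind that is simple: weight memory scales with bits per parameter, and memory is a reasonable first-order proxy for serving cost. The 70B parameter count and the hardware notes in the comments are illustrative assumptions:

```python
# Why precision matters: weight memory at different quantisation levels.
# The 70B parameter count and the hardware fits below are illustrative
# assumptions; real serving cost also depends on KV cache, kernels
# and batch size.

params = 70e9  # e.g. a 70B-parameter open-weight model

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    gigabytes = params * bits / 8 / 1e9
    print(f"{name:5s} {gigabytes:5.0f} GB of weights")

# fp16   140 GB -> needs multiple datacentre GPUs
# int8    70 GB -> roughly fits a single 80 GB H100 or A100
# int4    35 GB -> fits a high-end workstation or two 24 GB consumer cards
```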
The DeAI angle on inference is about who sees the input and who gets paid for the output. Centralised inference (OpenAI, Anthropic) means every query passes through the provider’s servers: they can read it, and they set the price. Decentralised inference (Venice, Morpheus, io.net, Akash, Bittensor subnets) distributes the work across independent operators, usually with some combination of privacy guarantees (TEEs, encrypted routing) and payment in the protocol’s token. Neither is universally better: centralised tends to be cheaper and higher quality today; decentralised wins on sovereignty and verifiability. The OYM project reviews consistently score inference projects on both axes.