
Training Without a Data Centre

Decentralised AI training was supposed to be impossible. A 30,000x bandwidth gap said so. Then a Bittensor subnet trained a 72 billion parameter model over commodity internet. Here's what changed.

Inside a data centre, GPUs talk to each other at 1,800 GB/s. Your home internet does about 60 MB/s on a good day.

That is a 30,000x gap.

Training a large language model means synchronising what every GPU has learned with every other GPU, thousands of times per step. Every step. For weeks. The entire architecture of modern AI training assumes those GPUs sit in the same building, wired together with specialist hardware that costs more than most houses.

This is why the received wisdom was simple: you cannot train serious models across the open internet. The bandwidth kills you. The latency kills you. The synchronisation overhead makes the whole thing pointless.

That wisdom held for years. It does not hold any more.

Why training is the hardest layer

Inference is the easy part of decentralisation. A user sends a prompt, a GPU processes it, the response comes back. Each request is independent. You can shard inference across thousands of nodes with minimal coordination. That is why projects like Akash and Bittensor’s Chutes subnet already handle real inference workloads today.

Training is fundamentally different. When you train a model, every GPU needs to know what every other GPU learned. Not eventually, but continuously. In standard data-parallel training, each GPU processes a different batch of data, computes gradients, and then all GPUs synchronise those gradients before taking the next step. Ring all-reduce (the standard synchronisation algorithm) is bandwidth-optimal within a data centre. Over the internet, it is a disaster.
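
To make the cost concrete, here is a minimal sketch of one data-parallel step in PyTorch-style code. The helper names are illustrative; the point is that the all-reduce over every parameter's gradient happens on every single step.

```python
# Minimal sketch of one data-parallel training step (PyTorch-style, illustrative).
# Every worker computes gradients on its own batch, then all workers must
# exchange and average those gradients before anyone can take the next step.
import torch
import torch.distributed as dist

def data_parallel_step(model, batch, loss_fn, optimizer):
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()                              # local gradients only

    world_size = dist.get_world_size()
    for p in model.parameters():                 # full gradient exchange, every step
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size                 # average across all workers

    optimizer.step()                             # identical update on every rank
    optimizer.zero_grad()
```

For a 70 billion parameter model, that exchange is on the order of the full gradient tensor (roughly 140 GB in 16-bit precision) every step. NVLink and InfiniBand absorb that; a home connection does not.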

Put numbers on it:

The bandwidth gap between data centre and internet training

| Connection | Bandwidth | Context |
| --- | --- | --- |
| NVLink (intra-node) | 1,800 GB/s | GPU-to-GPU within a server |
| InfiniBand (inter-node) | 50-100 GB/s | Server-to-server in a data centre |
| Commodity internet | ~0.06 GB/s | What decentralised networks use |

Tensor parallelism (taking one layer of a neural network and splitting its computations across multiple GPUs) cannot cross node boundaries at all. It needs NVLink-class bandwidth. Latency within a data centre is around 500 microseconds. Internet latency runs 50 to 150 milliseconds. That is a 100 to 300x penalty before you even think about throughput.

Then there is memory. A 70 billion parameter model in mixed precision needs roughly 140 GB just for the weights. Add optimiser states and activations, and you are looking at 400 to 600 GB. A single H100 has 80 GB. You need multi-GPU setups regardless of whether they are centralised or decentralised.
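
A rough back-of-envelope calculation shows where those numbers come from. The breakdown below assumes bf16 weights, gradients, and optimiser moments and ignores activations, so it is one plausible accounting rather than a precise bill.

```python
# Rough memory accounting for a 70B-parameter model. One plausible breakdown
# that lands in the 400-600 GB range quoted above; exact numbers depend on
# optimiser-state precision and whether a full fp32 master copy is kept.
params = 70e9
GB = 1e9

weights = params * 2 / GB      # bf16 weights            ~140 GB
grads   = params * 2 / GB      # bf16 gradients          ~140 GB
adam    = params * 2 * 2 / GB  # two Adam moments, bf16  ~280 GB (fp32 doubles this)

total = weights + grads + adam
print(f"weights only:        {weights:.0f} GB")                     # 140 GB
print(f"with training state: {total:.0f} GB, before activations")   # 560 GB
print(f"minimum H100s at 80 GB each: {total / 80:.0f}")             # 7
```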

So how did anyone solve this?

The compression breakthrough

The answer is not faster internet. It is sending less data, less often.

DiLoCo (Distributed Low-Communication training), published by Google DeepMind in 2023, rewrote the playbook. Instead of synchronising gradients every step, DiLoCo lets each worker run hundreds of local optimisation steps independently, then synchronise only compressed pseudo-gradients. The inner optimiser (AdamW) runs locally. The outer optimiser (Nesterov momentum) runs globally but infrequently. A 500x reduction in communication frequency. Add 8-bit quantisation on top and you get up to 2,000x total compression.
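
A minimal sketch of that inner/outer structure, assuming PyTorch-style optimisers; H=500 and the variable names are illustrative, not the paper's.

```python
# Sketch of the DiLoCo inner/outer loop described above (illustrative, simplified).
# Each worker runs H local AdamW steps, then all workers average their
# "pseudo-gradient" (old global weights minus new local weights) and apply it
# with an outer Nesterov-momentum optimiser. Communication happens once per round.
import torch
import torch.distributed as dist

def diloco_round(model, inner_opt, outer_opt, data_iter, loss_fn, H=500):
    global_params = [p.detach().clone() for p in model.parameters()]

    for _ in range(H):                            # inner loop: purely local
        batch = next(data_iter)
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        inner_opt.step()                          # e.g. AdamW
        inner_opt.zero_grad()

    world_size = dist.get_world_size()
    for p, g in zip(model.parameters(), global_params):
        pseudo_grad = g - p.detach()              # how far this worker drifted
        dist.all_reduce(pseudo_grad, op=dist.ReduceOp.SUM)   # the only communication
        p.data.copy_(g)                           # reset to the old global weights
        p.grad = pseudo_grad / world_size         # hand the averaged delta to the outer step

    outer_opt.step()                              # e.g. SGD with Nesterov momentum
    outer_opt.zero_grad()
```

With H=500, the workers talk once every 500 steps instead of every step, which is where the 500x reduction in communication frequency comes from.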

SparseLoCo, the algorithm behind Bittensor’s Covenant-72B, pushes this further. It selects only the top 1 to 3% of gradient components (chunk-wise Top-k selection), quantises them to 2 bits, and carries forward the unselected components via error feedback. According to the Covenant-72B paper, compression hits 146x on top of the already-reduced sync frequency.
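
The core of that compressor, minus the 2-bit quantisation and the chunk-wise bookkeeping, fits in a few lines. The 1% fraction and the flattening are illustrative simplifications, not the paper's exact scheme.

```python
# Sketch of Top-k sparsification with error feedback, the core idea behind
# SparseLoCo's compression. The surviving values would additionally be
# quantised to 2 bits before transmission; that step is omitted here.
import torch

def topk_with_error_feedback(pseudo_grad, error_buffer, k_fraction=0.01):
    # add back the components dropped in previous rounds
    corrected = pseudo_grad + error_buffer

    flat = corrected.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = flat.abs().topk(k)           # keep only the largest 1-3% of components

    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]               # this is all that gets transmitted

    # whatever was not sent is carried forward to the next round
    new_error = (flat - sparse).reshape_as(pseudo_grad)
    return sparse.reshape_as(pseudo_grad), new_error
```

Error feedback is what keeps the compression from losing information: dropped components accumulate in the buffer until they are large enough to make the cut.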

What does that look like in practice? The training run spent roughly 70 seconds communicating per 20-minute compute window: 1,200 seconds of compute for every 70 of communication, or 94.5% compute utilisation on commodity 500 Mb/s internet. Not a theoretical claim. The model exists.

DisTrO, from Nous Research, takes a different approach entirely. It transforms gradients into the frequency domain (similar to how JPEG compresses images) and transmits only the low-frequency components. Nous reports 1,000 to 10,000x compression. Different maths, same insight: most of the information in a gradient update is redundant.
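
DisTrO's internals are not fully public, so the following is only an illustration of the stated idea, keeping low-frequency coefficients of the update, not Nous's actual transform or selection rule.

```python
# Illustrative only: compress an update by keeping a small number of
# low-frequency FFT coefficients, in the spirit of the frequency-domain
# approach described above. Not DisTrO's actual algorithm.
import torch

def frequency_compress(update, keep_fraction=0.01):
    coeffs = torch.fft.rfft(update.flatten())        # into the frequency domain
    k = max(1, int(keep_fraction * coeffs.numel()))
    return coeffs[:k], update.numel()                # transmit only the low frequencies

def frequency_decompress(compressed, original_numel):
    coeffs = torch.zeros(original_numel // 2 + 1, dtype=compressed.dtype)
    coeffs[:compressed.numel()] = compressed
    return torch.fft.irfft(coeffs, n=original_numel)  # back to parameter space
```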

These are not incremental improvements. They are architectural shifts that make internet-speed training viable where it was previously impossible.

Three approaches to decentralised training

Not all decentralised training looks the same. Three distinct models have emerged, each solving a different problem. The structural difference between them is what gets distributed: compute, data, or trust.

Three approaches at a glance

| Approach | What gets distributed | Where data lives | What's verified | Live example |
| --- | --- | --- | --- | --- |
| Distributed pre-training | Compute across many independent participants training one model | Replicated to each worker | Final model benchmarks | Templar Covenant-72B; Prime Intellect INTELLECT-1/2 |
| Federated learning | Training itself (data stays put, only weight deltas sync) | Stays at each data holder, never aggregated | Aggregator-attested updates | FLock (FLoRA, NeurIPS 2024) |
| Verified training | Trust (cryptographic proof that training was correct) | Typically centralised dataset | Each computation step, not just the output | Gensyn (Verde, pre-mainnet) |

The three approaches solve different problems and ship at different stages of maturity. Distributed pre-training is producing competitive models today. Federated learning has shipped technical contributions (FLoRA at NeurIPS) and partial production deployments. Verified training is the most ambitious and the least proven.

1. Distributed pre-training: Templar and Prime Intellect

This is the most ambitious approach. Take a model architecture, split the training across independent participants worldwide, and produce a single trained model at the end. It is also the approach that produced the most dramatic proof point.

Covenant-72B (Templar)

Covenant-72B (Bittensor Subnet 3 / Templar) launched on 10 March 2026. Seventy-two billion parameters. Trained from scratch on 1.1 trillion tokens of the DCLM-baseline dataset. Twenty or more independent participants contributing compute over six months, using SparseLoCo over commodity internet connections. Apache 2.0 licence. Weights on HuggingFace. arXiv paper.

The headline number: 67.1 on MMLU (zero-shot), beating LLaMA-2-70B’s 65.6. That is the number that made the rounds. Chamath Palihapitiya raised it on the All-In Podcast with Jensen Huang in late March, and Huang called the open-source approach complementary to proprietary AI.

Look closer and it is more nuanced.

Covenant-72B vs LLaMA-2-70B benchmarks (zero-shot, base model)

| Benchmark | Covenant-72B | LLaMA-2-70B |
| --- | --- | --- |
| MMLU | 67.1 | 65.6 |
| ARC-Easy | 80.9 | 79.6 |
| ARC-Challenge | 56.8 | 57.4 |
| HellaSwag | 80.6 | 84.3 |
| WinoGrande | 75.9 | 80.4 |
| PIQA | 81.6 | 82.6 |
| OpenBookQA | 44.0 | 49.4 |

Covenant wins on MMLU and ARC-Easy. LLaMA-2-70B wins on the other five benchmarks. The comparison is also to a model released in July 2023, not to current frontier. Nobody is claiming this beats LLaMA-3. The benchmarks are self-reported by the Templar team. The weights are public, so anyone can verify, but nobody has published independent results yet.

None of that diminishes what it actually proves. A 72 billion parameter model was trained from scratch by independent participants, over the open internet, using commodity hardware. The model works. It is competitive with a Meta model that cost tens of millions to train in a centralised cluster. Six months ago, this was a research paper. Now it is a downloadable checkpoint.

Prime Intellect: INTELLECT-1 and INTELLECT-2

Prime Intellect has been pushing similar boundaries. INTELLECT-1 (November 2024) trained a 10B model across 30 compute sponsors in five countries using DiLoCo, reporting 400x bandwidth reduction and 83% cross-continent utilisation. INTELLECT-2 (May 2025) scaled to 32B parameters with reinforcement learning, beating QwQ-32B on AIME24 and LiveCodeBench according to their own evaluations.

Nous Research and Psyche

Nous Research (Psyche Network) raised $65M and is building DisTrO-based training on Solana. A December 2024 test trained a 15B model across distributed nodes. Nous reports that within 44 minutes of testnet launch, $500K worth of GPU power was contributed. Still early, but funded.

2. Federated learning: FLock

FLock solves a different problem. Instead of distributing training compute, it distributes training data.

The premise: valuable training data often cannot leave its source. Medical records stay in hospitals. Financial data stays behind compliance boundaries. Personal data should not be aggregated by a third party. Federated learning trains models where the data lives, sharing only model updates (gradients or weight deltas), never the raw data.

FLock’s technical contribution is FLoRA (Federated Low-Rank Adaptation), published at NeurIPS 2024. Paper. The problem it solves: when different nodes train LoRA adapters of different sizes (because they have different hardware and different data), naive aggregation introduces mathematical noise. FLoRA uses stacking-based aggregation with scaling factors proportional to local data size, reducing trainable parameters to 0.78% of the full model while allowing heterogeneous clients.
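
In spirit, the stacking trick looks like this: concatenate each client's LoRA matrices along the rank dimension, weighted by local data size, so the aggregate equals the weighted sum of every client's update with no cross-client terms. The shapes and weighting below are illustrative, not FLock's implementation.

```python
# Sketch of stacking-based aggregation in the spirit of FLoRA (illustrative).
# Client i trains a LoRA adapter (B_i, A_i) of its own rank r_i; B_i is
# (d_out, r_i) and A_i is (r_i, d_in). Stacking along the rank dimension gives
# B_stack @ A_stack == sum_i scale_i * B_i @ A_i, even when the ranks differ.
import torch

def flora_aggregate(adapters, data_sizes):
    total = sum(data_sizes)
    scales = [n / total for n in data_sizes]     # proportional to local data size

    B_stack = torch.cat([s * B for s, (B, _) in zip(scales, adapters)], dim=1)
    A_stack = torch.cat([A for (_, A) in adapters], dim=0)
    return B_stack, A_stack                      # aggregate adapter of rank sum(r_i)
```

Averaging the A and B matrices separately, by contrast, yields (mean B)(mean A), which contains cross-client products that no node ever trained. That is the noise the stacking construction avoids.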

The results so far: 16 consecutive training tasks on AI Arena, 9,062 training submissions, 196 training nodes, and 11.9M FLOCK in protocol fees (roughly $2.7M) over 10 months. FLock 2025 Earnings Report. These are self-reported numbers, but the revenue figure represents actual on-chain protocol fees, not projections.

FLock’s partnerships with UNDP, Hong Kong’s HKGAI, and NHS hospitals suggest real demand for privacy-preserving training. The degree of operational deployment versus memoranda of understanding is unclear from public sources.

This approach will not produce the next GPT. It is not trying to. It is trying to unlock training on data that centralised labs cannot access. That is a genuinely different value proposition.

3. Verified training: Gensyn

Gensyn is solving the trust problem. If someone claims they trained your model correctly on untrusted hardware, how do you verify that without re-running the entire computation?

Their answer is Verde, a verification protocol that uses a two-level bisection game. When a result is disputed, the first bisection narrows the disagreement to a specific training iteration. The second narrows to a specific operation within that iteration. A referee (smart contract or verifier jury) recomputes only that single disputed operation. If at least one verifier is honest, the correct result is guaranteed.

Overhead is less than one order of magnitude. For comparison, full cryptographic proof systems (like zkML) add four orders of magnitude or more. Verde makes verification practical for real training workloads. Verde paper
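
A simplified sketch of the bisection idea (not Gensyn's actual protocol or code): both sides commit to hashes of intermediate states, and two rounds of binary search pin the disagreement down to a single operation for the referee to recompute.

```python
# Simplified sketch of a two-level bisection dispute game (illustrative).
# Assumes both parties agree on the starting state, disagree on the final
# output, and have committed hashes for every iteration and every operation.
def first_divergence(claimed_hashes, challenger_hashes):
    """Binary search for the first index where the two hash sequences disagree."""
    lo, hi = 0, len(claimed_hashes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if claimed_hashes[mid] == challenger_hashes[mid]:
            lo = mid + 1      # still in agreement here; divergence is later
        else:
            hi = mid          # divergence is here or earlier
    return lo

def resolve_dispute(iters_a, iters_b, ops_a, ops_b, recompute_op):
    bad_iter = first_divergence(iters_a, iters_b)                  # level 1: which iteration
    bad_op = first_divergence(ops_a[bad_iter], ops_b[bad_iter])    # level 2: which operation
    return recompute_op(bad_iter, bad_op)    # referee recomputes one operation, not the run
```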

Gensyn is still pre-mainnet. The testnet reported 2 million models trained and 165,000 users through RL Swarm, though these numbers come from the team’s own press releases around their December 2025 token sale. Gensyn docs. The $43M Series A led by a16z in June 2023 gives them runway, but mainnet has been “expected in 2026” without a confirmed date.

Verification matters because it is what separates decentralised training from just hoping your compute providers did the work correctly. Templar uses an on-chain contribution scoring system (Gauntlet) based on loss evaluation. But Gensyn’s approach is more rigorous, catching not just poor contributions but actively adversarial ones. If decentralised training scales beyond ideologically motivated participants to mercenary compute providers, verification becomes essential.

The gap that remains

Epoch AI, an independent research organisation, published a direct comparison in early 2026. Templar’s network throughput: roughly 9 x 10^17 FLOP/s. Frontier AI data centres: roughly 3 x 10^20 FLOP/s. That is a 300x gap. Epoch AI analysis.

Compression is not free. Epoch estimates that scaling DiLoCo from 1 to 8 nodes is equivalent to a 1.5x decrease in effective training compute. At 10,000 nodes, you would need 6x as much total FLOP as a centralised run to achieve the same result.
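
Those two figures line up if you assume the penalty compounds with each further 8x increase in node count. The reading below is an illustration, not Epoch's published methodology.

```python
# Illustrative check: if every 8x increase in node count costs ~1.5x effective
# compute, the overhead at 10,000 nodes comes out to roughly 6x.
import math

penalty_per_8x = 1.5
nodes = 10_000
steps_of_8x = math.log(nodes, 8)               # ~4.4 successive 8x scale-ups
overhead = penalty_per_8x ** steps_of_8x
print(f"~{overhead:.1f}x extra FLOP at {nodes:,} nodes")   # ~6x
```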

Gap to frontier data centres: 300x
FLOP overhead at 10K nodes: 6x
Decentralised compute growth: 20x/year

But look at the trajectory. Decentralised training compute has grown roughly 20x per year since 2020, compared to 5x per year for frontier clusters. At those rates, Epoch suggests decentralised networks could theoretically match prior-generation frontier capacity in about five and a half years.

Here is the thing. Dario Amodei has said frontier training runs are approaching $1 billion in 2026, heading toward $10 billion by 2028. Decentralised networks do not need to match that. They need to serve the vast market of organisations that will never spend $1 billion on a training run but still need custom models. That is a fundamentally different competition.

Where decentralised training fits

Decentralised training is not going to replace NVIDIA’s GB200 NVL72 racks. Not this year. Probably not in five years. The bandwidth gap, while narrowing, still makes centralised clusters better for training the largest frontier models from scratch.

But “train the largest frontier model from scratch” is one use case. Decentralised training already works (or will soon) across several others.

Fine-tuning and adaptation is the obvious one. You do not need 30,000x bandwidth to fine-tune a 7B model on domain-specific data. LoRA adapters are small. Communication overhead is manageable. FLock’s entire business model is built on this, and it is generating revenue.
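
To put rough numbers on "small": an illustrative configuration, a 7B Llama-style model with rank-16 adapters on the four attention projections (the exact modules and ranks vary by setup).

```python
# Illustrative LoRA sizing for a 7B Llama-style model: d_model=4096, 32 layers,
# rank-16 adapters on the four attention projections. Real setups vary.
d_model, layers, rank, proj_per_layer = 4096, 32, 16, 4

lora_params = layers * proj_per_layer * 2 * rank * d_model   # A and B matrices
full_params = 7e9

print(f"adapter parameters:     {lora_params / 1e6:.0f}M")             # ~17M
print(f"adapter size (fp16):    {lora_params * 2 / 1e6:.0f} MB")       # ~34 MB
print(f"fraction of full model: {lora_params / full_params:.2%}")      # ~0.24%
```

Tens of megabytes per sync is a workload a home connection handles comfortably.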

Then there is privacy. When the data cannot move, the training must come to the data. Medical records staying in hospitals. Financial data locked behind compliance boundaries. No centralised lab can serve these use cases without becoming a custodian of the data they are meant to protect.

Specialised models at scale play to decentralisation’s strengths too. Thousands of niche models, each trained by communities with domain expertise that no centralised lab would bother learning. Bittensor’s subnet architecture was designed for exactly this.

Censorship resistance matters more than it sounds. If a government or corporate policy prohibits training on certain data or for certain purposes, a permissionless network routes around it. Regulatory divergence on AI training data is already creating demand for this.

And for the long tail of organisations that need custom models but will never spend $1 billion on a training run, cost-efficient pre-training on decentralised networks is simply cheaper. A well-trained 7B to 70B model, customised and sovereign, covers most enterprise use cases.

Covenant-72B did not match frontier capability. What it proved is more interesting: a group of independent participants, using commodity hardware and standard internet connections, can coordinate well enough to produce a model that works. Not a toy. Not a proof of concept. A 72 billion parameter checkpoint you can download today.

Twelve months ago, that sentence would have been aspirational. Now it is a link to HuggingFace.

Centralised labs will keep pushing the frontier. Decentralised training is not trying to stop them. It is building the layer underneath: sovereign, permissionless, and available to everyone who does not have a billion-dollar compute budget. That market is a lot bigger than the frontier.
