
Training Without a Data Centre

Decentralised AI training was supposed to be impossible. A 30,000x bandwidth gap said so. Then a Bittensor subnet trained a 72 billion parameter model over commodity internet. Here's what changed.

Inside a data centre, GPUs talk to each other at 1,800 GB/s. Your home internet does about 60 MB/s on a good day.

That is a 30,000x gap.

Training a large language model means synchronising what every GPU has learned with every other GPU, thousands of times per step. Every step. For weeks. The entire architecture of modern AI training assumes those GPUs sit in the same building, wired together with specialist hardware that costs more than most houses.

This is why the received wisdom was simple: you cannot train serious models across the open internet. The bandwidth kills you. The latency kills you. The synchronisation overhead makes the whole thing pointless.

That wisdom held for years. It does not hold any more.

Why training is the hardest layer

Inference is the easy part of decentralisation. A user sends a prompt, a GPU processes it, the response comes back. Each request is independent. You can shard inference across thousands of nodes with minimal coordination. That is why projects like Akash and Bittensor’s Chutes subnet already handle real inference workloads today.

Training is fundamentally different. When you train a model, every GPU needs to know what every other GPU learned. Not eventually, but continuously. In standard data-parallel training, each GPU processes a different batch of data, computes gradients, and then all GPUs synchronise those gradients before taking the next step. Ring all-reduce (the standard synchronisation algorithm) is bandwidth-optimal within a data centre. Over the internet, it is a disaster.
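
To make the cost concrete, here is a minimal sketch of one data-parallel step in PyTorch-style code. The helper names are illustrative; the point is that the all-reduce over every parameter's gradient happens on every single step.

```python
# Minimal sketch of one data-parallel training step (PyTorch-style, illustrative).
# Every worker computes gradients on its own batch, then all workers must
# exchange and average those gradients before anyone can take the next step.
import torch
import torch.distributed as dist

def data_parallel_step(model, batch, loss_fn, optimizer):
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()                              # local gradients only

    world_size = dist.get_world_size()
    for p in model.parameters():                 # full gradient exchange, every step
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size                 # average across all workers

    optimizer.step()                             # identical update on every rank
    optimizer.zero_grad()
```

For a 70 billion parameter model, that exchange is on the order of the full gradient tensor (roughly 140 GB in 16-bit precision) every step. NVLink and InfiniBand absorb that; a home connection does not.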

Put numbers on it:

The bandwidth gap between data centre and internet training

| Connection | Bandwidth | Context |
| --- | --- | --- |
| NVLink (intra-node) | 1,800 GB/s | GPU-to-GPU within a server |
| InfiniBand (inter-node) | 50-100 GB/s | Server-to-server in a data centre |
| Commodity internet | ~0.06 GB/s | What decentralised networks use |

Tensor parallelism (taking one layer of a neural network and splitting its computations across multiple GPUs) cannot cross node boundaries at all. It needs NVLink-class bandwidth. Latency within a data centre is around 500 microseconds. Internet latency runs 50 to 150 milliseconds. That is a 100 to 300x penalty before you even think about throughput.

Then there is memory. A 70 billion parameter model in mixed precision needs roughly 140 GB just for the weights. Add optimiser states and activations, and you are looking at 400 to 600 GB. A single H100 has 80 GB. You need multi-GPU setups regardless of whether they are centralised or decentralised.
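
A rough back-of-envelope calculation shows where those numbers come from. The breakdown below assumes bf16 weights, gradients, and optimiser moments and ignores activations, so it is one plausible accounting rather than a precise bill.

```python
# Rough memory accounting for a 70B-parameter model. One plausible breakdown
# that lands in the 400-600 GB range quoted above; exact numbers depend on
# optimiser-state precision and whether a full fp32 master copy is kept.
params = 70e9
GB = 1e9

weights = params * 2 / GB      # bf16 weights            ~140 GB
grads   = params * 2 / GB      # bf16 gradients          ~140 GB
adam    = params * 2 * 2 / GB  # two Adam moments, bf16  ~280 GB (fp32 doubles this)

total = weights + grads + adam
print(f"weights only:        {weights:.0f} GB")                     # 140 GB
print(f"with training state: {total:.0f} GB, before activations")   # 560 GB
print(f"minimum H100s at 80 GB each: {total / 80:.0f}")             # 7
```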

So how did anyone solve this?

The compression breakthrough

The answer is not faster internet. It is sending less data, less often.

DiLoCo (Distributed Low-Communication training), published by Google DeepMind in 2023, rewrote the playbook. Instead of synchronising gradients every step, DiLoCo lets each worker run hundreds of local optimisation steps independently, then synchronise only compressed pseudo-gradients. The inner optimiser (AdamW) runs locally. The outer optimiser (Nesterov momentum) runs globally but infrequently. A 500x reduction in communication frequency. Add 8-bit quantisation on top and you get up to 2,000x total compression.
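
A minimal sketch of that inner/outer structure, assuming PyTorch-style optimisers; H=500 and the variable names are illustrative, not the paper's.

```python
# Sketch of the DiLoCo inner/outer loop described above (illustrative, simplified).
# Each worker runs H local AdamW steps, then all workers average their
# "pseudo-gradient" (old global weights minus new local weights) and apply it
# with an outer Nesterov-momentum optimiser. Communication happens once per round.
import torch
import torch.distributed as dist

def diloco_round(model, inner_opt, outer_opt, data_iter, loss_fn, H=500):
    global_params = [p.detach().clone() for p in model.parameters()]

    for _ in range(H):                            # inner loop: purely local
        batch = next(data_iter)
        loss = loss_fn(model(batch["x"]), batch["y"])
        loss.backward()
        inner_opt.step()                          # e.g. AdamW
        inner_opt.zero_grad()

    world_size = dist.get_world_size()
    for p, g in zip(model.parameters(), global_params):
        pseudo_grad = g - p.detach()              # how far this worker drifted
        dist.all_reduce(pseudo_grad, op=dist.ReduceOp.SUM)   # the only communication
        p.data.copy_(g)                           # reset to the old global weights
        p.grad = pseudo_grad / world_size         # hand the averaged delta to the outer step

    outer_opt.step()                              # e.g. SGD with Nesterov momentum
    outer_opt.zero_grad()
```

With H=500, the workers talk once every 500 steps instead of every step, which is where the 500x reduction in communication frequency comes from.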

SparseLoCo, the algorithm behind Bittensor’s Covenant-72B, pushes this further. It selects only the top 1 to 3% of gradient components (chunk-wise Top-k selection), quantises them to 2 bits, and carries forward the unselected components via error feedback. According to the Covenant-72B paper, compression hits 146x on top of the already-reduced sync frequency.
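
The core of that compressor, minus the 2-bit quantisation and the chunk-wise bookkeeping, fits in a few lines. The 1% fraction and the flattening are illustrative simplifications, not the paper's exact scheme.

```python
# Sketch of Top-k sparsification with error feedback, the core idea behind
# SparseLoCo's compression. The surviving values would additionally be
# quantised to 2 bits before transmission; that step is omitted here.
import torch

def topk_with_error_feedback(pseudo_grad, error_buffer, k_fraction=0.01):
    # add back the components dropped in previous rounds
    corrected = pseudo_grad + error_buffer

    flat = corrected.flatten()
    k = max(1, int(k_fraction * flat.numel()))
    _, idx = flat.abs().topk(k)           # keep only the largest 1-3% of components

    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]               # this is all that gets transmitted

    # whatever was not sent is carried forward to the next round
    new_error = (flat - sparse).reshape_as(pseudo_grad)
    return sparse.reshape_as(pseudo_grad), new_error
```

Error feedback is what keeps the compression from losing information: dropped components accumulate in the buffer until they are large enough to make the cut.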

What does that look like in practice? The training run spent roughly 70 seconds communicating per 20-minute compute window: 1,200 seconds of compute for every 70 of communication, or 94.5% compute utilisation on commodity 500 Mb/s internet. Not a theoretical claim. The model exists.

DisTrO, from Nous Research, takes a different approach entirely. It transforms gradients into the frequency domain (similar to how JPEG compresses images) and transmits only the low-frequency components. Nous reports 1,000 to 10,000x compression. Different maths, same insight: most of the information in a gradient update is redundant.
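
DisTrO's internals are not fully public, so the following is only an illustration of the stated idea, keeping low-frequency coefficients of the update, not Nous's actual transform or selection rule.

```python
# Illustrative only: compress an update by keeping a small number of
# low-frequency FFT coefficients, in the spirit of the frequency-domain
# approach described above. Not DisTrO's actual algorithm.
import torch

def frequency_compress(update, keep_fraction=0.01):
    coeffs = torch.fft.rfft(update.flatten())        # into the frequency domain
    k = max(1, int(keep_fraction * coeffs.numel()))
    return coeffs[:k], update.numel()                # transmit only the low frequencies

def frequency_decompress(compressed, original_numel):
    coeffs = torch.zeros(original_numel // 2 + 1, dtype=compressed.dtype)
    coeffs[:compressed.numel()] = compressed
    return torch.fft.irfft(coeffs, n=original_numel)  # back to parameter space
```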

These are not incremental improvements. They are architectural shifts that make internet-speed training viable where it was previously impossible.

Three approaches to decentralised training

Not all decentralised training looks the same. Three distinct models have emerged, each solving a different problem. The structural difference between them is what gets distributed: compute, data, or trust.

Three approaches at a glance

| Approach | What gets distributed | Where data lives | What's verified | Live example |
| --- | --- | --- | --- | --- |
| Distributed pre-training | Compute across many independent participants training one model | Replicated to each worker | Final model benchmarks | Templar Covenant-72B; Prime Intellect INTELLECT-1/2 |
| Federated learning | Training itself (data stays put, only weight deltas sync) | Stays at each data holder, never aggregated | Aggregator-attested updates | FLock (FLoRA, NeurIPS 2024) |
| Verified training | Trust (cryptographic proof that training was correct) | Typically centralised dataset | Each computation step, not just the output | Gensyn (Verde, pre-mainnet) |

The three approaches solve different problems and ship at different stages of maturity. Distributed pre-training is producing competitive models today. Federated learning has shipped technical contributions (FLoRA at NeurIPS) and partial production deployments. Verified training is the most ambitious and the least proven.

1. Distributed pre-training: Templar and Prime Intellect

This is the most ambitious approach. Take a model architecture, split the training across independent participants worldwide, and produce a single trained model at the end. It is also the approach that produced the most dramatic proof point.

Covenant-72B (Templar)

Covenant-72B (Bittensor Subnet 3 / Templar) launched on 10 March 2026. Seventy-two billion parameters. Trained from scratch on 1.1 trillion tokens of the DCLM-baseline dataset. Twenty or more independent participants contributing compute over six months, using SparseLoCo over commodity internet connections. Apache 2.0 licence. Weights on HuggingFace. arXiv paper.

The headline number: 67.1 on MMLU (zero-shot), beating LLaMA-2-70B’s 65.6. That is the number that made the rounds. Chamath Palihapitiya raised it on the All-In Podcast with Jensen Huang in late March, and Huang called the open-source approach complementary to proprietary AI.

Look closer and it is more nuanced.

Covenant-72B vs LLaMA-2-70B benchmarks (zero-shot, base model)

| Benchmark | Covenant-72B | LLaMA-2-70B |
| --- | --- | --- |
| MMLU | 67.1 | 65.6 |
| ARC-Easy | 80.9 | 79.6 |
| ARC-Challenge | 56.8 | 57.4 |
| HellaSwag | 80.6 | 84.3 |
| WinoGrande | 75.9 | 80.4 |
| PIQA | 81.6 | 82.6 |
| OpenBookQA | 44.0 | 49.4 |

Covenant wins on MMLU and ARC-Easy. LLaMA-2-70B wins on the other five benchmarks. The comparison is also to a model released in July 2023, not to current frontier. Nobody is claiming this beats LLaMA-3. The benchmarks are self-reported by the Templar team. The weights are public, so anyone can verify, but nobody has published independent results yet.

None of that diminishes what it actually proves. A 72 billion parameter model was trained from scratch by independent participants, over the open internet, using commodity hardware. The model works. It is competitive with a Meta model that cost tens of millions to train in a centralised cluster. Six months ago, this was a research paper. Now it is a downloadable checkpoint.

Prime Intellect: INTELLECT-1 and INTELLECT-2

Prime Intellect has been pushing similar boundaries. INTELLECT-1 (November 2024) trained a 10B model across 30 compute sponsors in five countries using DiLoCo, reporting 400x bandwidth reduction and 83% cross-continent utilisation. INTELLECT-2 (May 2025) scaled to 32B parameters with reinforcement learning, beating QwQ-32B on AIME24 and LiveCodeBench according to their own evaluations.

Nous Research and Psyche

Nous Research (Psyche Network) raised $65M and is building DisTrO-based training on Solana. A December 2024 test trained a 15B model across distributed nodes. Nous reports that within 44 minutes of testnet launch, $500K worth of GPU power was contributed. Still early, but funded.

2. Federated learning: FLock

FLock solves a different problem. Instead of distributing training compute, it distributes training data.

The premise: valuable training data often cannot leave its source. Medical records stay in hospitals. Financial data stays behind compliance boundaries. Personal data should not be aggregated by a third party. Federated learning trains models where the data lives, sharing only model updates (gradients or weight deltas), never the raw data.

FLock’s technical contribution is FLoRA (Federated Low-Rank Adaptation), published at NeurIPS 2024. Paper. The problem it solves: when different nodes train LoRA adapters of different sizes (because they have different hardware and different data), naive aggregation introduces mathematical noise. FLoRA uses stacking-based aggregation with scaling factors proportional to local data size, reducing trainable parameters to 0.78% of the full model while allowing heterogeneous clients.
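
In spirit, the stacking trick looks like this: concatenate each client's LoRA matrices along the rank dimension, weighted by local data size, so the aggregate equals the weighted sum of every client's update with no cross-client terms. The shapes and weighting below are illustrative, not FLock's implementation.

```python
# Sketch of stacking-based aggregation in the spirit of FLoRA (illustrative).
# Client i trains a LoRA adapter (B_i, A_i) of its own rank r_i; B_i is
# (d_out, r_i) and A_i is (r_i, d_in). Stacking along the rank dimension gives
# B_stack @ A_stack == sum_i scale_i * B_i @ A_i, even when the ranks differ.
import torch

def flora_aggregate(adapters, data_sizes):
    total = sum(data_sizes)
    scales = [n / total for n in data_sizes]     # proportional to local data size

    B_stack = torch.cat([s * B for s, (B, _) in zip(scales, adapters)], dim=1)
    A_stack = torch.cat([A for (_, A) in adapters], dim=0)
    return B_stack, A_stack                      # aggregate adapter of rank sum(r_i)
```

Averaging the A and B matrices separately, by contrast, yields (mean B)(mean A), which contains cross-client products that no node ever trained. That is the noise the stacking construction avoids.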

The results so far: 16 consecutive training tasks on AI Arena, 9,062 training submissions, 196 training nodes, and 11.9M FLOCK in protocol fees (roughly $2.7M) over 10 months. FLock 2025 Earnings Report. These are self-reported numbers, but the revenue figure represents actual on-chain protocol fees, not projections.

FLock’s partnerships with UNDP, Hong Kong’s HKGAI, and NHS hospitals suggest real demand for privacy-preserving training. The degree of operational deployment versus memoranda of understanding is unclear from public sources.

This approach will not produce the next GPT. It is not trying to. It is trying to unlock training on data that centralised labs cannot access. That is a genuinely different value proposition.

3. Verified training: Gensyn

Gensyn is solving the trust problem. If someone claims they trained your model correctly on untrusted hardware, how do you verify that without re-running the entire computation?

Their answer is Verde, a verification protocol that uses a two-level bisection game. When a result is disputed, the first bisection narrows the disagreement to a specific training iteration. The second narrows to a specific operation within that iteration. A referee (smart contract or verifier jury) recomputes only that single disputed operation. If at least one verifier is honest, the correct result is guaranteed.

Overhead is less than one order of magnitude. For comparison, full cryptographic proof systems (like zkML) add four orders of magnitude or more. Verde makes verification practical for real training workloads. Verde paper
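
A simplified sketch of the bisection idea (not Gensyn's actual protocol or code): both sides commit to hashes of intermediate states, and two rounds of binary search pin the disagreement down to a single operation for the referee to recompute.

```python
# Simplified sketch of a two-level bisection dispute game (illustrative).
# Assumes both parties agree on the starting state, disagree on the final
# output, and have committed hashes for every iteration and every operation.
def first_divergence(claimed_hashes, challenger_hashes):
    """Binary search for the first index where the two hash sequences disagree."""
    lo, hi = 0, len(claimed_hashes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if claimed_hashes[mid] == challenger_hashes[mid]:
            lo = mid + 1      # still in agreement here; divergence is later
        else:
            hi = mid          # divergence is here or earlier
    return lo

def resolve_dispute(iters_a, iters_b, ops_a, ops_b, recompute_op):
    bad_iter = first_divergence(iters_a, iters_b)                  # level 1: which iteration
    bad_op = first_divergence(ops_a[bad_iter], ops_b[bad_iter])    # level 2: which operation
    return recompute_op(bad_iter, bad_op)    # referee recomputes one operation, not the run
```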

Gensyn is still pre-mainnet. The testnet reported 2 million models trained and 165,000 users through RL Swarm, though these numbers come from the team’s own press releases around their December 2025 token sale. Gensyn docs. The $43M Series A led by a16z in June 2023 gives them runway, but mainnet has been “expected in 2026” without a confirmed date.

Verification matters because it is what separates decentralised training from just hoping your compute providers did the work correctly. Templar uses an on-chain contribution scoring system (Gauntlet) based on loss evaluation. But Gensyn’s approach is more rigorous, catching not just poor contributions but actively adversarial ones. If decentralised training scales beyond ideologically motivated participants to mercenary compute providers, verification becomes essential.

The gap that remains

Epoch AI, an independent research organisation, published a direct comparison in early 2026. Templar’s network throughput: roughly 9 x 10^17 FLOP/s. Frontier AI data centres: roughly 3 x 10^20 FLOP/s. That is a 300x gap. Epoch AI analysis.

Compression is not free. Epoch estimates that scaling DiLoCo from 1 to 8 nodes is equivalent to a 1.5x decrease in effective training compute. At 10,000 nodes, you would need 6x as much total FLOP as a centralised run to achieve the same result.
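
Those two figures line up if you assume the penalty compounds with each further 8x increase in node count. The reading below is an illustration, not Epoch's published methodology.

```python
# Illustrative check: if every 8x increase in node count costs ~1.5x effective
# compute, the overhead at 10,000 nodes comes out to roughly 6x.
import math

penalty_per_8x = 1.5
nodes = 10_000
steps_of_8x = math.log(nodes, 8)               # ~4.4 successive 8x scale-ups
overhead = penalty_per_8x ** steps_of_8x
print(f"~{overhead:.1f}x extra FLOP at {nodes:,} nodes")   # ~6x
```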

Gap to frontier data centres: 300x
FLOP overhead at 10K nodes: 6x
Decentralised compute growth: 20x/year

But look at the trajectory. Decentralised training compute has grown roughly 20x per year since 2020, compared to 5x per year for frontier clusters. At those rates, Epoch suggests decentralised networks could theoretically match prior-generation frontier capacity in about five and a half years.

Here is the thing. Dario Amodei has said frontier training runs are approaching $1 billion in 2026, heading toward $10 billion by 2028. Decentralised networks do not need to match that. They need to serve the vast market of organisations that will never spend $1 billion on a training run but still need custom models. That is a fundamentally different competition.

Where decentralised training fits

Decentralised training is not going to replace NVIDIA’s GB200 NVL72 racks. Not this year. Probably not in five years. The bandwidth gap, while narrowing, still makes centralised clusters better for training the largest frontier models from scratch.

But “train the largest frontier model from scratch” is one use case. Decentralised training already works (or will soon) across several others.

Fine-tuning and adaptation is the obvious one. You do not need 30,000x bandwidth to fine-tune a 7B model on domain-specific data. LoRA adapters are small. Communication overhead is manageable. FLock’s entire business model is built on this, and it is generating revenue.
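
To put rough numbers on "small": an illustrative configuration, a 7B Llama-style model with rank-16 adapters on the four attention projections (the exact modules and ranks vary by setup).

```python
# Illustrative LoRA sizing for a 7B Llama-style model: d_model=4096, 32 layers,
# rank-16 adapters on the four attention projections. Real setups vary.
d_model, layers, rank, proj_per_layer = 4096, 32, 16, 4

lora_params = layers * proj_per_layer * 2 * rank * d_model   # A and B matrices
full_params = 7e9

print(f"adapter parameters:     {lora_params / 1e6:.0f}M")             # ~17M
print(f"adapter size (fp16):    {lora_params * 2 / 1e6:.0f} MB")       # ~34 MB
print(f"fraction of full model: {lora_params / full_params:.2%}")      # ~0.24%
```

Tens of megabytes per sync is a workload a home connection handles comfortably.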

Then there is privacy. When the data cannot move, the training must come to the data. Medical records staying in hospitals. Financial data locked behind compliance boundaries. No centralised lab can serve these use cases without becoming a custodian of the data they are meant to protect.

Specialised models at scale play to decentralisation’s strengths too. Thousands of niche models, each trained by communities with domain expertise that no centralised lab would bother learning. Bittensor’s subnet architecture was designed for exactly this.

Censorship resistance matters more than it sounds. If a government or corporate policy prohibits training on certain data or for certain purposes, a permissionless network routes around it. Regulatory divergence on AI training data is already creating demand for this.

And for the long tail of organisations that need custom models but will never spend $1 billion on a training run, cost-efficient pre-training on decentralised networks is simply cheaper. A well-trained 7B to 70B model, customised and sovereign, covers most enterprise use cases.

Covenant-72B did not match frontier capability. What it proved is more interesting: a group of independent participants, using commodity hardware and standard internet connections, can coordinate well enough to produce a model that works. Not a toy. Not a proof of concept. A 72 billion parameter checkpoint you can download today.

Twelve months ago, that sentence would have been aspirational. Now it is a link to HuggingFace.

Centralised labs will keep pushing the frontier. Decentralised training is not trying to stop them. It is building the layer underneath: sovereign, permissionless, and available to everyone who does not have a billion-dollar compute budget. That market is a lot bigger than the frontier.
