Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over the Internet

Posted by bilsbie 5 hours ago


Comments

Comment by synapz_org 43 minutes ago

Author here. Some context on what this is and why it matters.

Covenant-72B is the largest language model pre-trained through fully permissionless, decentralized coordination. 72 billion parameters, approximately 1.1 trillion tokens, trained across 70+ contributors on commodity internet connections. No datacenter, no central cluster, and no whitelisting of participants. Anyone with GPUs could join or leave at any time during the run.

The two hard problems in this setting are bandwidth and trust.

For bandwidth: synchronizing full gradients for a 72B model over residential internet is not feasible. We developed SparseLoCo, which compresses gradient communication by over 146x. Each peer transmits 1.56% of a full gradient per round using top-k sparsification, 2-bit quantization, and error feedback. The result was 94.5% compute utilization and 70-second communication overhead per round (versus 8.3 minutes for INTELLECT-1, a whitelisted 10B run).
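The combination described above (top-k sparsification, coarse quantization, error feedback) can be sketched in a few lines of NumPy. This is a simplified illustration of the general technique, not the actual SparseLoCo implementation; the 2-bit scheme here (sign plus a two-level magnitude codebook) and the function name are assumptions for the example.

```python
import numpy as np

def sparsify_quantize(grad, residual, k_frac=0.0156):
    """One round of top-k sparsification with a 2-bit-style quantizer and
    error feedback (simplified sketch, not the SparseLoCo codebase)."""
    # Error feedback: fold in whatever was not transmitted in prior rounds.
    acc = grad + residual
    k = max(1, int(k_frac * acc.size))
    # Keep only the k largest-magnitude entries (~1.56% of the gradient).
    idx = np.argpartition(np.abs(acc), -k)[-k:]
    vals = acc[idx]
    # 2-bit-style quantization: sign bit plus one of two magnitude buckets,
    # each reconstructed as the mean magnitude of its bucket.
    mags = np.abs(vals)
    thresh = mags.mean()
    lo, hi = mags[mags <= thresh], mags[mags > thresh]
    lo_m = lo.mean() if lo.size else 0.0
    hi_m = hi.mean() if hi.size else lo_m
    deq = np.where(mags > thresh, hi_m, lo_m) * np.sign(vals)
    # New residual: everything this round failed to transmit, kept locally
    # so it is re-attempted next round instead of being silently dropped.
    new_residual = acc.copy()
    new_residual[idx] -= deq
    sparse_update = np.zeros_like(acc)
    sparse_update[idx] = deq
    return sparse_update, new_residual
```

By construction, `sparse_update + new_residual` exactly equals `grad + residual`, which is why error feedback lets aggressive compression converge: nothing is lost, only deferred.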

For trust: when anyone can participate, anyone can submit garbage updates. Gauntlet is our validation layer. It scores every submission every round by measuring loss improvement on assigned and held-out data, running integrity checks, and applying persistent ranking. Only top-scoring updates touch the model.
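A Gauntlet-style filter can be illustrated with a minimal sketch: score each peer's submitted update by its loss improvement on held-out data, fold that into a persistent per-peer score, and accept only the top-ranked fraction. The interface below (loss values passed in per peer, the decay and cutoff parameters, the names) is hypothetical, assumed purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class PeerRecord:
    score: float = 0.0  # exponentially weighted running score across rounds

def rank_round(baseline_loss, submissions, records, decay=0.9, top_frac=0.5):
    """Simplified sketch of a Gauntlet-style validation round.
    `submissions` maps peer_id -> held-out loss after applying that peer's
    update; `records` is the persistent peer_id -> PeerRecord ranking."""
    for peer, loss in submissions.items():
        improvement = baseline_loss - loss  # positive means the update helped
        rec = records.setdefault(peer, PeerRecord())
        # Persistent ranking: a peer's history matters, not just this round,
        # so a Sybil cannot profit from one lucky (or poisoned) submission.
        rec.score = decay * rec.score + (1 - decay) * improvement
    ranked = sorted(submissions, key=lambda p: records[p].score, reverse=True)
    n_keep = max(1, int(top_frac * len(ranked)))
    return ranked[:n_keep]  # only these updates touch the model
```

The persistent score is the key design choice: garbage or poisoned updates lower a peer's standing over time, so sustained attacks become progressively cheaper to reject.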

The base model is competitive with LLaMA-2-70B on ARC despite being trained on roughly half the token budget. After fine-tuning, the chat model outperforms both K2-Chat and LLaMA-2-70B-Chat on IFEval and MATH.

Weights are Apache 2.0 on HuggingFace: https://huggingface.co/1Covenant/Covenant-72B

Built by Covenant AI with Mila Quebec. Happy to answer questions about the training protocol, compression methods, or the validation mechanism.

Comment by LuxBennu 5 hours ago

Interesting that SparseLoCo held up at 72B scale with permissionless participants. I run distributed inference across multiple machines over Tailscale (M2 Max + RTX 5070 Ti), and even in that controlled setup, network variance is the dominant bottleneck. The fact that they got competitive quality with peers joining and leaving freely on 1.1T tokens is impressive — though I'd love to see how much the blockchain verification overhead actually cost in effective compute utilization.

Comment by Kave0ne 5 hours ago

The Byzantine fault tolerance question here is interesting. With 72B parameters trained across untrusted peers, even a small fraction of malicious nodes could introduce subtle gradient poisoning that degrades model quality in non-obvious ways. Curious how they handle the verification overhead at scale - cryptographic proofs on gradient updates would add significant latency. Is the threat model just Sybil attacks, or also honest-but-curious nodes leaking gradient information?