Phase 4 Complete — Pipeline Parallelism Live

Trustless, Peer-to-Peer
AI Inference Network

Any device with a chip becomes a node. Pool consumer hardware into a single network capable of running massive LLMs — no GPU cluster required.

  • 7 Rust crates
  • 0 memory copies
  • 50% RAM saved per node
  • < 2s peer discovery

Four Pillars

The protocol is built on four foundational mechanisms

Compute

Pipeline parallelism shards model layers across devices, routing hidden-state tensors over a low-latency P2P mesh via pure QUIC.

Consumer devices pool their unified memory to run models no single device could hold.


Storage

GGUF files are chunked by transformer block, content-addressed (BLAKE3 → CIDv1), and distributed via a custom 64 MiB sliding-window protocol.

No centralized model hosting. Weights are resilient, deduplicated, and globally available.
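The chunk-and-address step can be sketched in a few lines. This is illustrative only: the real pipeline hashes with BLAKE3 and encodes digests as CIDv1, but `hashlib.sha256` stands in here because BLAKE3 is not in Python's standard library, and `chunk_and_address` is a hypothetical helper, not protocol code.

```python
import hashlib

def chunk_and_address(weights: bytes, chunk_size: int) -> list[tuple[str, bytes]]:
    """Split a weight blob into fixed-size chunks and content-address each one.

    Stand-in sketch: the real protocol uses BLAKE3 -> CIDv1, not SHA-256 hex.
    """
    chunks = []
    for offset in range(0, len(weights), chunk_size):
        chunk = weights[offset:offset + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()  # stand-in for BLAKE3 -> CIDv1
        chunks.append((cid, chunk))
    return chunks

blob = bytes(range(256)) * 4  # toy "weights": four identical 256-byte blocks
chunks = chunk_and_address(blob, 256)
print(len(chunks))  # 4
```

Because identical chunks hash to the same address, the four blocks above all map to one CID — which is exactly the property that makes content-addressed storage deduplicate repeated weight blocks for free.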


Privacy

Federated learning lets contributors train locally on private data, uploading only weight gradients — never the raw data.

Data sovereignty preserved. No central entity sees user data.

Incentives

zkML cryptographically proves correct inference. A financial RLHF system requires nodes to stake tokens, rewards quality responses, and slashes dishonest nodes.

Trustless verification. Economic alignment between node operators and users.
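The stake / reward / slash loop can be illustrated with a toy ledger. The class name, slash rate, and settlement flow below are assumptions for illustration, not the protocol's actual tokenomics.

```python
from dataclasses import dataclass

@dataclass
class Node:
    stake: float
    reward: float = 0.0

class StakingLedger:
    """Toy model of the incentive mechanism: honest work is paid,
    work that fails proof verification is slashed.  Illustrative only."""

    def __init__(self, slash_rate: float = 0.5):
        self.slash_rate = slash_rate
        self.nodes: dict[str, Node] = {}

    def register(self, node_id: str, stake: float) -> None:
        self.nodes[node_id] = Node(stake=stake)

    def settle(self, node_id: str, proof_valid: bool, payment: float) -> None:
        node = self.nodes[node_id]
        if proof_valid:
            node.reward += payment              # valid zk proof: pay the node
        else:
            node.stake *= 1 - self.slash_rate   # invalid proof: slash the stake

ledger = StakingLedger()
ledger.register("node-a", stake=100.0)
ledger.settle("node-a", proof_valid=False, payment=1.0)
print(ledger.nodes["node-a"].stake)  # 50.0
```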

Architecture

7-crate Rust workspace — edition 2024, strictly bottom-up

omni-zkml

Zero-knowledge ML proofs (planned)

omni-node

CLI binary — listen, shard, fetch

omni-bridge

PyO3 FFI with zero-copy buffer protocol

omni-pipeline

Pipeline parallelism coordination

omni-net

P2P mesh networking (libp2p 0.55, QUIC)

omni-store

GGUF sharding & content-addressed storage

omni-types

Shared types, errors & config

Apple Silicon Zero-Copy Path

1. Rust mmap (GGUF shard file) — file mapped into unified memory; CPU and GPU share physical RAM
2. PyO3 __getbuffer__ — raw pointer exposed to Python, no copy
3. NumPy frombuffer() — wraps the pointer as an ndarray; 120 µs for 431 MB
4. MLX mx.array — Metal GPU reads directly from the same physical memory
5. GPU inference — zero memory copies from disk to compute
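The core trick — reinterpreting mapped file bytes in place instead of copying them — can be demonstrated with stdlib Python alone: a `memoryview` cast over an `mmap` is the standard-library analogue of NumPy's `frombuffer()`.

```python
import mmap
import os
import struct
import tempfile

# Write a small "shard" of float32 values to disk (native byte order).
values = [0.0, 1.5, -2.25, 3.0]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"{len(values)}f", *values))
    path = f.name

# Map the file and reinterpret its bytes as floats without copying:
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        view = memoryview(m).cast("f")  # zero-copy reinterpretation
        loaded = list(view)             # data is copied only here, for printing
        view.release()                  # release the buffer before the mmap closes

os.unlink(path)
print(loaded)  # [0.0, 1.5, -2.25, 3.0]
```

The `memoryview.cast` never touches the underlying pages; the OS faults them in lazily, which is why wrapping a 431 MB shard takes microseconds rather than the seconds a full read-and-copy would cost.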

Implementation Roadmap

Five phases, strictly bottom-up — each produces a working milestone

1

P2P Mesh Networking

omni-net
Complete
  • QUIC/v1 transport with TLS 1.3
  • mDNS zero-config LAN discovery
  • Gossipsub pub/sub with strict validation
  • Request-response shard transfer protocol
  • OmniNet async channel API
2

GGUF Model Sharding

omni-store
Complete
  • Custom zero-copy GGUF parser (v2 & v3)
  • Layer-wise chunking by transformer block
  • BLAKE3 → CIDv1 content addressing
  • 64 MiB sliding-window shard transfer
  • CBOR-serialized model manifest
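The sliding-window transfer can be simulated in miniature. `sliding_window_send` below is a hypothetical, sequential toy model of the window/ack flow — not the libp2p implementation — and uses small chunks in place of 64 MiB ones.

```python
def sliding_window_send(data: bytes, chunk_size: int, window: int) -> bytes:
    """Simulate a sliding-window transfer: at most `window` chunks are
    in flight before the sender must wait for the oldest acknowledgement.

    Toy sequential model of the real libp2p request-response protocol.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    in_flight: list[int] = []
    received = bytearray()
    for seq in range(len(chunks)):
        in_flight.append(seq)
        if len(in_flight) == window:            # window full: wait for oldest ack
            received.extend(chunks[in_flight.pop(0)])
    while in_flight:                            # drain the remaining in-flight chunks
        received.extend(chunks[in_flight.pop(0)])
    return bytes(received)

data = bytes(range(256)) * 10
assert sliding_window_send(data, chunk_size=100, window=3) == data
```

Capping the number of in-flight chunks is what gives the sender backpressure: a slow receiver stalls the window instead of forcing unbounded buffering.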
3

FFI Bridge & Local Inference

omni-bridge
Complete
  • PyO3 0.23 + maturin native extension
  • Zero-copy __getbuffer__ over mmap
  • Store, Net, Pipeline Python bindings
  • Apple Silicon unified memory path
  • MLX GPU inference validated end-to-end
4

Pipeline Parallelism

omni-pipeline
Complete
  • Native GGUF-to-MLX bridge (no HuggingFace)
  • Decentralized RAM pooling (~50% RAM saved per node)
  • Pure-QUIC autoregressive ping-pong loop
  • GPipe micro-batch scheduling
  • RAM-proportional layer planner
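A RAM-proportional planner can be sketched as follows. `plan_layers` is a hypothetical illustration of the idea — contiguous layer ranges sized by each node's available RAM — not the crate's actual algorithm.

```python
def plan_layers(n_layers: int, ram_by_node: dict[str, int]) -> dict[str, range]:
    """Assign contiguous layer ranges proportional to each node's free RAM.

    Illustrative sketch; the omni-pipeline planner may differ in detail.
    """
    total = sum(ram_by_node.values())
    plan: dict[str, range] = {}
    start = 0
    nodes = list(ram_by_node.items())
    for i, (node, ram) in enumerate(nodes):
        if i == len(nodes) - 1:
            count = n_layers - start                 # last node takes the remainder
        else:
            count = round(n_layers * ram / total)    # proportional share of layers
        plan[node] = range(start, start + count)
        start += count
    return plan

# TinyLlama 1.1B has 22 transformer layers: an even RAM split
# assigns layers 0-10 to one node and 11-21 to the other.
print(plan_layers(22, {"mac-a": 8, "mac-b": 8}))
```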
5

zkML & SUM Chain Tokenomics

omni-zkml
Planned
  • Dual prover: ezkl (Halo2) + RISC Zero (STARK)
  • Per-stage proof aggregation
  • On-chain proof verification
  • Staking / slashing economy
  • Financial RLHF reward distribution

Live Demo

Two Macs, one model, zero central servers — real autoregressive inference over LAN

Autoregressive Ping-Pong Protocol

Sender Node (Machine A)
  • embed_tokens — token embedding
  • layers[0:N/2] — first half of transformer
  • KV Cache — independent, synced by order
~215 MB RAM (TinyLlama 1.1B)
Receiver Node (Machine B)
  • layers[N/2:] — second half of transformer
  • norm + lm_head — final projection
  • argmax → token — greedy decoding
~215 MB RAM (TinyLlama 1.1B)
Sender ── hidden_states (float16) ──> Receiver
Sender <── token_id (4 bytes LE) ── Receiver

Wire discriminator: hidden_dim == model_size → activation tensor  |  hidden_dim == 1 → token ID
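The discriminator can be sketched with `struct`. The `HIDDEN_DIM` value and helper names below are assumptions for illustration, and the activation payload is placeholder bytes rather than real float16 tensor data.

```python
import struct

HIDDEN_DIM = 2048  # assumed hidden size for TinyLlama 1.1B, for illustration

def encode_token(token_id: int) -> bytes:
    """A token ID travels as a single little-endian u32 (hidden_dim == 1)."""
    return struct.pack("<I", token_id)

def decode_message(payload: bytes) -> tuple[str, int]:
    """Classify a frame by element count, as the wire discriminator above does."""
    if len(payload) == 4:                 # hidden_dim == 1  -> token ID
        return "token", struct.unpack("<I", payload)[0]
    n = len(payload) // 2                 # float16 = 2 bytes per element
    if n % HIDDEN_DIM == 0:               # hidden_dim == model size -> activations
        return "activations", n
    raise ValueError("unrecognized frame")

kind, value = decode_message(encode_token(42))
print(kind, value)  # token 42
```

Discriminating on frame size keeps the hot loop free of any header parsing: each direction of the ping-pong carries exactly one message shape.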

Quick Start

# Prerequisites
pip install mlx mlx-lm transformers rich

# Machine A — Sender (embed + first half of layers)
python showcase_tui.py sender /path/to/model.gguf

# Machine B — Receiver (second half + norm + lm_head)
python showcase_tui.py receiver /path/to/model.gguf

Announcements

Latest updates from the OmniNode Protocol team

Talk

AI Tinkerers Talk Submission

We've submitted a live demo talk to AI Tinkerers: two MacBooks running distributed LLM inference on stage, showcasing the native GGUF bridge, RAM pooling, and the QUIC ping-pong protocol. All code is open source.

Milestone

Phase 4 Complete — Pipeline Parallelism is Live

Two Apple Silicon Macs can now split a model in half and run real autoregressive LLM inference across a LAN. Native GGUF-to-MLX bridge, decentralized RAM pooling, and pure-QUIC tensor routing — all working end-to-end with zero central servers.

Engineering

Native GGUF-to-MLX Bridge Replaces HuggingFace

We've eliminated all dependency on mlx_lm.load() and HuggingFace Hub. Model weights are now loaded directly from .gguf files using Apple's mx.load() API with a custom GGUF→MLX tensor key mapping. Architecture is inferred entirely from GGUF metadata.

Infrastructure

The iroh Pivot — Custom Shard Transfer Protocol

Discovered an unresolvable hickory-resolver feature conflict between iroh and libp2p 0.55. Built a custom 64 MiB sliding-window shard transfer protocol directly on libp2p request-response. Full control over wire format, chunking, and backpressure.