Phase 4 Complete — Pipeline Parallelism Live

Trustless, Peer-to-Peer
AI Inference Network

Any device with a chip becomes a node. Pool consumer hardware into a single network capable of running massive LLMs — no GPU cluster required.

  • 7 Rust crates
  • 0 memory copies
  • 50% RAM saved per node
  • < 2s peer discovery

Four Pillars

The protocol is built on four foundational mechanisms

Compute

Pipeline parallelism shards model layers across devices, routing hidden-state tensors over a low-latency P2P mesh via pure QUIC.

Consumer devices pool their unified memory to run models no single device could hold.


Storage

GGUF files are chunked by transformer block, content-addressed (BLAKE3 → CIDv1), and distributed via a custom 64 MiB sliding-window protocol.

No centralized model hosting. Weights are resilient, deduplicated, and globally available.
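The chunk-and-address step can be sketched in a few lines. This is illustrative only: the real pipeline hashes with BLAKE3 and encodes digests as CIDv1, but `hashlib.sha256` stands in here because BLAKE3 is not in Python's standard library, and `chunk_and_address` is a hypothetical helper, not protocol code.

```python
import hashlib

def chunk_and_address(weights: bytes, chunk_size: int) -> list[tuple[str, bytes]]:
    """Split a weight blob into fixed-size chunks and content-address each one.

    Stand-in sketch: the real protocol uses BLAKE3 -> CIDv1, not SHA-256 hex.
    """
    chunks = []
    for offset in range(0, len(weights), chunk_size):
        chunk = weights[offset:offset + chunk_size]
        cid = hashlib.sha256(chunk).hexdigest()  # stand-in for BLAKE3 -> CIDv1
        chunks.append((cid, chunk))
    return chunks

blob = bytes(range(256)) * 4  # toy "weights": four identical 256-byte blocks
chunks = chunk_and_address(blob, 256)
print(len(chunks))  # 4
```

Because identical chunks hash to the same address, the four blocks above all map to one CID — which is exactly the property that makes content-addressed storage deduplicate repeated weight blocks for free.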


Privacy

Federated learning lets contributors train locally on private data, uploading only weight gradients — never the raw data.

Data sovereignty preserved. No central entity sees user data.

Incentives

zkML cryptographically proves correct inference. A financial RLHF system requires nodes to stake tokens, rewards quality responses, and slashes dishonest nodes.

Trustless verification. Economic alignment between node operators and users.
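The stake / reward / slash loop can be illustrated with a toy ledger. The class name, slash rate, and settlement flow below are assumptions for illustration, not the protocol's actual tokenomics.

```python
from dataclasses import dataclass

@dataclass
class Node:
    stake: float
    reward: float = 0.0

class StakingLedger:
    """Toy model of the incentive mechanism: honest work is paid,
    work that fails proof verification is slashed.  Illustrative only."""

    def __init__(self, slash_rate: float = 0.5):
        self.slash_rate = slash_rate
        self.nodes: dict[str, Node] = {}

    def register(self, node_id: str, stake: float) -> None:
        self.nodes[node_id] = Node(stake=stake)

    def settle(self, node_id: str, proof_valid: bool, payment: float) -> None:
        node = self.nodes[node_id]
        if proof_valid:
            node.reward += payment              # valid zk proof: pay the node
        else:
            node.stake *= 1 - self.slash_rate   # invalid proof: slash the stake

ledger = StakingLedger()
ledger.register("node-a", stake=100.0)
ledger.settle("node-a", proof_valid=False, payment=1.0)
print(ledger.nodes["node-a"].stake)  # 50.0
```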

Architecture

7-crate Rust workspace — edition 2024, strictly bottom-up

omni-zkml

Zero-knowledge ML proofs (planned)

omni-node

CLI binary — listen, shard, fetch

omni-bridge

PyO3 FFI with zero-copy buffer protocol

omni-pipeline

Pipeline parallelism coordination

omni-net

P2P mesh networking (libp2p 0.55, QUIC)

omni-store

GGUF sharding & content-addressed storage

omni-types

Shared types, errors & config

Apple Silicon Zero-Copy Path

1. Rust mmap (GGUF shard file) — file mapped into unified memory; CPU and GPU share physical RAM
2. PyO3 __getbuffer__ — raw pointer exposed to Python, no copy
3. NumPy frombuffer() — wraps the pointer as an ndarray; 120 µs for 431 MB
4. MLX mx.array — Metal GPU reads directly from the same physical memory
5. GPU inference — zero memory copies from disk to compute
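The core trick — reinterpreting mapped file bytes in place instead of copying them — can be demonstrated with stdlib Python alone: a `memoryview` cast over an `mmap` is the standard-library analogue of NumPy's `frombuffer()`.

```python
import mmap
import os
import struct
import tempfile

# Write a small "shard" of float32 values to disk (native byte order).
values = [0.0, 1.5, -2.25, 3.0]
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack(f"{len(values)}f", *values))
    path = f.name

# Map the file and reinterpret its bytes as floats without copying:
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        view = memoryview(m).cast("f")  # zero-copy reinterpretation
        loaded = list(view)             # data is copied only here, for printing
        view.release()                  # release the buffer before the mmap closes

os.unlink(path)
print(loaded)  # [0.0, 1.5, -2.25, 3.0]
```

The `memoryview.cast` never touches the underlying pages; the OS faults them in lazily, which is why wrapping a 431 MB shard takes microseconds rather than the seconds a full read-and-copy would cost.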

Implementation Roadmap

Five phases, strictly bottom-up — each produces a working milestone

1

P2P Mesh Networking

omni-net
Complete
  • QUIC/v1 transport with TLS 1.3
  • mDNS zero-config LAN discovery
  • Gossipsub pub/sub with strict validation
  • Request-response shard transfer protocol
  • OmniNet async channel API
2

GGUF Model Sharding

omni-store
Complete
  • Custom zero-copy GGUF parser (v2 & v3)
  • Layer-wise chunking by transformer block
  • BLAKE3 → CIDv1 content addressing
  • 64 MiB sliding-window shard transfer
  • CBOR-serialized model manifest
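The sliding-window transfer can be simulated in miniature. `sliding_window_send` below is a hypothetical, sequential toy model of the window/ack flow — not the libp2p implementation — and uses small chunks in place of 64 MiB ones.

```python
def sliding_window_send(data: bytes, chunk_size: int, window: int) -> bytes:
    """Simulate a sliding-window transfer: at most `window` chunks are
    in flight before the sender must wait for the oldest acknowledgement.

    Toy sequential model of the real libp2p request-response protocol.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    in_flight: list[int] = []
    received = bytearray()
    for seq in range(len(chunks)):
        in_flight.append(seq)
        if len(in_flight) == window:            # window full: wait for oldest ack
            received.extend(chunks[in_flight.pop(0)])
    while in_flight:                            # drain the remaining in-flight chunks
        received.extend(chunks[in_flight.pop(0)])
    return bytes(received)

data = bytes(range(256)) * 10
assert sliding_window_send(data, chunk_size=100, window=3) == data
```

Capping the number of in-flight chunks is what gives the sender backpressure: a slow receiver stalls the window instead of forcing unbounded buffering.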
3

FFI Bridge & Local Inference

omni-bridge
Complete
  • PyO3 0.23 + maturin native extension
  • Zero-copy __getbuffer__ over mmap
  • Store, Net, Pipeline Python bindings
  • Apple Silicon unified memory path
  • MLX GPU inference validated end-to-end
4

Pipeline Parallelism

omni-pipeline
Complete
  • Native GGUF-to-MLX bridge (no HuggingFace)
  • Decentralized RAM pooling (~50% RAM saved per node)
  • Pure-QUIC autoregressive ping-pong loop
  • GPipe micro-batch scheduling
  • RAM-proportional layer planner
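A RAM-proportional planner can be sketched as follows. `plan_layers` is a hypothetical illustration of the idea — contiguous layer ranges sized by each node's available RAM — not the crate's actual algorithm.

```python
def plan_layers(n_layers: int, ram_by_node: dict[str, int]) -> dict[str, range]:
    """Assign contiguous layer ranges proportional to each node's free RAM.

    Illustrative sketch; the omni-pipeline planner may differ in detail.
    """
    total = sum(ram_by_node.values())
    plan: dict[str, range] = {}
    start = 0
    nodes = list(ram_by_node.items())
    for i, (node, ram) in enumerate(nodes):
        if i == len(nodes) - 1:
            count = n_layers - start                 # last node takes the remainder
        else:
            count = round(n_layers * ram / total)    # proportional share of layers
        plan[node] = range(start, start + count)
        start += count
    return plan

# TinyLlama 1.1B has 22 transformer layers: an even RAM split
# assigns layers 0-10 to one node and 11-21 to the other.
print(plan_layers(22, {"mac-a": 8, "mac-b": 8}))
```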
5

zkML & SUM Chain Tokenomics

omni-zkml
Planned
  • Dual prover: ezkl (Halo2) + RISC Zero (STARK)
  • Per-stage proof aggregation
  • On-chain proof verification
  • Staking / slashing economy
  • Financial RLHF reward distribution

Live Demo

Two Macs, one model, zero central servers — real autoregressive inference over LAN

Autoregressive Ping-Pong Protocol

Sender Node (Machine A)
  • embed_tokens — token embedding
  • layers[0:N/2] — first half of transformer
  • KV Cache — independent, synced by order
~215 MB RAM (TinyLlama 1.1B)
Receiver Node (Machine B)
  • layers[N/2:] — second half of transformer
  • norm + lm_head — final projection
  • argmax → token — greedy decoding
~215 MB RAM (TinyLlama 1.1B)
Sender ── hidden_states (float16) ──> Receiver
Sender <── token_id (4 bytes LE) ── Receiver

Wire discriminator: hidden_dim == model_size → activation tensor  |  hidden_dim == 1 → token ID
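The discriminator can be sketched with `struct`. The `HIDDEN_DIM` value and helper names below are assumptions for illustration, and the activation payload is placeholder bytes rather than real float16 tensor data.

```python
import struct

HIDDEN_DIM = 2048  # assumed hidden size for TinyLlama 1.1B, for illustration

def encode_token(token_id: int) -> bytes:
    """A token ID travels as a single little-endian u32 (hidden_dim == 1)."""
    return struct.pack("<I", token_id)

def decode_message(payload: bytes) -> tuple[str, int]:
    """Classify a frame by element count, as the wire discriminator above does."""
    if len(payload) == 4:                 # hidden_dim == 1  -> token ID
        return "token", struct.unpack("<I", payload)[0]
    n = len(payload) // 2                 # float16 = 2 bytes per element
    if n % HIDDEN_DIM == 0:               # hidden_dim == model size -> activations
        return "activations", n
    raise ValueError("unrecognized frame")

kind, value = decode_message(encode_token(42))
print(kind, value)  # token 42
```

Discriminating on frame size keeps the hot loop free of any header parsing: each direction of the ping-pong carries exactly one message shape.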

Quick Start

# Prerequisites
pip install mlx mlx-lm transformers rich

# Machine A — Sender (embed + first half of layers)
python showcase_tui.py sender /path/to/model.gguf

# Machine B — Receiver (second half + norm + lm_head)
python showcase_tui.py receiver /path/to/model.gguf

Announcements

Latest updates from the OmniNode Protocol team

Talk

AI Tinkerers Talk Submission

We've submitted a live demo talk to AI Tinkerers: two MacBooks running distributed LLM inference on stage, showcasing the native GGUF bridge, RAM pooling, and the QUIC ping-pong protocol. All code is open source.

Milestone

Phase 4 Complete — Pipeline Parallelism is Live

Two Apple Silicon Macs can now split a model in half and run real autoregressive LLM inference across a LAN. Native GGUF-to-MLX bridge, decentralized RAM pooling, and pure-QUIC tensor routing — all working end-to-end with zero central servers.

Engineering

Native GGUF-to-MLX Bridge Replaces HuggingFace

We've eliminated all dependency on mlx_lm.load() and HuggingFace Hub. Model weights are now loaded directly from .gguf files using Apple's mx.load() API with a custom GGUF→MLX tensor key mapping. Architecture is inferred entirely from GGUF metadata.

Infrastructure

The iroh Pivot — Custom Shard Transfer Protocol

Discovered an unresolvable hickory-resolver feature conflict between iroh and libp2p 0.55. Built a custom 64 MiB sliding-window shard transfer protocol directly on libp2p request-response. Full control over wire format, chunking, and backpressure.