AI Tinkerers Talk Submission
We've submitted a live demo talk to AI Tinkerers: two MacBooks running distributed LLM inference on stage, showcasing the native GGUF bridge, RAM pooling, and the QUIC ping-pong protocol. All code is open source.
Any device with a chip becomes a node. Pool consumer hardware into an omnipotent network that runs massive LLMs — no GPU cluster required.
The protocol is built on four foundational mechanisms:
Pipeline parallelism shards model layers across devices, routing hidden-state tensors over a low-latency P2P mesh via pure QUIC.
Consumer devices pool their unified memory to run models no single device could hold.
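Splitting a model's transformer blocks into contiguous per-device ranges can be sketched as below. `partition_layers` is a hypothetical helper for illustration, not part of the omni-pipeline crate; it assumes layers are assigned in contiguous, near-equal ranges.

```rust
/// Split `n_layers` transformer blocks into contiguous, near-equal
/// ranges, one per device. Illustrative only; the real scheduling
/// logic in omni-pipeline is not shown here.
fn partition_layers(n_layers: usize, n_devices: usize) -> Vec<std::ops::Range<usize>> {
    let base = n_layers / n_devices;
    let extra = n_layers % n_devices; // first `extra` devices take one more layer
    let mut ranges = Vec::with_capacity(n_devices);
    let mut start = 0;
    for d in 0..n_devices {
        let len = base + if d < extra { 1 } else { 0 };
        ranges.push(start..start + len);
        start += len;
    }
    ranges
}
```

For two devices and a 32-layer model, this yields `[0..16, 16..32]` — the "first half / second half" split used in the demo.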
GGUF files are chunked by transformer block, content-addressed (BLAKE3 → CIDv1), and distributed via a custom 64 MiB sliding-window protocol.
No centralized model hosting. Weights are resilient, deduplicated, and globally available.
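The per-block chunking can be sketched as a manifest built from byte offsets. `Chunk` and `chunk_by_block` are hypothetical names, and the sketch assumes each transformer block's tensors occupy one contiguous byte range in the GGUF file; the BLAKE3 → CIDv1 step is omitted to keep the example dependency-free (the real system would hash each chunk and address it by CID).

```rust
/// Illustrative chunk-manifest entry: one chunk per transformer block.
/// In the real omni-store each chunk would additionally carry a CIDv1
/// derived from its BLAKE3 hash; that step is elided here.
#[derive(Debug, PartialEq)]
struct Chunk {
    block: usize, // transformer block index
    offset: u64,  // byte offset into the .gguf file
    len: u64,     // chunk length in bytes
}

/// Build the manifest from the byte offset where each block's tensors
/// begin, plus a final end-of-data offset.
fn chunk_by_block(block_offsets: &[u64]) -> Vec<Chunk> {
    block_offsets
        .windows(2)
        .enumerate()
        .map(|(block, w)| Chunk { block, offset: w[0], len: w[1] - w[0] })
        .collect()
}
```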
Federated learning lets contributors train locally on private data, uploading only weight gradients — never raw data.
Data sovereignty preserved. No central entity sees user data.
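The aggregation step behind this can be illustrated with plain federated averaging: the coordinator only ever sees gradient vectors, never training examples. `fed_avg` is an illustrative helper under the assumption of equal-weight contributors, not the protocol's actual API.

```rust
/// Minimal federated-averaging sketch: element-wise mean of the
/// gradient vectors uploaded by contributors. Assumes a non-empty
/// set of equally weighted, equal-length gradients.
fn fed_avg(gradients: &[Vec<f32>]) -> Vec<f32> {
    let n = gradients.len() as f32;
    let dim = gradients[0].len();
    let mut avg = vec![0.0f32; dim];
    for g in gradients {
        for (a, &v) in avg.iter_mut().zip(g) {
            *a += v / n;
        }
    }
    avg
}
```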
zkML cryptographically proves correct inference. A financial RLHF system stakes tokens, rewards quality, and slashes dishonest nodes.
Trustless verification. Economic alignment between node operators and users.
7-crate Rust workspace — edition 2024, strictly bottom-up
- omni-zkml — Zero-knowledge ML proofs (planned)
- omni-node — CLI binary: listen, shard, fetch
- omni-bridge — PyO3 FFI with zero-copy buffer protocol
- omni-pipeline — Pipeline parallelism coordination
- omni-net — P2P mesh networking (libp2p 0.55, QUIC)
- omni-store — GGUF sharding & content-addressed storage
- omni-types — Shared types, errors & config
Five phases, strictly bottom-up — each produces a working milestone
omni-net → omni-store → omni-bridge → omni-pipeline → omni-zkml

Two Macs, one model, zero central servers — real autoregressive inference over LAN
Machine A
- embed_tokens — token embedding
- layers[0:N/2] — first half of transformer
- KV Cache — independent, synced by order

Machine B
- layers[N/2:] — second half of transformer
- norm + lm_head — final projection
- argmax → token — greedy decoding

Wire discriminator: hidden_dim == model_size → activation tensor | hidden_dim == 1 → token ID
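The wire discriminator above can be sketched as a simple classification on payload length: a tensor whose length equals the model's hidden size is an activation to forward into the next stage, while a length-1 payload is a decoded token ID. The enum and function names here are illustrative, not omni-pipeline's actual wire types.

```rust
/// Illustrative decode of the demo's wire discriminator.
#[derive(Debug, PartialEq)]
enum WireMsg {
    Activation(Vec<f32>), // hidden-state tensor for one position
    Token(u32),           // greedy-decoded token ID
}

/// Classify a received payload by length: hidden_dim == model size
/// means activation tensor; length 1 means token ID.
fn classify(payload: &[f32], hidden_dim: usize) -> Option<WireMsg> {
    match payload.len() {
        1 => Some(WireMsg::Token(payload[0] as u32)),
        n if n == hidden_dim => Some(WireMsg::Activation(payload.to_vec())),
        _ => None, // malformed frame
    }
}
```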
Latest updates from the OmniNode Protocol team
Two Apple Silicon Macs can now split a model in half and run real autoregressive LLM inference across a LAN. Native GGUF-to-MLX bridge, decentralized RAM pooling, and pure-QUIC tensor routing — all working end-to-end with zero central servers.
We've eliminated all dependency on mlx_lm.load() and HuggingFace Hub. Model weights are now loaded directly from .gguf files using Apple's mx.load() API with a custom GGUF→MLX tensor key mapping. Architecture is inferred entirely from GGUF metadata.
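The shape of such a key mapping can be sketched as below, assuming llama.cpp-style GGUF tensor names (`blk.N.attn_q.weight`, `token_embd.weight`, …) mapped to Hugging Face-style keys. The exact mapping table in omni-bridge may differ; this is an illustration of the technique, not the bridge's actual code.

```rust
/// Illustrative GGUF → MLX tensor-key remapping for a Llama-style model.
fn remap_key(gguf: &str) -> String {
    match gguf {
        "token_embd.weight" => "model.embed_tokens.weight".to_string(),
        "output_norm.weight" => "model.norm.weight".to_string(),
        "output.weight" => "lm_head.weight".to_string(),
        k if k.starts_with("blk.") => {
            // e.g. "blk.12.attn_q.weight" -> ("12", "attn_q.weight")
            let rest = &k[4..];
            let (idx, tail) = rest.split_once('.').unwrap_or((rest, ""));
            let tail = match tail {
                "attn_q.weight" => "self_attn.q_proj.weight",
                "attn_k.weight" => "self_attn.k_proj.weight",
                "attn_v.weight" => "self_attn.v_proj.weight",
                "attn_output.weight" => "self_attn.o_proj.weight",
                "ffn_gate.weight" => "mlp.gate_proj.weight",
                "ffn_up.weight" => "mlp.up_proj.weight",
                "ffn_down.weight" => "mlp.down_proj.weight",
                "attn_norm.weight" => "input_layernorm.weight",
                "ffn_norm.weight" => "post_attention_layernorm.weight",
                other => other, // pass through unknown suffixes
            };
            format!("model.layers.{idx}.{tail}")
        }
        other => other.to_string(),
    }
}
```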
We discovered an unresolvable hickory-resolver feature conflict between iroh and libp2p 0.55, so we built a custom 64 MiB sliding-window shard transfer protocol directly on libp2p request-response — giving us full control over wire format, chunking, and backpressure.
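The window accounting at the heart of such a protocol can be sketched as follows: the sender caps unacknowledged bytes at 64 MiB and refuses to send more until earlier chunks are acknowledged. `SendWindow` and its methods are illustrative names under that assumption, not the actual omni-net implementation, and the libp2p transport layer is omitted.

```rust
/// Sketch of 64 MiB sliding-window backpressure for shard transfer.
const WINDOW: u64 = 64 * 1024 * 1024;

struct SendWindow {
    in_flight: u64, // bytes sent but not yet acknowledged
}

impl SendWindow {
    fn new() -> Self {
        SendWindow { in_flight: 0 }
    }

    /// Try to reserve space for a chunk; returns false (backpressure)
    /// when sending it would exceed the window.
    fn try_send(&mut self, len: u64) -> bool {
        if self.in_flight + len <= WINDOW {
            self.in_flight += len;
            true
        } else {
            false
        }
    }

    /// Receiver acknowledged `len` bytes; slide the window forward.
    fn ack(&mut self, len: u64) {
        self.in_flight = self.in_flight.saturating_sub(len);
    }
}
```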