Blog · Jun 29, 2026 · 6 min read

Exo Labs: Run Frontier AI Models Locally Across Multiple Devices

Learn how Exo Labs clusters Apple Silicon Macs to run 235B+ parameter models locally — OpenAI-compatible APIs, zero cloud bills, full data privacy.

Running billion-parameter AI models locally used to require expensive server hardware. Exo Labs changes that by connecting everyday Apple Silicon Macs into a distributed inference cluster capable of running models like Qwen3-235B, DeepSeek v3.1 671B, and Kimi K2 Thinking — no cloud bills, no data leaving your network.

With NVIDIA's DGX Spark covering GPU-based setups, local AI infrastructure in 2026 is more accessible than ever. Here is a practical guide to getting started.

Why Local AI Matters in 2026

Three forces are pushing developers toward local inference:

Cost control: High-volume workloads hit cloud API rate limits and generate unpredictable per-token bills. Local inference replaces those bills with a one-time hardware investment, and throughput is capped by your own hardware rather than by a provider's quota.

Privacy and compliance: Data-sensitive industries such as legal, healthcare, and finance need inference pipelines that never touch external servers. In MENA markets, Tunisia's personal data protection law (enforced by the INPDP) and Saudi Arabia's PDPL restrict how personal data may leave the country, which makes data residency a design requirement.

Capability parity: Open-weight models like Qwen3-235B and DeepSeek v3.1 now compete with cloud frontier models on many benchmarks. The quality gap that once justified cloud dependency has narrowed dramatically.

What Is Exo Labs?

Exo is an open-source framework that transforms a group of Apple Silicon Macs into a unified local AI cluster. It handles:

  • Automatic device discovery: Devices running Exo identify each other on the local network without manual configuration
  • Topology-aware model splitting: Splits the model across devices based on each device's available memory and the links between them, with tensor-parallel sharding available for multi-device speedups
  • RDMA over Thunderbolt 5: Cuts inter-device latency by a reported 99% on macOS 26.2+ with Thunderbolt 5 connections
  • Standard-compatible APIs: Supports OpenAI Chat Completions, Claude Messages API, and Ollama — your existing tools and SDKs work unchanged

How Model Sharding Works

When you load Qwen3-235B at 4-bit quantization (roughly 120 GB of weights, since 235 billion parameters at half a byte each come to about 117.5 GB), Exo distributes the model layers across your connected devices. Each device handles a subset of transformer layers and passes activations to the next node.
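To make the layer-splitting idea concrete, here is a minimal Python sketch that assigns contiguous layer ranges to devices in proportion to their memory. It is an illustration only, not Exo's actual placement algorithm: the function name, the device names, their memory sizes, and the layer count of 96 are all assumptions chosen for the example.

```python
def partition_layers(num_layers: int, device_memory_gb: dict[str, float]) -> dict[str, range]:
    """Assign contiguous layer ranges to devices in proportion to their memory.

    Illustrative only: a real scheduler would also weigh link bandwidth,
    latency, and per-layer size differences.
    """
    total_memory = sum(device_memory_gb.values())
    names = list(device_memory_gb)
    assignments: dict[str, range] = {}
    start = 0
    cumulative = 0.0
    for index, name in enumerate(names):
        cumulative += device_memory_gb[name]
        if index == len(names) - 1:
            # The last device always ends at num_layers so rounding never drops a layer.
            end = num_layers
        else:
            end = round(num_layers * cumulative / total_memory)
        assignments[name] = range(start, end)
        start = end
    return assignments


# Hypothetical three-device cluster (names and memory sizes are made up).
cluster = {"mac-studio-a": 192, "mac-studio-b": 128, "macbook-pro": 64}
plan = partition_layers(96, cluster)  # 96 layers is an illustrative count
for name, layers in plan.items():
    print(f"{name}: layers {layers.start}-{layers.stop - 1} ({len(layers)} layers)")
```

Each device receives a contiguous block, so activations only cross a device boundary once per block, which matches the "pass activations to the next node" flow described above.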

Performance scaling:

  • 2-device cluster: Up to 1.8x speedup over a single device
  • 4-device cluster: Up to 3.2x speedup
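One way to read these figures is as parallel efficiency, meaning speedup divided by device count. The short sketch below does that arithmetic on the two numbers quoted above; the function name is an assumption, and the speedups are the "up to" values from this post rather than measurements.

```python
def parallel_efficiency(speedup: float, devices: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / devices


# "Up to" speedups quoted in the list above, keyed by device count.
reported_speedups = {2: 1.8, 4: 3.2}

for devices, speedup in reported_speedups.items():
    efficiency = parallel_efficiency(speedup, devices)
    print(f"{devices} devices: {speedup}x speedup, {efficiency:.0%} of linear scaling")
```

The drop from two to four devices suggests that communication overhead grows with cluster size, so adding devices helps most when the model does not fit on fewer of them.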

The Exo dashboard (accessible at http://localhost:52415/) shows live device topology, memory usage per node, and active model assignments.

Installation

Prerequisites: Xcode Command Line Tools, Homebrew, uv (Python package manager), Node.js, Rust nightly

Option 1: macOS App (simplest)

brew install --cask exo

Option 2: From Source

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
 
# Clone repository and build dashboard
git clone https://github.com/exo-explore/exo
cd exo/dashboard && npm install && npm run build && cd ..
 
# Launch the cluster node
uv run exo

The old pip install exo-explore is deprecated — use uv for all installations.

Run uv run exo on every device you want in the cluster. They discover each other automatically over the local network.

Running Your First Model

Query the running cluster using the standard OpenAI Chat Completions format:

curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [
      {"role": "user", "content": "Explain model sharding in simple terms."}
    ]
  }'

Or use the Claude Messages API format:

curl http://localhost:52415/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-235B-A22B-4bit",
    "messages": [
      {"role": "user", "content": "What is distributed inference?"}
    ],
    "max_tokens": 1024
  }'

Any SDK that speaks the OpenAI-compatible API, such as LangChain, LlamaIndex, or the Vercel AI SDK, can be pointed at the cluster by setting its base URL to http://localhost:52415/v1; the rest of your code stays unchanged.
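For scripts that should not depend on any SDK, the same Chat Completions call can be sketched with Python's standard library alone. The endpoint and model name come from the curl example above; the helper names (build_chat_request, extract_reply, ask) and the timeout value are assumptions for illustration, and the response parsing assumes the node returns the standard OpenAI response shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:52415/v1"  # same host and port as the curl examples


def build_chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request in OpenAI Chat Completions format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a Chat Completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to the local cluster and return the reply text."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With a node running, ask("mlx-community/Llama-3.2-1B-Instruct-4bit", "Explain model sharding in simple terms.") should return the model's answer as a string. Large models can take a while to produce a first token, so a generous timeout is a sensible default.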

Supported Models

Exo loads any MLX-compatible model from HuggingFace Hub. Key models available today:

Model | Parameters | Cluster Requirement
Llama 3.x Instruct (4-bit) | 1B – 70B | Single device
Qwen3-235B-A22B (4-bit) | 235B | 2+ device cluster
DeepSeek v3.1 (4-bit) | 671B | 4+ device cluster
Kimi K2 Thinking | ~1T | 4+ high-memory devices

For single-device testing, start with a 1B or 8B model. Once the cluster is stable, graduate to the larger models.
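A rough sizing check helps when deciding which row of the table a given cluster can handle. The sketch below estimates weight memory from parameter count and quantization width, then derives a minimum device count. The function names, the 128 GB per-device figure, and the 75% usable-memory fraction are assumptions for illustration; real requirements also depend on KV cache size, context length, and runtime overhead.

```python
import math


def weight_memory_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8


def min_devices(
    params_billions: float,
    device_memory_gb: float,
    usable_fraction: float = 0.75,  # assumed share of memory available to the model
    bits_per_weight: int = 4,
) -> int:
    """Smallest number of identical devices whose usable memory holds the weights."""
    usable_gb = device_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billions, bits_per_weight) / usable_gb)


# Parameter counts from the table above; 128 GB per device is a hypothetical cluster.
for name, params in [("Qwen3-235B-A22B", 235), ("DeepSeek v3.1", 671), ("Kimi K2 Thinking", 1000)]:
    size = weight_memory_gb(params)
    count = min_devices(params, device_memory_gb=128)
    print(f"{name}: ~{size:.1f} GB of weights, at least {count} x 128 GB devices")
```

Treat the result as a lower bound: it only confirms that the weights fit, not that there is headroom for long contexts or concurrent requests.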

Multi-Device Networking

For standard Ethernet or Wi-Fi networks, Exo handles inter-device communication automatically. For maximum performance, connect devices via Thunderbolt 5 daisy-chaining. macOS 26.2 or later is required to enable the RDMA path, which removes most of the network-stack overhead and sharply reduces inter-device latency.

Three-device ring and four-device switch topologies are both supported. Exo selects the optimal communication pattern based on detected network links.

NVIDIA DGX Spark: The Enterprise Alternative

For teams on NVIDIA hardware, DGX Spark provides a comparable local inference experience. The NemoClaw stack installs in a single command:

curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash

This automates Node.js installation, OpenShell sandbox setup, and model download (Qwen3.6-35B by default). NVFP4 quantization with vLLM delivers 2.6x faster inference compared to baseline on the DGX Spark hardware.

Multi-node DGX clusters support 256 GB to 512 GB of unified memory across 2 to 4 units, connected via ConnectX-7 networking at 200 Gbps RoCE — the enterprise equivalent of Thunderbolt 5 clustering.

Practical Use Cases for MENA Teams

Regulated data pipelines: Process invoices, contracts, and customer records through a local model without exposing data to third-party APIs. Keeping inference on your own hardware supports data residency requirements under Tunisia's data protection law (enforced by the INPDP) and Saudi Arabia's PDPL by design.

Offline-capable AI: Run inference in environments with limited connectivity — factory floors, remote sites, air-gapped networks. Exo keeps working when the internet does not.

Cost containment at scale: Replace recurring per-token cloud bills with a one-time Mac Studio or Mac Pro investment. For teams running thousands of daily inference calls, the break-even point can arrive within months, depending on token volume, cloud pricing, and hardware cost.

Development and testing: Run the same model your production cloud uses, locally, for faster iteration without egress costs.
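The cost-containment case above comes down to simple arithmetic, sketched here in Python. Every number in the example (token volume, price per million tokens, hardware cost, monthly running cost) is a hypothetical placeholder, not a quote from any provider or vendor; substitute your own figures.

```python
def break_even_months(
    hardware_cost: float,
    monthly_cloud_cost: float,
    monthly_local_running_cost: float = 0.0,
) -> float:
    """Months until cumulative cloud spend exceeds hardware plus local running costs."""
    monthly_savings = monthly_cloud_cost - monthly_local_running_cost
    if monthly_savings <= 0:
        # Local hardware never pays for itself if it costs more to run than the cloud bill.
        return float("inf")
    return hardware_cost / monthly_savings


# Hypothetical workload and prices, for illustration only.
tokens_per_day = 20_000_000
price_per_million_tokens = 2.00
monthly_cloud = tokens_per_day / 1_000_000 * price_per_million_tokens * 30

months = break_even_months(
    hardware_cost=10_000,
    monthly_cloud_cost=monthly_cloud,
    monthly_local_running_cost=50,  # assumed power and maintenance
)
print(f"Estimated cloud bill: ${monthly_cloud:,.0f}/month")
print(f"Break-even after about {months:.1f} months")
```

Low-volume workloads can push the break-even point out by years, which is why the estimate is worth running before buying hardware.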

Known Limitations

  • GPU acceleration requires Apple Silicon and macOS: Linux support exists but runs on CPU only; GPU acceleration for Linux is under active development
  • Thunderbolt 5 and macOS 26.2+ for RDMA: Older hardware and macOS versions still work but at higher inter-device latency
  • Large models require memory-rich clusters: DeepSeek v3.1 671B needs four Ultra-class Mac Studio units or equivalent; budget accordingly
  • MLX format only: Models must be in MLX-compatible format; GGUF models (used by Ollama) require a separate conversion step

Conclusion

Exo Labs makes frontier-scale AI inference practical on hardware many developers already own. As open-weight models close the gap with proprietary cloud offerings on benchmark after benchmark, the case for local AI infrastructure grows stronger — especially in privacy-sensitive, compliance-driven markets like MENA.

NVIDIA's DGX Spark offers an enterprise-grade path for teams already in the NVIDIA ecosystem. Together, these tools mark the beginning of a shift where the cloud is one option among many, not the only option for running capable AI.

To get started: brew install --cask exo, launch it on two Macs, and point your existing OpenAI SDK at http://localhost:52415. You likely have enough hardware to run something useful already.