For two years the answer to "how do I run a frontier model privately" was a single line: you can't, not without a rack of NVIDIA GPUs and a five-figure budget. In 2026 that answer changed. An open-source project called exo lets you pool the memory of several Apple Silicon Macs into one cluster and run models that no single machine could hold — a 671-billion-parameter model on eight Mac minis sitting on a shelf.
For businesses in the MENA region weighing data sovereignty against the cost and privacy trade-offs of cloud APIs, this is a genuinely new option. Let us unpack how it works and whether it belongs in your stack.
The core idea: pooling unified memory
A large language model's main constraint is memory. A 70-billion-parameter model in 8-bit precision needs roughly 70 GB just to hold its weights; a 671B model needs hundreds of gigabytes. No consumer machine ships with that much.
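The arithmetic behind those figures is simple enough to sketch. The helper below is illustrative only (the function name is ours, not part of exo), and it counts weights alone: the KV cache and activations need additional memory on top.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_weight


# Compare common precisions for the two model sizes discussed above.
for name, size in [("70B", 70), ("671B", 671)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

Halving the precision halves the footprint, which is why quantization matters as much as raw memory when sizing a cluster.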
Apple Silicon has one unusual property that makes it interesting here: unified memory. The CPU, GPU, and Neural Engine all share the same fast memory pool, so a Mac with 64 GB of unified memory can dedicate almost all of it to model weights. exo takes the next step — it stitches the unified memory of many Macs together so the combined pool can hold a model far larger than any one device.
Eight M4 Pro Mac minis with 64 GB each give you 512 GB of pooled memory. That is enough to load DeepSeek V3, at 671 billion parameters, in a 4-bit quantization (roughly 335 GB of weights) and serve it at around 5.37 tokens per second. That is faster than a 70B model runs on the same hardware, because DeepSeek V3 is a mixture-of-experts model that activates only a fraction of its weights per token.
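Why would a 671B model outrun a 70B one? Token generation is usually limited by how many bytes of weights must be read per token, so a rough comparison looks like the sketch below. Two assumptions are ours rather than the article's: both models use 4-bit weights, and DeepSeek V3 activates about 37 billion parameters per token (the figure DeepSeek publishes for the model).

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Decimal gigabytes occupied (or read) by this many parameters."""
    return params_billions * bits_per_weight / 8


POOL_GB = 8 * 64  # eight Mac minis with 64 GB each

# Total footprint decides whether the model fits in the pool at all.
total_moe_gb = weights_gb(671, 4)
print(f"DeepSeek V3 weights: {total_moe_gb} GB, fits in pool: {total_moe_gb < POOL_GB}")

# Bytes touched per generated token decide how fast decoding can go.
dense_per_token_gb = weights_gb(70, 4)  # dense model: every weight is used
moe_per_token_gb = weights_gb(37, 4)    # MoE: only the routed experts (assumed 37B)
print(f"Dense 70B reads ~{dense_per_token_gb} GB per token")
print(f"MoE 671B reads ~{moe_per_token_gb} GB per token")
```

The mixture-of-experts model needs far more memory to hold, but touches roughly half as many bytes per token as the dense 70B model, which is consistent with the benchmark above.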
How exo splits a model
exo uses two complementary strategies to spread a model across devices.
Pipeline parallelism slices the model into contiguous groups of layers — called shards — and assigns each shard to a different device. A token flows through device one's layers, then its small activation vector (typically under 4 KB) is passed to device two, and so on. Because only tiny activations cross the network, bandwidth is rarely the bottleneck for single requests.
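A toy version of pipeline parallelism, as a sketch rather than exo's actual code: the "layers" here are plain Python functions and the "devices" are just list entries, but the shape of the computation is the same. All names are hypothetical.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def make_shards(layers: List[Layer], n_devices: int) -> List[List[Layer]]:
    """Split the layers into n_devices contiguous, near-equal groups."""
    base, extra = divmod(len(layers), n_devices)
    shards, start = [], 0
    for device in range(n_devices):
        size = base + (1 if device < extra else 0)
        shards.append(layers[start:start + size])
        start += size
    return shards


def run_pipeline(shards: List[List[Layer]], activation: List[float]) -> List[float]:
    """Run one token's activation through every shard in order."""
    for shard in shards:  # each shard would live on a different device
        for layer in shard:
            activation = layer(activation)
        # In a real cluster, only `activation` crosses the network here.
    return activation


# Toy model: layer i adds i to every element of the activation.
layers = [lambda x, i=i: [v + i for v in x] for i in range(6)]
shards = make_shards(layers, 4)
shard_sizes = [len(s) for s in shards]
result = run_pipeline(shards, [0.0, 1.0])
print(shard_sizes, result)
```

Note that the devices work one after another for a single request. Pipelining lets a cluster hold a bigger model, but on its own it does not make one request faster.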
Tensor parallelism splits individual layers across devices so they compute in parallel, then combines the results. This is more network-intensive but, with a fast enough interconnect, it makes each request genuinely faster rather than just increasing throughput.
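To make "splits individual layers" concrete, here is a minimal sketch of one common scheme, column-wise splitting of a weight matrix. It is pure Python for readability and is not how exo or MLX implement it; the function names are ours.

```python
from typing import List

Matrix = List[List[float]]


def matvec(x: List[float], w: Matrix) -> List[float]:
    """Multiply a length-n vector by an n x m matrix stored as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w: Matrix, n_devices: int) -> List[Matrix]:
    """Give each device an equal slice of the output columns."""
    step = len(w[0]) // n_devices  # assumes the column count divides evenly
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(n_devices)]


def tensor_parallel_matvec(x: List[float], w: Matrix, n_devices: int) -> List[float]:
    parts = split_columns(w, n_devices)
    # On a cluster these partial products are computed at the same time.
    partials = [matvec(x, part) for part in parts]
    # Combining step: concatenate every device's slice of the output.
    return [value for partial in partials for value in partial]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
single = matvec(x, w)
parallel = tensor_parallel_matvec(x, w, 2)
print(single, parallel)
```

The combining step happens inside every layer, for every token, which is why this strategy is so sensitive to the latency of the link between devices.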
The clever part is that exo decides which strategy to use automatically. Every node scans the network in real time — measuring link type, latency, bandwidth, and available memory — and builds a topology map. It then places shards to match each device's resources, so a faster Mac Studio carries more layers than an older mini.
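The placement idea can be approximated with a memory-weighted split. This is a simplified sketch under our own assumptions: it looks only at free memory, whereas exo is described above as also weighing link type, latency, and bandwidth, and the function name is hypothetical.

```python
from typing import List, Tuple


def place_layers(n_layers: int, free_memory_gb: List[float]) -> List[Tuple[int, int]]:
    """Assign contiguous [start, end) layer ranges in proportion to free memory."""
    total = sum(free_memory_gb)
    shares = [n_layers * mem / total for mem in free_memory_gb]
    counts = [int(share) for share in shares]
    # Hand any leftover layers to the devices with the largest rounding loss.
    leftover = n_layers - sum(counts)
    by_loss = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_loss[:leftover]:
        counts[i] += 1
    ranges, start = [], 0
    for count in counts:
        ranges.append((start, start + count))
        start += count
    return ranges


# One large Mac Studio plus two smaller minis (memory figures are illustrative).
placement = place_layers(80, [192, 64, 64])
print(placement)
```

With those illustrative numbers the larger machine ends up with three times as many layers as each mini, mirroring the behaviour described above.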
Zero configuration, by design
You do not assign a master node or edit IP addresses. exo devices discover each other peer-to-peer (the project moved to the Zenoh protocol for this), forming a flat, egalitarian cluster. You install exo on each Mac, connect them, and the cluster assembles itself.
Just as important, exo speaks the APIs your tools already use. It is compatible with the OpenAI Chat Completions API, the Claude Messages API, the OpenAI Responses API, and the Ollama API. That means an app pointed at OpenAI can be redirected to your local cluster by changing a single base URL — no rewrite required.
# After installing exo on each Mac, the cluster exposes an OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Summarize our Q2 sales data."}]
  }'

Under the hood, exo runs on MLX, Apple's machine-learning framework tuned for the Metal GPU and unified memory, which is what makes Apple Silicon competitive for inference in the first place.
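For application code, the same call can be made from Python using only the standard library. This sketch reuses the URL and model name from the curl example; treat both as placeholders, since the port and model identifiers your exo version exposes may differ. The helper name is ours.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style Chat Completions request for the given base URL."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Switching between a hosted API and the local cluster is this one value.
LOCAL_BASE_URL = "http://localhost:8000/v1"  # placeholder, check your exo instance

request = build_chat_request(LOCAL_BASE_URL, "deepseek-v3", "Summarize our Q2 sales data.")
print(request.full_url)

# To send it (requires a running cluster):
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

Keeping the base URL in one configuration value is what makes the "change a single base URL" migration practical.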
The 2026 breakthrough: RDMA over Thunderbolt 5
Until recently, the weak link in a Mac cluster was the network between devices. Standard TCP over Thunderbolt added roughly 300 microseconds of latency per hop — fine for pipeline parallelism, but enough to erase the gains from tensor parallelism. Add more nodes and single-request speed often went down.
The 2026 release of exo 1.0 changes the math with day-zero support for RDMA (Remote Direct Memory Access) over Thunderbolt 5, available on macOS 26.2. RDMA lets one device read another's memory almost as if it were local, collapsing inter-device latency from around 300 microseconds to as little as 3 to 9 microseconds, a reduction of roughly 97 to 99 percent.
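A back-of-the-envelope model shows why that latency matters so much for tensor parallelism. The layer count and the two synchronizations per layer below are illustrative assumptions of ours (two per layer is typical of common tensor-parallel schemes), not measurements from exo.

```python
def sync_overhead_ms(n_layers: int, syncs_per_layer: int, hop_latency_us: float) -> float:
    """Time per token spent purely waiting on inter-device synchronization."""
    return n_layers * syncs_per_layer * hop_latency_us / 1000


def reduction_percent(before_us: float, after_us: float) -> float:
    """Percentage drop in latency going from before_us to after_us."""
    return 100 * (before_us - after_us) / before_us


N_LAYERS = 80        # assumed model depth
SYNCS_PER_LAYER = 2  # assumed synchronization points per layer

tcp_ms = sync_overhead_ms(N_LAYERS, SYNCS_PER_LAYER, 300)
rdma_ms = sync_overhead_ms(N_LAYERS, SYNCS_PER_LAYER, 9)
print(f"TCP:  {tcp_ms} ms of waiting per token")
print(f"RDMA: {rdma_ms} ms of waiting per token")
print(f"Latency reduction: {reduction_percent(300, 9)}% to {reduction_percent(300, 3)}%")
```

Under these assumptions, TCP spends 48 ms per token on waiting alone, which caps generation at about 20 tokens per second before any computation happens. With RDMA the same overhead falls to under 2 ms.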
The practical effect is that tensor parallelism finally scales the right way:
- 1.8× faster on 2 devices
- 3.2× faster on 4 devices
Adding hardware now adds speed instead of subtracting it, although real workloads can land below those headline figures. On a cluster of four high-end Mac Studios, Qwen3 at 235 billion parameters scales from about 19 tokens per second on one node to roughly 32 tokens per second across four (about 1.7×), which is still interactive speed for a model that would otherwise demand a data-center GPU.
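One way to read these numbers is parallel efficiency: the speedup divided by the number of devices. The sketch below applies that standard definition to the figures quoted above; the helper names are ours.

```python
def speedup(tps_single: float, tps_cluster: float) -> float:
    """How many times faster the cluster is than a single node."""
    return tps_cluster / tps_single


def parallel_efficiency(speedup_factor: float, n_devices: int) -> float:
    """Fraction of ideal linear scaling actually achieved (1.0 is perfect)."""
    return speedup_factor / n_devices


# Headline tensor-parallel figures.
for factor, devices in [(1.8, 2), (3.2, 4)]:
    print(f"{devices} devices: {parallel_efficiency(factor, devices):.0%} efficient")

# The Qwen3 235B example: about 19 tokens/s on one node, about 32 on four.
qwen_speedup = speedup(19, 32)
print(f"Qwen3 235B: {qwen_speedup:.2f}x, "
      f"{parallel_efficiency(qwen_speedup, 4):.0%} efficient")
```

Efficiency below 100 percent is expected. The useful question for a purchase decision is whether the extra tokens per second justify the extra machine.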
To enable RDMA you need Macs with Thunderbolt 5 ports, which means M4 Pro or M4 Max silicon or an M3 Ultra Mac Studio (base M4 chips use Thunderbolt 4 without RDMA), high-quality Thunderbolt 5 cables, and a one-time rdma_ctl enable from Recovery mode on each node. After that exo detects and prefers the RDMA links automatically.
Why a MENA business might care
Three reasons make this more than a hobbyist curiosity:
- Data sovereignty. Customer records, financial data, and unpublished strategy never leave your office. For regulated sectors and for organizations cautious about sending data abroad, on-premise inference removes an entire category of risk.
- Zero marginal cost. Cloud inference bills per token, and an active team can run that into thousands of dollars a month. A Mac cluster has an upfront hardware cost and an electricity bill — and Apple Silicon is remarkably power-efficient — but no per-request charge. Heavy, steady workloads amortize the hardware quickly.
- Reuse what you own. exo does not require a matched set. An older Mac mini retired from a desk can become an inference node alongside a new Mac Studio. The topology-aware scheduler simply gives it the work it can handle.
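The cost argument above reduces to a break-even calculation. Every number in this sketch is a hypothetical placeholder; substitute your own hardware quote, cloud bill, and electricity cost.

```python
from typing import Optional


def break_even_months(hardware_cost: float,
                      monthly_cloud_bill: float,
                      monthly_power_cost: float) -> Optional[float]:
    """Months until the cluster has paid for itself, or None if it never does."""
    monthly_saving = monthly_cloud_bill - monthly_power_cost
    if monthly_saving <= 0:
        return None  # the cloud is cheaper than the electricity alone
    return hardware_cost / monthly_saving


# Hypothetical figures, in US dollars.
months = break_even_months(hardware_cost=12_000,
                           monthly_cloud_bill=2_050,
                           monthly_power_cost=50)
print(f"Break-even after {months} months")
```

The calculation ignores staff time and hardware resale value, so treat the result as a first filter rather than a full business case.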
The honest caveats
This is powerful but not magic. A few things to weigh before you buy cables:
- Thunderbolt does not scale infinitely. There are no native Thunderbolt switches, so full-mesh RDMA is practical up to roughly 4 to 8 nodes. Larger clusters fall back to 10-gigabit Ethernet for some links, which is slower.
- It is built for inference, not training. exo serves models well; it is not the tool for training large models from scratch or for heavy fine-tuning runs.
- The software is still maturing. Expect occasional stability quirks and plan to spend time on the Discord and GitHub README. This is leading-edge tooling, not a turnkey appliance.
- Cabling and cooling get real at scale. Eight Macs and a web of Thunderbolt cables need power, airflow, and a plan.
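The first caveat follows from simple counting. Without a switch, a full mesh needs a direct cable between every pair of nodes, so cables and ports grow quickly. The sketch below is generic arithmetic; check the actual Thunderbolt 5 port count of your Macs before planning a layout.

```python
def full_mesh_cables(n_nodes: int) -> int:
    """Direct cables needed so every node links to every other node."""
    return n_nodes * (n_nodes - 1) // 2


def ports_needed_per_node(n_nodes: int) -> int:
    """Each node needs one port for every other node in the mesh."""
    return n_nodes - 1


def max_full_mesh_nodes(ports_per_node: int) -> int:
    """Largest full mesh that a given per-node port count allows."""
    return ports_per_node + 1


for n in (2, 4, 8):
    print(f"{n} nodes: {full_mesh_cables(n)} cables, "
          f"{ports_needed_per_node(n)} ports per node")
```

Going from four nodes to eight takes the cable count from 6 to 28, which is where the cabling and cooling caveat starts to bite.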
A sensible starting point
The right first step is small: two M4 Pro Mac minis connected over Thunderbolt, exo installed from the macOS app or source, RDMA enabled if your hardware supports it. That pair will comfortably run capable mixture-of-experts models at interactive speed and prove the workflow against your real prompts before you commit to a larger build.
The bigger story is the shift exo represents. Inference is moving to the edge — onto devices you control, in your own building, under your own keys. For teams that have spent two years watching private data flow to someone else's data center, that is a meaningful change. The cluster on the shelf is no longer a demo. In 2026, it is a deployment option.
Want help deciding whether local inference fits your workload, or designing the hardware and software around it? Noqta builds AI infrastructure for businesses across Tunisia, Saudi Arabia, and the wider MENA region — get in touch.
Sources: exo on GitHub · exolabs.net · 12 days of EXO benchmarks