MisoTTS 8B: Open-Source Voice AI That Beats ElevenLabs on Latency

The text-to-speech landscape just got a major shakeup. Miso Labs has released MisoTTS 8B, an open-weights, emotionally expressive voice model with a reported latency of 110ms, more than six times lower than ElevenLabs (700ms) and nearly three times lower than Sesame CSM (300ms). For developers building voice agents, accessibility tools, or real-time conversation interfaces, that makes self-hosted, low-latency speech a realistic option.

What Is MisoTTS 8B?

MisoTTS 8B is an 8-billion-parameter text-to-speech model from Miso Labs, released under a modified MIT license with open weights available on Hugging Face. Unlike traditional TTS systems that convert text to audio through a fixed voice, MisoTTS conditions its output on both text and audio context — meaning it can mirror the emotional tone of a conversation, not just its words.

The model ships as part of a product called "Miso One", which wraps the core model (formally called MisoTTS) and includes one-shot voice cloning from clips as short as 10 seconds.

Key headline numbers:

110ms latency (vs ElevenLabs 700ms, Sesame CSM 300ms)
8B parameters total (7.7B backbone + 300M depth decoder)
Open weights under modified MIT license
One-shot voice cloning from ~10-second audio clips

The Architecture Innovation: Residual Vector Quantization

Traditional TTS models represent audio through a single token vocabulary, which limits expressiveness. MisoTTS uses Residual Vector Quantization (RVQ) with 32 codebooks of 2048 entries each. Instead of a single token index per audio frame, the model emits a vector of 32 indices, giving each frame an addressable space of 2048 to the power of 32 possible combinations, or roughly 10 to the power of 106.
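The size of that space is easy to check by hand. The short calculation below uses only the two figures quoted above (32 codebooks, 2048 entries each); the variable names are illustrative.

```python
import math

NUM_CODEBOOKS = 32      # codebooks per frame, from the text above
CODEBOOK_SIZE = 2048    # entries per codebook, from the text above

# Each codebook index needs log2(2048) = 11 bits.
bits_per_index = int(math.log2(CODEBOOK_SIZE))
bits_per_frame = NUM_CODEBOOKS * bits_per_index

# Number of distinct index vectors a single frame can take.
combinations = CODEBOOK_SIZE ** NUM_CODEBOOKS
order_of_magnitude = NUM_CODEBOOKS * math.log10(CODEBOOK_SIZE)

print(f"bits per frame: {bits_per_frame}")
print(f"combinations:   about 10^{order_of_magnitude:.1f}")
```

Each frame therefore carries 352 bits of index information, and the exponent comes out just under 106.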

This matters because human speech nuance — the slight quiver in a voice when nervous, the warmth in a friendly greeting — lives precisely in this high-dimensional space that single-token approaches cannot capture.
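To make the "residual" part concrete, here is a toy sketch of RVQ encoding and decoding in plain Python. It is not the Mimi codec: the three tiny hand-written codebooks and the helper names (nearest, rvq_encode, rvq_decode) are assumptions for illustration, whereas real codebooks are learned and far larger. Each stage quantizes whatever error the previous stage left behind.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    indices, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(codebooks, indices):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy setup: 3 stages of 4 entries each (the article describes 32 x 2048).
codebooks = [
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],          # coarse
    [[-0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [0.5, 0.5]],          # finer
    [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]],  # finest
]

frame = [0.8, -1.6]
codes = rvq_encode(codebooks, frame)    # [2, 0, 2]
approx = rvq_decode(codebooks, codes)   # [0.75, -1.75]
print(codes, approx)
```

With only three coarse stages the reconstruction is off by 0.05 and 0.15 in the two dimensions. Adding more stages shrinks that error, which is the intuition behind using many codebooks per frame.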

Dual-Transformer Design

MisoTTS uses a two-stage architecture:

Backbone (7.7B parameters): A Llama 3.2-style autoregressive transformer that processes interleaved text and audio tokens. It predicts the first codebook index (k₁) and produces a hidden state encoding emotional context.

Depth Decoder (300M parameters): A smaller autoregressive transformer that takes the backbone's hidden state and generates the remaining 31 codebook indices (k₂ through k₃₂). Parameters are reused across codebook positions via a shared weight scheme, keeping the decoder compact.
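The control flow of the two stages can be sketched with stand-in functions. Nothing below is the real model: backbone_step and depth_decoder_step are hypothetical placeholders that return pseudo-random indices, and the 8-element hidden state is arbitrary. The sketch only shows the order of operations described above, with one backbone call per frame followed by 31 depth-decoder calls.

```python
import random

NUM_CODEBOOKS = 32
CODEBOOK_SIZE = 2048

def backbone_step(history):
    """Stand-in for the 7.7B backbone: returns (k1, hidden_state)."""
    rng = random.Random(len(history))      # deterministic placeholder "model"
    hidden = [rng.random() for _ in range(8)]
    k1 = rng.randrange(CODEBOOK_SIZE)
    return k1, hidden

def depth_decoder_step(hidden, previous_codes):
    """Stand-in for the 300M depth decoder: predicts the next codebook index."""
    seed = int(sum(hidden) * 1e6) + sum(previous_codes) + len(previous_codes)
    return random.Random(seed).randrange(CODEBOOK_SIZE)

def generate_frame(history):
    """One audio frame: the backbone gives k1, the depth decoder fills in k2..k32."""
    k1, hidden = backbone_step(history)
    codes = [k1]
    while len(codes) < NUM_CODEBOOKS:
        # Each index is conditioned on the hidden state and the indices so far.
        codes.append(depth_decoder_step(hidden, codes))
    return codes

def generate(num_frames):
    """Autoregressive loop over frames; each frame sees all earlier frames."""
    history = []
    for _ in range(num_frames):
        history.append(generate_frame(history))
    return history

frames = generate(num_frames=3)
print(len(frames), "frames of", len(frames[0]), "indices each")
```

The same depth_decoder_step function serves every codebook position, which loosely mirrors the shared-weight idea: one small network is reused 31 times per frame rather than keeping 31 separate heads.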

Audio is tokenized with the Mimi codec. Generated audio is watermarked by default using SilentCipher, which matters for responsible deployment.

Latency Comparison

Model	Latency	Open Weights	Voice Cloning
MisoTTS 8B	110ms	Yes (modified MIT)	Yes (1-shot)
Sesame CSM	300ms	Yes (Apache 2.0)	Limited
ElevenLabs	700ms	No	Yes
Kokoro TTS	~200ms	Yes	No

At 110ms, MisoTTS is close to the threshold where voice interaction feels genuinely real-time. ElevenLabs remains the benchmark for quality in many dimensions, but MisoTTS's open weights and latency profile make it compelling for use cases where self-hosting and speed matter.
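Latency figures depend heavily on hardware, so it is worth measuring on your own setup. The helper below is a minimal sketch: measure_latency_ms and fake_synthesize are hypothetical names, and the helper times a complete call rather than time-to-first-audio, which may not be the quantity the headline numbers refer to.

```python
import statistics
import time

def measure_latency_ms(synthesize, text, runs=5, warmup=1):
    """Return the median wall-clock time of synthesize(text) in milliseconds."""
    for _ in range(warmup):
        synthesize(text)                 # warm caches and lazy initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def fake_synthesize(text):
    """Placeholder for a real TTS call; sleeps for about 10 ms."""
    time.sleep(0.01)
    return b""

latency = measure_latency_ms(fake_synthesize, "Hello there.")
print(f"median latency: {latency:.1f} ms")
```

To time the real model, pass a function that wraps generator.generate. On a GPU, call torch.cuda.synchronize() inside that wrapper before it returns, so that queued kernels are counted in the measurement.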

Getting Started

Prerequisites

MisoTTS requires Python 3.10, a CUDA-capable GPU, and roughly 30–40GB of storage for the model download (weights + Mimi codec + watermarker).
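A quick preflight script can catch the most common setup problems before a large download starts. This is a sketch under stated assumptions: the preflight function is hypothetical, the 40 GB threshold is taken from the upper end of the storage estimate above, and the check only looks for an installed torch package rather than importing it.

```python
import importlib.util
import shutil
import sys

def preflight(path=".", required_free_gb=40, required_python=(3, 10)):
    """Collect basic environment facts before attempting the install."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info[:2] == required_python,
        "disk_ok": free_gb >= required_free_gb,
        "free_gb": round(free_gb, 1),
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }

report = preflight()
for name, value in report.items():
    print(f"{name}: {value}")
```

Once torch is installed, torch.cuda.is_available() confirms whether a CUDA device is visible.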

Install the uv package manager:

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone and set up the environment:

git clone https://github.com/MisoLabsAI/MisoTTS.git
cd MisoTTS
uv sync --python 3.10
source .venv/bin/activate

Prefer pip? Use the alternative setup:

python3.10 -m venv .venv
source .venv/bin/activate
pip install -e .

Basic Text-to-Speech

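import torch
import torchaudio
from generator import load_miso_8b

# Use the GPU when available; CPU inference works but is slow
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model, downloading the weights from Hugging Face on first use
generator = load_miso_8b(
    device=device,
    model_path_or_repo_id="MisoLabs/MisoTTS"
)

# Empty context: no reference audio to condition on
audio = generator.generate(
    text="Welcome to our product. How can I help you today?",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# Add a channel dimension and write a WAV file at the model's sample rate
torchaudio.save("output.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)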

The first run downloads the model weights from Hugging Face automatically, then writes the generated speech to output.wav.
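Because each call is capped by max_audio_length_ms, longer passages need to be split before synthesis. The helper below is one possible approach, not part of the MisoTTS repository: split_for_tts is a hypothetical name, and the speaking rate of 15 characters per second is a placeholder assumption that should be tuned against real output.

```python
import re

def split_for_tts(text, max_audio_length_ms=10_000, chars_per_second=15.0):
    """Greedily pack whole sentences into chunks that should fit the audio cap."""
    max_chars = int(max_audio_length_ms / 1000 * chars_per_second)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)       # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

chunks = split_for_tts(
    "First sentence here. Second one follows! A third?",
    max_audio_length_ms=2_000,
)
print(chunks)   # ['First sentence here.', 'Second one follows! A third?']
```

Each chunk can then be passed to generator.generate in turn. A single sentence longer than the budget is kept whole, so very long sentences still need manual handling.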

One-Shot Voice Cloning

This is where MisoTTS really shines. Provide a reference audio clip of about 10 seconds, together with its transcript, and the model will clone that voice's tone, pacing, and emotional character:

import torchaudio
from generator import Segment, load_miso_8b

generator = load_miso_8b(device="cuda")

# Load the reference audio (the voice to clone)
prompt_audio, sample_rate = torchaudio.load("reference_voice.wav")

# Downmix to mono (this also handles stereo files), then match the model's sample rate
prompt_audio = torchaudio.functional.resample(
    prompt_audio.mean(dim=0),
    orig_freq=sample_rate,
    new_freq=generator.sample_rate,
)

# Build context from the reference segment; text should be the transcript
# of what is actually said in reference_voice.wav
context = [
    Segment(
        speaker=0,
        text="Hello, this is the reference transcript.",
        audio=prompt_audio,
    )
]

# Generate speech in the cloned voice (same speaker id as the reference)
audio = generator.generate(
    text="Your cloned voice says this sentence now.",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("cloned_output.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)

The context list acts as emotional and tonal grounding — the model conditions its output on the provided audio, not just the text.
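Since cloning is built around clips of roughly 10 seconds, it can help to check a reference file's length before loading the model. The sketch below uses only the standard-library wave module, so it applies to uncompressed PCM WAV files. The function names and the 8 to 30 second window are assumptions for illustration, not limits documented by Miso Labs.

```python
import wave

def clip_duration_seconds(path):
    """Duration of an uncompressed PCM WAV file, read from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_usable_reference(path, min_seconds=8.0, max_seconds=30.0):
    """True if the clip length falls inside the assumed acceptable window."""
    return min_seconds <= clip_duration_seconds(path) <= max_seconds
```

Calling is_usable_reference("reference_voice.wav") before building the context gives an early, cheap failure for clips that are far too short.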

Hardware Requirements

Precision	Model Size	VRAM Needed	Example GPUs
bfloat16 / fp16	~16 GB	24 GB	RTX 3090, RTX 4090, A5000, L4
float32	~33 GB	40 GB+	A100 40GB, A6000, H100

CPU inference is supported but slow, requiring approximately 20 GB of RAM in bfloat16 or 40 GB in float32. For production voice agents, a GPU with at least 24 GB VRAM is the practical minimum.
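The model-size column follows from simple arithmetic: parameter count times bytes per parameter. The sketch below covers the weights only, and the helper name weight_size_gb is illustrative.

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}

def weight_size_gb(num_params, dtype):
    """Memory taken by the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

TOTAL_PARAMS = 7.7e9 + 0.3e9   # backbone + depth decoder, from the figures above

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_size_gb(TOTAL_PARAMS, dtype):.0f} GB of weights")
```

This gives about 16 GB in bfloat16 and 32 GB in float32, close to the table's figures. The extra headroom in the VRAM column presumably covers activations, caches, and the codec, none of which this estimate accounts for.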

Use Cases for Developers

Voice Agents and Customer Support: At 110ms latency, MisoTTS can power truly responsive voice bots without the perceptible delay that breaks immersion in phone-style conversations.

Accessibility Tools: Screen readers and assistive technology benefit enormously from natural, emotionally varied speech — rather than the robotic monotone that accessibility users have tolerated for decades.

Content Creation: Podcast production, audiobook narration, and e-learning narration all benefit from one-shot voice cloning — allowing creators to generate consistent audio without re-recording sessions.

Privacy-First Deployments: Because MisoTTS runs fully on-premises, it's viable for industries with strict data residency requirements (healthcare, finance, legal) where sending audio to a third-party API is not acceptable.

MENA-Region Applications: While the current release is English-only, the open-weight model is a strong foundation for fine-tuning on Arabic, French, and other regional languages — a promising path for developers building for North African and Gulf markets.

Current Limitations

MisoTTS 8B is a strong first release, but several limitations are worth understanding before building on it:

English only. The current public release supports English only, and multilingual support has not been announced for the near term.

Half-duplex only. The model generates complete audio turns but cannot overlap with incoming audio. True full-duplex conversation (where both parties can speak simultaneously) is flagged as future work by Miso Labs.

Single-turn generation. Each inference call handles one conversation turn. There is no built-in turn-taking logic — that responsibility falls to the application layer.
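In practice, the application has to keep the conversation history and hand it back as context on every call. The sketch below shows one way to do that with a bounded window. Turn is a stand-in for the repository's Segment, and Conversation, respond, and the max_turns default are all hypothetical names and values chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Turn:
    """Stand-in for the repository's Segment: who spoke, what, and the audio."""
    speaker: int
    text: str
    audio: object   # a tensor in the real pipeline

class Conversation:
    """Keeps the most recent turns to pass as context on the next call."""

    def __init__(self, max_turns=4):
        self.turns = deque(maxlen=max_turns)   # oldest turns drop off automatically

    def add(self, speaker, text, audio):
        self.turns.append(Turn(speaker, text, audio))

    def context(self):
        return list(self.turns)

def respond(conversation, synthesize, speaker, text):
    """Generate one turn with the current context, then record it."""
    audio = synthesize(text=text, speaker=speaker, context=conversation.context())
    conversation.add(speaker, text, audio)
    return audio
```

Here synthesize would wrap generator.generate, and incoming user audio would be added with conversation.add once it has been transcribed. Capping the window keeps the prompt length, and therefore latency, bounded as a conversation grows.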

API not yet live. Miso Labs has announced API access is coming soon, but as of June 2026, self-hosting from the open weights is the only option.

Responsible Deployment

Audio is watermarked by default using SilentCipher, an imperceptible steganographic watermark that survives common audio transformations. This is a meaningful responsible-AI measure, especially given the one-shot voice cloning capability.

Developers building voice cloning features should implement their own consent mechanisms on top of this — watermarking alone does not prevent misuse, but it does create a technical record that audio was AI-generated.
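A consent mechanism can start as a simple gate in front of the cloning call. The sketch below is a minimal in-memory illustration, not a compliance solution. ConsentRecord, ConsentRegistry, and clone_voice are hypothetical names, and a real deployment would need persistent storage, identity verification, and revocation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class ConsentRecord:
    speaker_id: str
    granted_at: datetime
    expires_at: datetime

class ConsentRegistry:
    """In-memory store of who has agreed to have their voice cloned."""

    def __init__(self):
        self._records = {}

    def grant(self, record):
        self._records[record.speaker_id] = record

    def is_allowed(self, speaker_id, now=None):
        now = now or datetime.now(timezone.utc)
        record = self._records.get(speaker_id)
        return record is not None and record.granted_at <= now < record.expires_at

def clone_voice(registry, speaker_id, synthesize, text):
    """Refuse to synthesize unless valid consent is on file for this speaker."""
    if not registry.is_allowed(speaker_id):
        raise PermissionError(f"no valid consent on file for {speaker_id!r}")
    return synthesize(text)

registry = ConsentRegistry()
now = datetime.now(timezone.utc)
registry.grant(ConsentRecord("alice", now, now + timedelta(days=30)))
print(registry.is_allowed("alice"), registry.is_allowed("bob"))   # True False
```

Putting the check in the one function that reaches the model means no code path can clone a voice without passing through it.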

What This Means for the Voice AI Ecosystem

MisoTTS 8B follows a pattern we've seen play out in image generation, coding assistants, and language models: a commercially compelling capability (emotionally expressive TTS at low latency) moves from proprietary-only to open-source, fundamentally changing who can build with it.

ElevenLabs built a strong business on voice cloning and quality. MisoTTS does not yet match ElevenLabs across all quality dimensions, but for latency-sensitive applications and privacy-first deployments it is already a competitive option. With open weights, the remaining gap is likely to narrow as the community fine-tunes and improves the model.

For teams building voice agents today — particularly those targeting real-time conversation interfaces — MisoTTS 8B is worth evaluating seriously. The combination of open weights, 110ms latency, and one-shot voice cloning in a single model is genuinely novel.

Getting the Weights

The MisoTTS 8B weights are available at MisoLabs/MisoTTS on Hugging Face. The GitHub repository at MisoLabsAI/MisoTTS contains inference code, examples, and setup instructions. API access is forthcoming via the Miso Labs platform.

The open-source voice AI race is accelerating. If you've been waiting for an emotionally expressive, self-hostable TTS model with voice cloning, the wait is over.