Google DeepMind Launches Gemini 3.1 Flash TTS With 200+ Audio Tags

On April 15, 2026, Google DeepMind launched Gemini 3.1 Flash TTS, its most expressive text-to-speech model to date. The model gives developers fine-grained control over vocal style, pacing, and emotion through a new system of inline audio tags. It is available in preview through the Gemini API, Google AI Studio, and Vertex AI, and to Workspace users through Google Vids.

Key Highlights

More than 200 audio tags let creators direct delivery with simple bracketed commands like [whispered], [excited], or [shouting]
Native multi-speaker dialogue for podcasts, audiobooks, and conversational agents
Support for more than 70 languages with localized accent control, including American "Valley" and "Southern," plus British "Brixton" and "RP"
Elo score of 1,211 on the Artificial Analysis TTS leaderboard, placing it in the leaderboard's most attractive quadrant for quality relative to cost
Every output carries a SynthID watermark so that AI-generated audio can be detected
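
Because the bracketed tags are plain text inside the transcript, a script can be checked before it is sent for synthesis. The following is a minimal Python sketch of such a check and is not part of Google's tooling. The helper names extract_tags and unknown_tags are hypothetical, and the allow-list holds only the three tags named in this article, since the full set of 200+ tags is not reproduced here.

```python
import re

# Only the three tags named in the article. The full catalog is not listed
# here, so this allow-list is an illustrative assumption, not the official set.
KNOWN_TAGS = frozenset({"whispered", "excited", "shouting"})

# Matches a bracketed tag such as [whispered]; assumes tags are never nested.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(transcript: str) -> list[str]:
    """Return bracketed tags in order of appearance, stripped and lowercased."""
    return [tag.strip().lower() for tag in TAG_PATTERN.findall(transcript)]


def unknown_tags(transcript: str, known: frozenset[str] = KNOWN_TAGS) -> list[str]:
    """Return tags that are missing from the allow-list, preserving order."""
    return [tag for tag in extract_tags(transcript) if tag not in known]


if __name__ == "__main__":
    script = "[whispered] Don't wake the baby. [excited] We won! [mumbling] Maybe."
    print(extract_tags(script))
    print(unknown_tags(script))
```

A check like this would flag [mumbling] in the sample script only because it is absent from the three-item allow-list, not because the model is known to reject it.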

Details

The model identifier on the Gemini API is gemini-3.1-flash-tts-preview, and the model produces audio output only. Unlike earlier TTS systems that required complex markup, Gemini 3.1 Flash TTS interprets natural-language direction placed directly in the transcript. Writers can change tone mid-sentence, assign regional accents, and control pacing without switching to SSML or proprietary markup.
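
As a rough illustration of how a tagged transcript might be packaged for the model, the Python sketch below builds a request body as a plain dictionary without sending it. The field names (responseModalities, speechConfig, voiceConfig, prebuiltVoiceConfig, voiceName) follow the REST shape documented for earlier Gemini TTS preview models and are assumed, not confirmed, to carry over to gemini-3.1-flash-tts-preview. build_tts_request is a hypothetical helper.

```python
import json

MODEL_ID = "gemini-3.1-flash-tts-preview"  # identifier given in the article


def build_tts_request(transcript: str, voice_name: str) -> dict:
    """Build a generateContent-style body for single-speaker speech.

    Assumption: field names mirror those of earlier Gemini TTS previews.
    """
    if not transcript.strip():
        raise ValueError("transcript must not be empty")
    return {
        # Inline audio tags stay in the text exactly as the writer typed them.
        "contents": [{"parts": [{"text": transcript}]}],
        "generationConfig": {
            # The model produces audio only, so AUDIO is the sole modality.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


if __name__ == "__main__":
    body = build_tts_request(
        "[whispered] The results are in. [excited] We passed!", "Kore"
    )
    print(f"model: {MODEL_ID}")
    print(json.dumps(body, indent=2))
```

Keeping the body as a dictionary makes it easy to inspect or log the exact transcript, tags included, before any audio is generated.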

Multi-speaker scenes are a first-class feature. Developers can define named voices such as "Puck (Upbeat)" and "Kore (Firm)," then script back-and-forth dialogue that the model renders with consistent character voices and natural turn-taking.
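
One way to picture a multi-speaker script is as a list of turns that is flattened into a labeled transcript, alongside a mapping from speaker to voice. The Python sketch below does only that and makes no API call. The "Speaker: line" transcript layout and the multiSpeakerVoiceConfig and speakerVoiceConfigs field names are taken from earlier Gemini TTS previews and are assumptions for this model. The speaker names Host and Guest are invented for illustration, and build_dialogue is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # label used in the transcript, e.g. "Host"
    text: str     # spoken line, which may contain inline audio tags


def build_dialogue(turns: list[Turn], voices: dict[str, str]) -> tuple[str, dict]:
    """Return (transcript, speech_config) for a scripted dialogue.

    Assumption: the config field names mirror earlier Gemini TTS previews.
    """
    # Every speaker in the script needs a voice, or the cast is incomplete.
    missing = sorted({turn.speaker for turn in turns} - set(voices))
    if missing:
        raise ValueError(f"no voice assigned for: {', '.join(missing)}")

    transcript = "\n".join(f"{turn.speaker}: {turn.text}" for turn in turns)
    speech_config = {
        "multiSpeakerVoiceConfig": {
            "speakerVoiceConfigs": [
                {
                    "speaker": name,
                    "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}},
                }
                for name, voice in voices.items()
            ]
        }
    }
    return transcript, speech_config


if __name__ == "__main__":
    script = [
        Turn("Host", "[excited] Welcome back to the show!"),
        Turn("Guest", "[whispered] Glad to be here."),
    ]
    text, config = build_dialogue(script, {"Host": "Puck", "Guest": "Kore"})
    print(text)
    print(config)
```

Validating the cast up front means a missing voice assignment surfaces as an error in the script tooling instead of as an unexpected voice in the rendered audio.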

Impact

For creators in education, accessibility, and content production, Gemini 3.1 Flash TTS collapses the gap between a written script and a finished voice track. Early partners including StyleUAI, HeyGen, and Invideo AI praised the model for giving them the kind of precise, expressive delivery that previously required a voice actor and a recording session.

Enterprise adopters also gain a provenance trail: SynthID watermarking lets platforms detect AI-generated audio downstream, a feature Google positions as a guardrail against misinformation and deepfakes in regulated industries.

Background

Gemini 3.1 Flash TTS sits alongside the previously released Gemini 3.1 Flash Live, which handles real-time conversational voice. Where Flash Live is optimized for low-latency dialogue, Flash TTS focuses on production-grade audio where creators need to iterate on tone, performance, and scene direction. The two models share the Gemini 3.1 audio backbone but target different workloads.

What's Next

Google says an expanded voice catalog, additional language coverage, and general-availability pricing are expected in the coming months. Workspace users can already try the model through Google Vids, and developers building voice agents, audiobooks, or e-learning content can request preview access through Google AI Studio today.

Source: Google Blog