Gemini 3.5 Pro is Google's most capable model in the 3.x generation — a tier above Flash designed for tasks that demand massive context, extended reasoning, and frontier multimodal performance. It currently runs in Vertex AI enterprise preview, with a full public launch through Google AI Studio expected imminently.
This guide covers what Gemini 3.5 Pro delivers, how to access it today, when to choose it over Flash, and practical code to get started.
Specs at a Glance
| Capability | Gemini 3.5 Pro | Gemini 3.5 Flash |
|---|---|---|
| Context window | 2M tokens | 1M tokens |
| Output limit | 64K tokens | 32K tokens |
| Deep Think mode | Yes | Yes |
| Multimodal | Text, image, audio, video | Text, image, audio, video |
| Best for | Long-context, complex reasoning | Agentic loops, high throughput |
| Availability | Vertex AI preview | GA (AI Studio + API) |
A 2M token context window translates to roughly 1,500 pages of text or 30,000 lines of code in a single API call — the largest production context window of any frontier model as of mid-2026, double that of Gemini 3.5 Flash.
Access: Vertex AI Enterprise Preview
Gemini 3.5 Pro is live for enterprise customers on Vertex AI. To request access:
- Open Vertex AI Model Garden in the Google Cloud Console
- Search for gemini-3.5-pro
- Request allowlist access through your account team, or contact your CSM if you are a Gemini Enterprise customer
Once approved, the model ID to reference is gemini-3.5-pro-preview-06.
For individual developers, monitor aistudio.google.com — Google typically adds models to the picker without a formal announcement. You can also poll programmatically:
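```python
import google.generativeai as genai

# Assumes an API key is already configured (for example via GOOGLE_API_KEY).
# Print every listed model whose name mentions both "3.5" and "pro".
for m in genai.list_models():
    if "3.5" in m.name and "pro" in m.name.lower():
        print(m.name)
```

Do not hardcode gemini-3.5-pro in production yet. Use the preview suffix (gemini-3.5-pro-preview-06) until GA is confirmed.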
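One way to avoid a hardcoded model ID is to read it from configuration, so that moving from the preview ID to a GA ID needs no code change. The sketch below is only an illustration: the `GEMINI_MODEL_ID` environment variable and the `resolve_model_id` helper are hypothetical names, not part of any Google SDK.

```python
import os

# Preview model ID quoted in this guide; used when no override is set.
DEFAULT_MODEL_ID = "gemini-3.5-pro-preview-06"


def resolve_model_id(env=None):
    """Return the model ID from GEMINI_MODEL_ID (hypothetical variable), else the preview default."""
    env = os.environ if env is None else env
    return env.get("GEMINI_MODEL_ID", DEFAULT_MODEL_ID)
```

Passing the result of `resolve_model_id()` as the `model` argument keeps the eventual switch to a GA ID a deployment setting rather than a code edit.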
The 2M Context Window in Practice
The step from 1M to 2M tokens unlocks a different category of workloads:
- Whole-codebase analysis — Feed an entire large repository for security audits, refactoring suggestions, or onboarding documentation generation
- Multi-document synthesis — Process hundreds of PDFs, legal contracts, or research papers in a single pass
- Extended agent sessions — Conversations spanning hours without context truncation or state compression
- Full regulatory filings — Analyze complete SEC filings or compliance documents without chunking
Architecture improvements in 3.5 Pro address quality degradation that affected earlier 3.1 Pro models at high context utilization. Quality stays consistent across the full 2M window.
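Before sending a large corpus, it can help to check roughly whether it fits in the window. The sketch below assumes about 4 characters per token, which is a common rule of thumb and not a figure published for this model. The `reserve_tokens` parameter is a hypothetical headroom setting. For exact numbers, use the SDK's token-counting call instead.

```python
CONTEXT_LIMIT_TOKENS = 2_000_000  # 2M window from the specs table
CHARS_PER_TOKEN = 4               # assumed heuristic, not an official ratio


def estimate_tokens(text):
    """Rough token estimate from character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, reserve_tokens=0):
    """True if the combined estimate plus the reserved headroom fits in the window."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserve_tokens <= CONTEXT_LIMIT_TOKENS
```

For example, `fits_in_context(file_contents, reserve_tokens=100_000)` would leave headroom for reasoning tokens, which the guide says count against the context budget.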
Deep Think Mode
Deep Think is a reasoning mode that trades latency for accuracy on complex problems. The model runs multiple internal analysis paths before producing its final answer — the chain-of-thought stays hidden from the output.
Enable it via the thinkingConfig parameter (thinking_config in the Python SDK).
Python (Google Gen AI SDK):
```python
from google import genai
from google.genai.types import GenerateContentConfig, ThinkingConfig

# The client reads project, location, and backend settings from the environment.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.5-pro-preview-06",
    contents="Analyze the security implications of JWT tokens with symmetric keys in a multi-tenant SaaS application.",
    config=GenerateContentConfig(
        thinking_config=ThinkingConfig(
            thinking_level="high"
        )
    ),
)
print(response.text)
```

TypeScript (@google/genai SDK):
```typescript
import { GoogleGenAI } from "@google/genai";

// Route requests through Vertex AI using the project from the environment.
const client = new GoogleGenAI({
  vertexai: true,
  project: process.env.GOOGLE_CLOUD_PROJECT!,
  location: "global",
});

const response = await client.models.generateContent({
  model: "gemini-3.5-pro-preview-06",
  contents: "Refactor this codebase to use dependency injection.",
  config: {
    thinkingConfig: {
      thinkingLevel: "medium",
    },
  },
});
console.log(response.text);
```

Available thinking levels are minimal, low, medium, and high. Reasoning tokens count against your context budget and are billed at output token rates. Avoid Deep Think for real-time voice agents or interactive coding flows, where the added latency makes the experience noticeably slower.
Vertex AI Environment Setup
Set these environment variables before calling the API:
```shell
export GOOGLE_CLOUD_PROJECT=your-project-id
export GOOGLE_CLOUD_LOCATION=global
export GOOGLE_GENAI_USE_ENTERPRISE=True
```

Install the Python SDK:

```shell
pip install google-genai
```

Install the TypeScript SDK:

```shell
npm install @google/genai
```

An OpenAI-compatible endpoint is also available. For teams migrating from GPT-based infrastructure, swap the base URL and model name; most existing code works without further changes.
Flash vs Pro: The Decision Framework
Use Gemini 3.5 Flash when:
- Running agentic loops with many short calls
- Building RAG pipelines or search-augmented applications
- Response latency matters (sub-second requirements)
- Cost is a primary constraint — Flash costs roughly 8 to 10 times less than Pro
- Context fits comfortably under 500K tokens
Use Gemini 3.5 Pro when:
- Workloads regularly exceed 1M tokens of context
- You need complex multi-step reasoning with Deep Think
- Hallucination cost is high (contracts, medical analysis, legal domains)
- Tasks involve whole-codebase or multi-document analysis
- Frontier multimodal performance across all modalities is required
The key diagnostic: if your application is hitting 80–90% of Flash's 1M context limit, evaluate Pro. If the bottleneck is throughput or cost, stay on Flash.
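The rules of thumb above can be sketched as a small helper. This is one possible encoding, not an official selection algorithm. The function name, its parameters, and the precedence are assumptions: an overflow beyond Flash's window wins first, then latency or cost constraints, then the 80% utilization check.

```python
FLASH_CONTEXT_LIMIT = 1_000_000  # Flash window from the specs table


def choose_model(peak_context_tokens,
                 needs_sub_second_latency=False,
                 cost_constrained=False):
    """Return "pro" or "flash" following the guide's rules of thumb."""
    utilization = peak_context_tokens / FLASH_CONTEXT_LIMIT
    if utilization > 1.0:
        return "pro"    # the workload does not fit in Flash at all
    if needs_sub_second_latency or cost_constrained:
        return "flash"  # throughput or cost is the bottleneck
    if utilization >= 0.8:
        return "pro"    # nearing Flash's ceiling, so evaluate Pro
    return "flash"
```

A "pro" result here means "worth evaluating", since the guide frames the 80–90% mark as a trigger for evaluation rather than an automatic switch.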
Benchmark Performance
Gemini 3.5 Pro scores 44.4% on Humanity's Last Exam, compared to 40.2% for Flash. On SWE-Bench, the model targets performance in the GPT-5.5 range (around 58.6%). For tasks requiring deep analysis across massive context windows, Pro consistently outperforms Flash on quality metrics.
Pricing (Preview Estimates)
Official pricing will be published at GA. The estimates below are based on preview data and the historical Flash-to-Pro ratio in Google's pricing structure:
| Tier | Input | Output |
|---|---|---|
| Standard context (under 200K tokens) | ~$12–15 per million | ~$36–45 per million |
| Long context (above 200K tokens) | ~$15–18 per million | ~$45–54 per million |
| Cached input | ~$1.20–1.80 per million | — |
Context caching delivers up to 90% savings on repeated prompts — critical for production deployments that reuse large system prompts or document contexts across multiple requests.
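To see what caching means for a budget, the table can be turned into a rough per-request calculator. The rates below are midpoints of the estimated ranges, so they are assumptions layered on estimates, not published prices. Choosing the tier by total prompt size (fresh plus cached tokens) is also an assumption. Reasoning tokens are added to output tokens because the guide says they are billed at output rates.

```python
# Midpoints of the preview estimate ranges (USD per 1M tokens); not published prices.
RATES = {
    "standard": {"input": 13.50, "output": 40.50},
    "long": {"input": 16.50, "output": 49.50},
}
CACHED_INPUT_RATE = 1.50
LONG_CONTEXT_THRESHOLD = 200_000  # tokens


def estimate_cost(fresh_input_tokens, output_tokens,
                  cached_input_tokens=0, reasoning_tokens=0):
    """Estimate the USD cost of one request under the assumed rates."""
    prompt_tokens = fresh_input_tokens + cached_input_tokens
    tier = "long" if prompt_tokens > LONG_CONTEXT_THRESHOLD else "standard"
    rates = RATES[tier]
    cost = (
        fresh_input_tokens * rates["input"]
        + cached_input_tokens * CACHED_INPUT_RATE
        # Reasoning tokens are billed at the output rate.
        + (output_tokens + reasoning_tokens) * rates["output"]
    ) / 1_000_000
    return round(cost, 4)
```

Under these assumed rates, a request with a 1M-token prompt and 10K output tokens comes to roughly $17 with fresh input and roughly $2 when the whole prompt is served from cache.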
Getting Started Today
If you are already using Gemini 3.5 Flash for production workloads, start by auditing your token usage. Run your highest-context tasks and measure where they fall relative to the 1M window. If you are regularly hitting 700K–900K tokens, Pro is the natural next step.
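A token audit of this kind can be as simple as summarizing logged prompt sizes. The sketch below assumes you already have a list of per-request prompt token counts, for example from the usage metadata returned with each response. The function name and the report fields are hypothetical.

```python
FLASH_WINDOW = 1_000_000    # Flash context window from the specs table
NEAR_LIMIT_TOKENS = 700_000  # lower edge of the 700K–900K band in the guide


def audit_token_usage(prompt_token_counts):
    """Summarize how a sample of requests sits relative to Flash's window."""
    counts = list(prompt_token_counts)
    if not counts:
        raise ValueError("need at least one request to audit")
    peak = max(counts)
    near_limit = sum(1 for c in counts if c >= NEAR_LIMIT_TOKENS)
    return {
        "requests": len(counts),
        "peak_tokens": peak,
        "peak_utilization": peak / FLASH_WINDOW,
        "share_near_limit": near_limit / len(counts),
    }
```

A high `share_near_limit` suggests the workload regularly approaches Flash's ceiling, while a single outlier peak may be better handled by trimming that one request.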
For teams evaluating from scratch, the Gemini 3.5 Flash Developer Guide is the right starting point — Flash covers the majority of use cases at a fraction of the cost. Step up to Pro when the context math demands it.
Conclusion
Gemini 3.5 Pro fills the niche between Flash's speed and the edge-case demands of frontier research and enterprise workloads. The 2M context window and Deep Think mode are not features for every application — but for whole-codebase analysis, complex document synthesis, or reasoning-heavy workflows where accuracy matters more than latency, they justify the step up from Flash.
Enterprise access is available now on Vertex AI. A public Gemini API launch is expected within weeks. Set up your Vertex AI project today and begin testing your high-context workloads before the GA wave arrives.