For years, adding AI to a web app meant one thing: send the user's data to a cloud API, pay per token, and hope the latency gods are kind. Chrome's built-in AI changes that equation. With Gemini Nano shipping inside the browser itself, the Prompt API lets you run language model inference entirely on the user's device — no API key, no per-request cost, no data leaving the machine.

In this tutorial, you will build a complete on-device AI feature set: availability detection, session management, streaming responses, structured JSON output, and a production-grade hybrid pattern that falls back to a cloud model when the local one is not available. Everything runs in plain JavaScript, so you can drop it into any framework — Next.js, Nuxt, SvelteKit, or vanilla HTML.

Prerequisites

Before starting, ensure you have:

Chrome 138 or later on desktop (Windows 10/11, macOS 13+, Linux, or ChromeOS). The Prompt API graduated from origin trial and is available in stable Chrome for extensions and, progressively, for websites.
At least 22 GB of free disk space on the volume that contains your Chrome profile — Gemini Nano is downloaded once and shared across all sites.
A GPU with more than 4 GB of VRAM (integrated GPUs on recent machines qualify).
Basic knowledge of JavaScript promises and async/await.
Node.js 20+ if you want to run the demo project locally with a dev server.

The Prompt API works on desktop Chrome only for now. Mobile support is on the roadmap, which is exactly why the hybrid fallback pattern in Step 7 matters for production apps.

What You'll Build

A small "smart notes" widget that runs entirely in the browser:

Detects whether Gemini Nano is available on the visitor's machine
Downloads the model with a progress bar on first use
Summarizes and rewrites notes with streaming output
Extracts structured to-do items from free text as validated JSON
Falls back to a server-side cloud model when on-device AI is unavailable

Step 1: Understand the Built-in AI Architecture

Chrome exposes several task-specific APIs alongside the general-purpose Prompt API. Summarizer, Writer, and Rewriter run on Gemini Nano, while Translator and Language Detector rely on smaller expert models. The task APIs are simpler and more optimized, but the Prompt API is the one that gives you full conversational control, custom system prompts, and structured output. That is what we focus on here.

The key global is the LanguageModel interface. The lifecycle is always the same:

// 1. Check availability
const availability = await LanguageModel.availability();
// "unavailable" | "downloadable" | "downloading" | "available"
 
// 2. Create a session (triggers download if needed)
const session = await LanguageModel.create();
 
// 3. Prompt it
const result = await session.prompt("Explain HTTP caching in one sentence.");
console.log(result);
 
// 4. Free resources when done
session.destroy();

That is the entire mental model. Everything else in this tutorial is refinement of these four calls.
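
One refinement worth making early is to guarantee that step 4 runs even when step 3 throws. The sketch below is an illustration, not part of the Prompt API: withSession is a hypothetical helper name, and the session factory is passed in as an argument so the same code can run against LanguageModel.create in Chrome or against a stub elsewhere.

```javascript
// Hypothetical helper: runs `work` with a fresh session and always frees it.
// `createSession` is any function that resolves to an object with prompt()
// and destroy(); in Chrome that could be () => LanguageModel.create().
async function withSession(createSession, work) {
  const session = await createSession();
  try {
    return await work(session);
  } finally {
    // Runs on success and on error, so the session is never leaked.
    session.destroy();
  }
}

// Possible usage in a supporting browser:
//   const answer = await withSession(
//     () => LanguageModel.create(),
//     (session) => session.prompt("Explain HTTP caching in one sentence.")
//   );
```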

Step 2: Feature Detection Done Right

Never assume the API exists. In older browsers, mobile Chrome, Firefox, and Safari, the LanguageModel global is undefined, so referencing it directly throws a ReferenceError. Wrap detection in a helper:

// lib/ai-detect.js
export async function detectOnDeviceAI() {
  // The interface itself may not exist
  if (!("LanguageModel" in self)) {
    return { supported: false, reason: "api-missing" };
  }
 
  const availability = await LanguageModel.availability();
 
  switch (availability) {
    case "available":
      return { supported: true, ready: true };
    case "downloadable":
    case "downloading":
      return { supported: true, ready: false, needsDownload: true };
    default:
      // "unavailable": hardware or policy blocks the model
      return { supported: false, reason: "hardware-or-policy" };
  }
}

The distinction between the three positive states matters for UX:

available — the model is on disk, sessions create instantly
downloadable — supported hardware, but the user has not triggered the download yet
downloading — another tab or site already started fetching the model

Calling LanguageModel.create() when availability is downloadable starts a multi-gigabyte download. Always get an explicit user gesture (a button click) before triggering it, and show progress. Silently downloading gigabytes on page load is a fast way to anger users on metered connections.
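
One way to keep that rule honest is to translate each availability string into an explicit UI step, so that nothing in your code path calls create() without a click. The function below is a minimal sketch; nextUiAction and the action names it returns are made up for illustration and are not part of the Prompt API.

```javascript
// Maps an availability string to the UI step to take next.
// The returned action names are illustrative assumptions.
function nextUiAction(availability) {
  switch (availability) {
    case "available":
      // Model is on disk: safe to create a session right away.
      return "run-on-device";
    case "downloadable":
      // Needs a multi-gigabyte download: ask first, via a button.
      return "offer-download-button";
    case "downloading":
      // Someone already started the download: show progress instead.
      return "show-progress";
    default:
      // "unavailable", undefined, or any unknown value.
      return "use-cloud-fallback";
  }
}
```

Only the handler attached to the download button would then call LanguageModel.create(), which keeps the download behind a user gesture.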

Step 3: Create a Session with Download Progress

The create() call accepts a monitor callback that exposes download progress events. Wire it to a progress bar:

// lib/ai-session.js
export async function createSession(onProgress) {
  const session = await LanguageModel.create({
    monitor(m) {
      m.addEventListener("downloadprogress", (e) => {
        // e.loaded is a fraction between 0 and 1
        onProgress?.(Math.round(e.loaded * 100));
      });
    },
    initialPrompts: [
      {
        role: "system",
        content:
          "You are a concise writing assistant embedded in a notes app. " +
          "Answer in plain text without markdown headers.",
      },
    ],
  });
 
  return session;
}

Two details worth noting:

System prompts go in initialPrompts. The first entry with role system defines persistent behavior for the whole session. You can also seed few-shot examples by alternating user and assistant roles after it — the model treats them as prior conversation turns.

Sessions are stateful. Each prompt() call appends to the session's context window. For a notes widget this is what you want; for stateless operations, clone a fresh session from a base one instead (Step 6 covers this).
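
To make the few-shot idea concrete, here is a small sketch that assembles an initialPrompts array from a system prompt and a list of example pairs. buildInitialPrompts and the input/output field names are assumptions chosen for this example; only the role and content keys come from the API shape shown above.

```javascript
// Builds an initialPrompts array: one system entry, then each example
// expanded into a user turn followed by an assistant turn.
function buildInitialPrompts(systemText, examples = []) {
  const prompts = [{ role: "system", content: systemText }];
  for (const { input, output } of examples) {
    prompts.push({ role: "user", content: input });
    prompts.push({ role: "assistant", content: output });
  }
  return prompts;
}

const initialPrompts = buildInitialPrompts(
  "Rewrite notes as one short imperative sentence.",
  [
    { input: "need to maybe email Dana about invoice", output: "Email Dana about the invoice." },
    { input: "groceries at some point", output: "Buy groceries." },
  ]
);
// In Chrome, pass the result as LanguageModel.create({ initialPrompts }).
```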

Step 4: Prompt with Streaming Output

A full response can take several seconds on modest hardware. Streaming makes the difference between an app that feels broken and one that feels alive. Use promptStreaming(), which returns a ReadableStream:

// lib/ai-actions.js
export async function summarizeStreaming(session, noteText, onChunk) {
  const stream = session.promptStreaming(
    `Summarize the following note in 2 sentences maximum:\n\n${noteText}`
  );
 
  let fullText = "";
  for await (const chunk of stream) {
    fullText += chunk;
    onChunk(fullText); // update the UI incrementally
  }
  return fullText;
}

Hooking it to the DOM is a one-liner in any framework. Vanilla example:

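// Element IDs are placeholders; match them to your own markup.
// `session` is the object returned by createSession() in Step 3.
const noteArea = document.querySelector("#note");
const summarizeBtn = document.querySelector("#summarize");
const output = document.querySelector("#summary");
summarizeBtn.addEventListener("click", async () => {
  output.textContent = "";
  await summarizeStreaming(session, noteArea.value, (text) => {
    output.textContent = text;
  });
});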

If the user navigates away or clicks cancel, abort the stream with an AbortController — pass the signal in the options bag:

const controller = new AbortController();
const stream = session.promptStreaming(promptText, {
  signal: controller.signal,
});
// later, on cancel:
controller.abort();
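
Aborting makes the stream reject, so the consuming loop needs to tell a user cancellation apart from a real failure. The sketch below combines streaming and cancellation in one helper. It assumes the rejection carries the name "AbortError", which is the usual convention for aborted web APIs; startCancellableStream is a hypothetical name.

```javascript
// Starts a streaming prompt and returns a handle with:
//   done   - promise resolving to { text, cancelled }
//   cancel - function that aborts the stream
function startCancellableStream(session, promptText, onChunk) {
  const controller = new AbortController();

  const done = (async () => {
    let fullText = "";
    try {
      const stream = session.promptStreaming(promptText, {
        signal: controller.signal,
      });
      for await (const chunk of stream) {
        fullText += chunk;
        onChunk(fullText);
      }
      return { text: fullText, cancelled: false };
    } catch (err) {
      // A cancellation is an expected outcome: keep the partial text.
      if (err && err.name === "AbortError") {
        return { text: fullText, cancelled: true };
      }
      throw err; // anything else is a genuine error
    }
  })();

  return { done, cancel: () => controller.abort() };
}
```

A cancel button can call handle.cancel(), and the code awaiting handle.done can decide whether to keep or discard the partial text.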

Step 5: Structured JSON Output with a Schema

The killer feature for real apps: the Prompt API accepts a JSON Schema as a response constraint, and the model output is guaranteed to conform. This is how you extract to-do items from free-form notes without regex gymnastics:

const todoSchema = {
  type: "object",
  properties: {
    todos: {
      type: "array",
      items: {
        type: "object",
        properties: {
          task: { type: "string" },
          priority: { type: "string", enum: ["high", "medium", "low"] },
          dueMention: { type: "string" },
        },
        required: ["task", "priority"],
      },
    },
  },
  required: ["todos"],
};
 
export async function extractTodos(session, noteText) {
  const raw = await session.prompt(
    `Extract action items from this note:\n\n${noteText}`,
    { responseConstraint: todoSchema }
  );
  return JSON.parse(raw).todos;
}

Because the constraint is enforced at the sampling level, the response is well-formed JSON that matches the schema, so JSON.parse should not throw on malformed output. That is a dramatic reliability improvement over "please respond with JSON" prompting.

Keep schemas shallow. Deeply nested schemas slow down constrained sampling noticeably on-device. Two levels of nesting, as above, is a sweet spot.
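
If the same parsing code also handles responses from an unconstrained source, such as a cloud fallback, a small runtime check is a cheap safety net. The validator below is a hand-written sketch that mirrors the shape of todoSchema; parseTodos and its error codes are assumptions for illustration, and a schema validation library would do the same job more generally.

```javascript
const PRIORITIES = new Set(["high", "medium", "low"]);

// Parses raw model text and checks it against the shape of todoSchema.
// Returns { ok: true, todos } or { ok: false, error } instead of throwing.
function parseTodos(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return { ok: false, error: "invalid-json" };
  }
  if (data === null || typeof data !== "object" || !Array.isArray(data.todos)) {
    return { ok: false, error: "missing-todos-array" };
  }
  for (const item of data.todos) {
    const valid =
      item !== null &&
      typeof item === "object" &&
      typeof item.task === "string" &&
      PRIORITIES.has(item.priority) &&
      // dueMention is optional, but must be a string when present.
      (item.dueMention === undefined || typeof item.dueMention === "string");
    if (!valid) return { ok: false, error: "invalid-todo-item" };
  }
  return { ok: true, todos: data.todos };
}
```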

Step 6: Manage Context with Session Cloning and Quotas

Gemini Nano's context window is small compared to cloud frontier models. The session object exposes usage accounting so you can react before hitting the wall:

console.log(session.inputUsage);  // tokens consumed so far
console.log(session.inputQuota);  // total tokens available
 
const remaining = session.inputQuota - session.inputUsage;
if (remaining < 1000) {
  // Context is nearly full — start fresh
}

For stateless operations (summarize this, rewrite that), avoid polluting one long-lived session. Create a base session once — paying the system-prompt cost a single time — and clone it per operation:

const baseSession = await createSession();
 
async function runIsolated(promptText) {
  const clone = await baseSession.clone();
  try {
    return await clone.prompt(promptText);
  } finally {
    clone.destroy();
  }
}

clone() copies the initial prompts but not the accumulated conversation, giving each operation a clean, cheap context. Always destroy() clones — on-device sessions hold GPU memory.
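
For long-lived sessions, it can help to wrap the quota arithmetic in one place so every caller applies the same threshold. The helper below is a minimal sketch that only reads inputUsage and inputQuota; contextStatus and the default reserve of 1000 tokens are assumptions, so tune the reserve to the size of your typical prompt.

```javascript
// Summarizes how full a session's context is.
// `session` only needs numeric inputUsage and inputQuota properties.
function contextStatus(session, reserveTokens = 1000) {
  const remaining = session.inputQuota - session.inputUsage;
  return {
    remaining,
    usedFraction: session.inputUsage / session.inputQuota,
    // True when fewer than `reserveTokens` tokens are left.
    shouldReset: remaining < reserveTokens,
  };
}
```

When shouldReset is true, one option is to destroy the current session and replace it with a fresh clone of the base session.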

Step 7: The Hybrid Fallback Pattern

Production reality: a meaningful share of your users will be on mobile, on Firefox, or on machines that fail the hardware bar. The right architecture treats on-device AI as a progressive enhancement over a server route.

// lib/smart-ai.js
import { detectOnDeviceAI } from "./ai-detect.js";
 
let session = null;
 
export async function smartPrompt(promptText) {
  const status = await detectOnDeviceAI();
 
  if (status.supported && status.ready) {
    session ??= await LanguageModel.create();
    return {
      source: "on-device",
      text: await session.prompt(promptText),
    };
  }
 
  // Fallback: server route proxying a cloud model
  const res = await fetch("/api/ai", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt: promptText }),
  });
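  if (!res.ok) {
    // Surface HTTP failures instead of returning { text: undefined }
    throw new Error(`Cloud fallback failed with status ${res.status}`);
  }
  const data = await res.json();
  return { source: "cloud", text: data.text };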
}

And the matching Next.js route handler using the Claude API as the cloud tier:

// app/api/ai/route.ts
import Anthropic from "@anthropic-ai/sdk";
 
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
 
export async function POST(req: Request) {
  const { prompt } = await req.json();
 
  const message = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 512,
    messages: [{ role: "user", content: prompt }],
  });
 
  const text = message.content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");
 
  return Response.json({ text });
}

This pattern gives you the best of both worlds: zero-cost, private, low-latency inference for capable clients, and universal coverage for everyone else. Surface the source field in your UI — a small "processed on your device" badge is a genuine trust signal, especially for privacy-sensitive audiences.
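
A variation worth considering is to also fall back when the on-device call itself fails, for example on a quota error. The sketch below expresses that with injected functions so the routing logic can be unit tested without a browser. createSmartPrompt and its three parameters are hypothetical names, and a real implementation might choose not to retry requests the user cancelled.

```javascript
// Builds a prompt function from three injected async functions:
//   detect()        -> { supported, ready }, as returned by detectOnDeviceAI()
//   runLocal(text)  -> string from the on-device session
//   runCloud(text)  -> string from the server route
function createSmartPrompt({ detect, runLocal, runCloud }) {
  return async function smartPrompt(promptText) {
    const status = await detect();

    if (status.supported && status.ready) {
      try {
        return { source: "on-device", text: await runLocal(promptText) };
      } catch (err) {
        // Local failure: log it so the fallback is not silent,
        // then continue to the cloud path below.
        console.warn("On-device prompt failed, using cloud:", err.name);
      }
    }

    return { source: "cloud", text: await runCloud(promptText) };
  };
}
```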

Step 8: Multimodal Prompts (Images and Audio)

Recent Chrome versions extend the Prompt API to accept image and audio input. Declare the expected input types when creating the session, then pass content parts:

const session = await LanguageModel.create({
  expectedInputs: [{ type: "image" }],
});
 
const fileInput = document.querySelector("#screenshot");
const file = fileInput.files[0];
 
const description = await session.prompt([
  {
    role: "user",
    content: [
      { type: "text", value: "Describe this screenshot for alt text." },
      { type: "image", value: file },
    ],
  },
]);

On-device image understanding unlocks features that were previously unthinkable for privacy reasons — auto-generating alt text for user photos, classifying screenshots, reading receipts — all without a single byte leaving the machine.
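
Image input may not be available everywhere text input is, so it is reasonable to check before showing an upload control. The sketch below assumes that availability() accepts the same expectedInputs option as create(); verify that against the Chrome version you target. imageAvailability is a hypothetical helper name, and the api parameter exists only so the function can be exercised outside Chrome.

```javascript
// Resolves to the availability string for image-capable sessions,
// or "unavailable" when the LanguageModel global does not exist.
// Assumption: availability() takes the same options bag as create().
async function imageAvailability(api = globalThis.LanguageModel) {
  if (!api) return "unavailable";
  return api.availability({
    expectedInputs: [{ type: "image" }],
  });
}
```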

Testing Your Implementation

Verify each layer independently:

Availability states. In chrome://flags, the on-device model can be toggled; test your UI against unavailable, downloadable, and available. Also test in Firefox to confirm the fallback path engages.
Download UX. Clear the model via chrome://on-device-internals, reload, and confirm your progress bar renders during the fetch.
Structured output. Feed the to-do extractor deliberately messy notes ("maybe buy milk?? URGENT: call Sami friday") and assert the parsed array matches the schema.
Quota behavior. Loop a long prompt until inputUsage approaches inputQuota and confirm your session-reset logic fires.
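
For automated tests outside Chrome, a hand-rolled stub can stand in for the LanguageModel global. The sketch below is a deliberately small fake that covers only availability(), create(), prompt(), and destroy(); createFakeLanguageModel is a hypothetical name, and the plain Error it throws is a simplification of whatever the real API rejects with.

```javascript
// Minimal stand-in for the LanguageModel global, for unit tests only.
// `reply` is either a fixed string or a function of the prompt text.
function createFakeLanguageModel({ availability = "available", reply = "ok" } = {}) {
  return {
    async availability() {
      return availability;
    },
    async create() {
      if (availability === "unavailable") {
        throw new Error("model unavailable"); // simplified error
      }
      let destroyed = false;
      return {
        async prompt(text) {
          if (destroyed) throw new Error("session destroyed");
          return typeof reply === "function" ? reply(text) : reply;
        },
        destroy() {
          destroyed = true;
        },
      };
    },
  };
}

// Possible usage in a test setup file:
//   globalThis.LanguageModel = createFakeLanguageModel({ availability: "downloadable" });
```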

Troubleshooting

LanguageModel.availability() returns unavailable on capable hardware. Check chrome://on-device-internals for the real reason — most often insufficient disk space (the 22 GB requirement is enforced with a safety margin) or an enterprise policy disabling generative AI features.

The first create() hangs forever. The download only proceeds while Chrome considers the network unmetered. On hotspots, Chrome may defer it silently. Surface a "waiting for Wi-Fi" hint if availability() stays at downloading.

Output quality feels below cloud models. It is — Gemini Nano is a small model. Keep tasks narrow (summarize, rewrite, extract, classify) and give it few-shot examples via initialPrompts. Reserve open-ended generation for your cloud tier.

QuotaExceededError on prompt. A single prompt exceeded the context window. Chunk long inputs, or route oversized requests straight to the cloud fallback.
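
Chunking can be as simple as splitting on paragraph boundaries with a size cap. The function below is a rough sketch: it measures characters, which is only a loose proxy for tokens, and chunkText and its maxChars parameter are names chosen for this example.

```javascript
// Splits text into chunks of at most maxChars characters, preferring
// paragraph boundaries (blank lines). A paragraph longer than maxChars
// is cut at the limit.
function chunkText(text, maxChars) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError("maxChars must be a positive integer");
  }
  const chunks = [];
  let current = "";

  for (const paragraph of text.split(/\n\s*\n/)) {
    let rest = paragraph.trim();
    if (rest === "") continue;

    // Hard-split any paragraph that cannot fit in a single chunk.
    while (rest.length > maxChars) {
      if (current !== "") {
        chunks.push(current);
        current = "";
      }
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    if (rest === "") continue;

    // Append to the current chunk if it still fits, else start a new one.
    const joined = current === "" ? rest : current + "\n\n" + rest;
    if (joined.length <= maxChars) {
      current = joined;
    } else {
      chunks.push(current);
      current = rest;
    }
  }

  if (current !== "") chunks.push(current);
  return chunks;
}
```

Each chunk can then be summarized in its own cloned session, with the partial summaries combined in a final prompt.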

Next Steps

Pair this with the task-specific Summarizer and Rewriter APIs for even faster, purpose-tuned operations
Read our guide on running Transformers.js models with WebGPU for cross-browser on-device AI that you fully control
Explore the Claude API for the cloud tier of your hybrid architecture
Add WebMCP tools so AI agents visiting your site can call your features directly

Conclusion

Chrome's built-in Prompt API turns the browser into an AI runtime. You learned how to detect availability without breaking unsupported browsers, download Gemini Nano with respectful UX, stream responses, enforce structured JSON output with schemas, manage context quotas with session cloning, handle images, and — most importantly — wrap it all in a hybrid pattern that degrades gracefully to a cloud model. On-device AI is not a replacement for cloud inference; it is a new, free, private tier in your stack. The apps that feel magical in 2026 are the ones using both.

Chrome Built-in AI: Build On-Device AI Features with the Prompt API and Gemini Nano

