Every AI tutorial you have read this year probably started the same way: get an API key, set up billing, send your users' data to a third-party server. This one does the opposite. By the end of this tutorial, you will have a Next.js app that runs real machine learning models — text embeddings, a chat LLM, and Whisper speech recognition — entirely inside the user's browser, accelerated by the GPU through WebGPU.

The library making this possible is Transformers.js from Hugging Face. Now at version 4, it runs ONNX-converted models on WebAssembly or WebGPU, supports over 150 model architectures, and serves more than a million unique users every month. The WebGPU backend delivers inference that is often 10 to 100 times faster than the WASM fallback, fast enough to make small LLMs genuinely usable on consumer laptops.

Why does this matter? Three reasons that hit especially hard if you build for users in the MENA region:

Zero per-token cost. The model downloads once, caches in the browser, and every inference after that is free. No usage bills that scale with your traffic.
Total privacy. Voice recordings, documents, and queries never leave the device. For healthcare, legal, or government use cases with strict data-residency requirements, this is not a nice-to-have; it may be the only architecture that qualifies.
Offline resilience. Once cached, models work on flaky connections, on trains, and behind restrictive firewalls.

Prerequisites

Before starting, make sure you have:

Node.js 20+ and npm or pnpm installed
Basic knowledge of React and Next.js (App Router)
A WebGPU-capable browser: Chrome or Edge 113+, or recent Firefox/Safari versions with WebGPU enabled
A machine with a GPU (integrated graphics are fine — Apple Silicon works beautifully)

No Hugging Face account, no API keys, and no Python required.

What You'll Build

A single-page "Private AI Toolbox" with three tabs, each demonstrating a different pipeline:

Semantic search — type notes, embed them with a sentence-transformer model, and search by meaning instead of keywords.
Local chat — a streaming conversation with Qwen2.5-0.5B-Instruct, a quantized LLM running on your GPU.
Voice transcription — record audio and transcribe it with Whisper, entirely offline.

All inference runs in a Web Worker so the UI never freezes, and every model is cached by the browser after the first download.

Step 1: Project Setup

Create a fresh Next.js project and install Transformers.js:

npx create-next-app@latest browser-ai --typescript --app --tailwind
cd browser-ai
npm install @huggingface/transformers

Note the package name: the modern library lives at @huggingface/transformers. The old @xenova/transformers package is the legacy v2 line — avoid it for new projects.

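Transformers.js ships with Node.js bindings (onnxruntime-node, sharp) that must not be bundled into client code. The fix is a pair of webpack aliases that skip them. This hook only takes effect when Next.js bundles with webpack, so a Turbopack build needs equivalent alias configuration. If create-next-app generated next.config.ts or next.config.mjs instead of next.config.js, keep the same webpack function and export the object with export default. In next.config.js it looks like this: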

/** @type {import('next').NextConfig} */
const nextConfig = {
  webpack: (config) => {
    config.resolve.alias = {
      ...config.resolve.alias,
      sharp$: false,
      "onnxruntime-node$": false,
    };
    return config;
  },
};
 
module.exports = nextConfig;

Without this, the build fails with cryptic errors about native modules. This is the single most common setup mistake with Transformers.js in Next.js.

Step 2: Understand Devices and Quantization

Two options control how a model runs, and getting them right is the difference between a snappy app and a frozen tab.

device selects the backend:

"wasm" — WebAssembly on the CPU. Works everywhere, slowest.
"webgpu" — GPU acceleration. Massively faster, but requires browser support.

dtype selects the quantization level — how aggressively the model weights are compressed:

dtype	Precision	Typical use
`fp32`	Full 32-bit	WebGPU default, largest download
`fp16`	Half precision	Good WebGPU balance
`q8`	8-bit quantized	WASM default, small and accurate
`q4`	4-bit quantized	LLMs in the browser — smallest download

For a 0.5-billion-parameter LLM, q4 brings the download to roughly 350 MB. That is large for a web asset, but it downloads once and stays in the browser cache until the user clears site data or the browser evicts it under storage pressure.
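To get a feel for how dtype drives download size, here is a back-of-the-envelope sketch. It multiplies the parameter count by the nominal bits per weight, so treat the result as a lower bound: the roughly 350 MB quoted above for q4 is higher than the nominal 250 MB, presumably because real files carry extra data or keep some tensors at higher precision. The function names are illustrative only.

```typescript
// Nominal bits per weight for each dtype in the table above.
const BITS_PER_WEIGHT = { fp32: 32, fp16: 16, q8: 8, q4: 4 } as const;
type Dtype = keyof typeof BITS_PER_WEIGHT;

// Lower-bound estimate: parameters x bits per weight, converted to bytes.
function estimateWeightBytes(parameterCount: number, dtype: Dtype): number {
  return (parameterCount * BITS_PER_WEIGHT[dtype]) / 8;
}

function toMB(bytes: number): number {
  return Math.round(bytes / 1e6);
}

const halfBillion = 0.5e9;
for (const dtype of Object.keys(BITS_PER_WEIGHT) as Dtype[]) {
  // Logs 2000, 1000, 500 and 250 MB for fp32, fp16, q8 and q4.
  console.log(dtype, toMB(estimateWeightBytes(halfBillion, dtype)), "MB");
}
```

The same arithmetic explains why fp32 is rarely a sensible choice for an in-browser LLM: the nominal size is eight times that of q4.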

Detecting WebGPU support takes only a few lines. Create lib/gpu.ts:

export async function detectWebGPU(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;
  try {
    const adapter = await (navigator as any).gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

We will use this to pick webgpu when available and fall back to wasm otherwise, so the app works for every visitor.
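One way to turn the detection result into pipeline options is sketched below. The helper chooseLoadOptions and its isLLM flag are hypothetical and not part of Transformers.js. It requests q4 only for the chat model and otherwise omits dtype, so the per-device defaults from the table apply.

```typescript
type Device = "webgpu" | "wasm";
type Dtype = "fp32" | "fp16" | "q8" | "q4";

interface LoadOptions {
  device: Device;
  dtype?: Dtype; // omitted means "use the default for this device"
}

// Hypothetical helper: map the detection result and model kind to options.
function chooseLoadOptions(hasWebGPU: boolean, isLLM: boolean): LoadOptions {
  const device: Device = hasWebGPU ? "webgpu" : "wasm";
  // LLMs get the smallest download; other models keep the device default.
  return isLLM ? { device, dtype: "q4" } : { device };
}

console.log(chooseLoadOptions(true, true));   // { device: 'webgpu', dtype: 'q4' }
console.log(chooseLoadOptions(false, false)); // { device: 'wasm' }
```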

Step 3: The Web Worker and Pipeline Singleton

Model inference is heavy. If you run it on the main thread, typing lags, animations stutter, and the browser may show a "page unresponsive" warning. The fix is the canonical Transformers.js pattern: a Web Worker that owns the models, plus a singleton so each pipeline loads exactly once.

Create app/worker.ts:

import {
  pipeline,
  TextStreamer,
  env,
} from "@huggingface/transformers";
 
// Models always come from the Hugging Face Hub (cached after first load)
env.allowLocalModels = false;
 
// One cached pipeline per task and model: load once, reuse afterwards
class Pipelines {
  static instances: Record<string, Promise<any>> = {};

  static get(task: string, model: string, options: object = {}) {
    const key = `${task}:${model}`;
    if (!(key in this.instances)) {
      this.instances[key] = pipeline(task as any, model, {
        ...options,
        progress_callback: (p: any) =>
          self.postMessage({ status: "progress", task, data: p }),
      }).catch((err: unknown) => {
        // Drop failed loads so a later request can retry
        delete this.instances[key];
        throw err;
      });
    }
    return this.instances[key];
  }
}
 
self.addEventListener("message", async (event: MessageEvent) => {
  const { type, payload, device } = event.data;
 
  try {
    switch (type) {
      case "embed": {
        const extractor = await Pipelines.get(
          "feature-extraction",
          "mixedbread-ai/mxbai-embed-xsmall-v1",
          { device },
        );
        const output = await extractor(payload.texts, {
          pooling: "mean",
          normalize: true,
        });
        self.postMessage({
          status: "complete",
          type: "embed",
          data: output.tolist(),
        });
        break;
      }
 
      case "chat": {
        const generator = await Pipelines.get(
          "text-generation",
          "onnx-community/Qwen2.5-0.5B-Instruct",
          { device, dtype: "q4" },
        );
        const streamer = new TextStreamer(generator.tokenizer, {
          skip_prompt: true,
          callback_function: (text: string) =>
            self.postMessage({ status: "token", data: text }),
        });
        const result = await generator(payload.messages, {
          max_new_tokens: 512,
          do_sample: false,
          streamer,
        });
        self.postMessage({
          status: "complete",
          type: "chat",
          data: result[0].generated_text.at(-1).content,
        });
        break;
      }
 
      case "transcribe": {
        const transcriber = await Pipelines.get(
          "automatic-speech-recognition",
          "onnx-community/whisper-tiny.en",
          { device },
        );
        const output = await transcriber(payload.audio);
        self.postMessage({
          status: "complete",
          type: "transcribe",
          data: output.text,
        });
        break;
      }
    }
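  } catch (err: unknown) {
    // Include the task type so the UI can tell which request failed
    const message = err instanceof Error ? err.message : String(err);
    self.postMessage({ status: "error", type, data: message });
  }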
});

Three things to notice:

progress_callback fires during model download with file names and percentages — essential UX for a 350 MB download.
TextStreamer posts each generated token back to the UI as it is produced, so chat feels live instead of frozen-then-dumped.
The singleton map means switching tabs never re-downloads or re-initializes a model.
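The singleton idea is independent of Transformers.js, so it can be shown in isolation. The sketch below is a generic keyed promise cache; PromiseCache and fakeLoad are illustrative names, and the loader stands in for a real pipeline() call. It also evicts failed loads, which is a design choice rather than something the library requires.

```typescript
// Generic keyed cache of in-flight or completed loads.
class PromiseCache<T> {
  private entries = new Map<string, Promise<T>>();

  get(key: string, load: () => Promise<T>): Promise<T> {
    const cached = this.entries.get(key);
    if (cached) return cached;

    const pending = load().catch((err: unknown) => {
      // Evict failed loads so the next call retries instead of
      // receiving the same rejected promise again.
      this.entries.delete(key);
      throw err;
    });
    this.entries.set(key, pending);
    return pending;
  }
}

// Stand-in loader; a real worker would call pipeline() here.
let loads = 0;
const cache = new PromiseCache<string>();
const fakeLoad = async () => {
  loads += 1;
  return "model";
};

const first = cache.get("feature-extraction:demo", fakeLoad);
const second = cache.get("feature-extraction:demo", fakeLoad);
console.log(first === second, loads); // true 1
```

Caching the promise rather than the resolved value matters: two requests that arrive while the model is still downloading share one download instead of starting two.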

Step 4: The React Hook

Now a hook that talks to the worker. Create app/useAI.ts:

"use client";
 
import { useEffect, useRef, useState, useCallback } from "react";
import { detectWebGPU } from "@/lib/gpu";
 
export function useAI() {
  const worker = useRef<Worker | null>(null);
  const [device, setDevice] = useState<"webgpu" | "wasm">("wasm");
  const [progress, setProgress] = useState<string>("");
  const [streamText, setStreamText] = useState<string>("");
  // Pending promise callbacks, keyed by task type
  const pending = useRef<
    Map<string, { resolve: (data: any) => void; reject: (err: Error) => void }>
  >(new Map());

  useEffect(() => {
    detectWebGPU().then((ok) => setDevice(ok ? "webgpu" : "wasm"));

    worker.current = new Worker(new URL("./worker.ts", import.meta.url), {
      type: "module",
    });

    worker.current.addEventListener("message", (e: MessageEvent) => {
      const { status, type, data } = e.data;
      if (status === "progress" && data.status === "progress") {
        setProgress(`${data.file}: ${Math.round(data.progress)}%`);
      } else if (status === "token") {
        setStreamText((prev) => prev + data);
      } else if (status === "complete") {
        setProgress("");
        pending.current.get(type)?.resolve(data);
        pending.current.delete(type);
      } else if (status === "error") {
        // Reject every waiting caller so the UI never hangs on a failed task
        setProgress("");
        pending.current.forEach(({ reject }) => reject(new Error(String(data))));
        pending.current.clear();
      }
    });

    return () => worker.current?.terminate();
  }, []);

  const run = useCallback(
    (type: string, payload: object): Promise<any> => {
      setStreamText("");
      return new Promise((resolve, reject) => {
        pending.current.set(type, { resolve, reject });
        worker.current?.postMessage({ type, payload, device });
      });
    },
    [device],
  );
 
  return { run, device, progress, streamText };
}

The hook exposes four things: a promise-based run() for any task, the detected device, download progress for loading indicators, and streamText that accumulates chat tokens in real time.
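Because the hook keys its callbacks by task type, two overlapping calls of the same type share one slot. A possible refinement, sketched here under the assumption that the worker echoes back an id field (which the worker in Step 3 does not do), is to correlate requests by a unique id. PendingRequests is an illustrative name.

```typescript
type Settle = {
  resolve: (data: unknown) => void;
  reject: (err: Error) => void;
};

class PendingRequests {
  private nextId = 0;
  private pending = new Map<number, Settle>();

  // Register a request; the caller sends `id` along with the worker message.
  create(): { id: number; promise: Promise<unknown> } {
    const id = this.nextId++;
    const promise = new Promise<unknown>((resolve, reject) => {
      this.pending.set(id, { resolve, reject });
    });
    return { id, promise };
  }

  // Called from the worker "message" handler with the echoed id.
  settle(id: number, outcome: { data?: unknown; error?: string }): boolean {
    const entry = this.pending.get(id);
    if (!entry) return false; // unknown or already settled
    this.pending.delete(id);
    if (outcome.error !== undefined) entry.reject(new Error(outcome.error));
    else entry.resolve(outcome.data);
    return true;
  }

  get size(): number {
    return this.pending.size;
  }
}

const requests = new PendingRequests();
const a = requests.create();
const b = requests.create();
requests.settle(b.id, { data: "second finished first" });
console.log(a.id !== b.id, requests.size); // true 1
```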

Step 5: Semantic Search with Embeddings

The embedding model converts text into 384-dimensional vectors where similar meanings land close together. Since the worker already normalizes the vectors, cosine similarity reduces to a dot product. Create app/components/SemanticSearch.tsx:

"use client";
 
import { useState } from "react";
 
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}
 
export function SemanticSearch({ run }: { run: Function }) {
  const [notes, setNotes] = useState<string[]>([
    "The quarterly invoice for the Tunis office is due Friday",
    "Couscous recipe: steam twice, never boil",
    "WebGPU shaders compile asynchronously in Chrome",
  ]);
  const [query, setQuery] = useState("");
  const [results, setResults] = useState<
    Array<[string, number]>
  >([]);
 
  async function search() {
    const vectors: number[][] = await run("embed", {
      texts: [query, ...notes],
    });
    const [queryVec, ...noteVecs] = vectors;
    const scored = notes
      .map((note, i): [string, number] => [note, dot(queryVec, noteVecs[i])])
      .sort((a, b) => b[1] - a[1]);
    setResults(scored);
  }
 
  return (
    <div className="space-y-4">
      <input
        className="w-full rounded border p-2"
        value={query}
        onChange={(e) => setQuery(e.target.value)}
        placeholder="Search by meaning, e.g. 'cooking instructions'"
      />
      <button onClick={search} className="rounded bg-blue-600 px-4 py-2 text-white">
        Search
      </button>
      <ul>
        {results.map(([note, score]) => (
          <li key={note} className="border-b py-2">
            <span className="font-mono text-sm text-gray-500">
              {score.toFixed(3)}
            </span>{" "}
            {note}
          </li>
        ))}
      </ul>
    </div>
  );
}

Try searching "cooking instructions" — the couscous note ranks first even though it shares zero keywords with the query. That is semantic search, running locally, in milliseconds, on a model that downloaded in seconds.
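To see why a plain dot product is enough here, recall that cosine similarity is dot(a, b) divided by the product of the two vector lengths. Once both vectors have length 1, that denominator is 1. The following self-contained check uses tiny two-dimensional vectors in place of the 384-dimensional embeddings.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Scale a vector to unit length.
function normalize(v: number[]): number[] {
  const length = Math.sqrt(dot(v, v));
  return v.map((x) => x / length);
}

// Full cosine similarity, valid for vectors of any length.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

const a = [3, 4];
const b = [4, 3];
console.log(cosine(a, b)); // 0.96
console.log(dot(normalize(a), normalize(b))); // same value up to rounding
```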

Step 6: Streaming Chat with a Local LLM

The chat tab sends a message history and renders tokens as they stream in. Create app/components/LocalChat.tsx:

"use client";
 
import { useState } from "react";
 
type Message = { role: string; content: string };
 
export function LocalChat({
  run,
  streamText,
}: {
  run: Function;
  streamText: string;
}) {
  const [messages, setMessages] = useState<Message[]>([
    { role: "system", content: "You are a concise, helpful assistant." },
  ]);
  const [input, setInput] = useState("");
  const [busy, setBusy] = useState(false);
 
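  async function send() {
    const text = input.trim();
    if (!text || busy) return; // ignore empty messages and double submits
    const next = [...messages, { role: "user", content: text }];
    setMessages(next);
    setInput("");
    setBusy(true);
    try {
      const reply: string = await run("chat", { messages: next });
      setMessages([...next, { role: "assistant", content: reply }]);
    } finally {
      setBusy(false); // re-enable the form even if generation fails
    }
  }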
 
  return (
    <div className="space-y-4">
      {messages
        .filter((m) => m.role !== "system")
        .map((m, i) => (
          <p key={i} className={m.role === "user" ? "text-right" : ""}>
            <strong>{m.role}:</strong> {m.content}
          </p>
        ))}
      {busy && streamText && (
        <p>
          <strong>assistant:</strong> {streamText}
        </p>
      )}
      <div className="flex gap-2">
        <input
          className="flex-1 rounded border p-2"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={(e) => e.key === "Enter" && !busy && send()}
        />
        <button
          onClick={send}
          disabled={busy}
          className="rounded bg-blue-600 px-4 py-2 text-white disabled:opacity-50"
        >
          Send
        </button>
      </div>
    </div>
  );
}

The first message triggers the q4 model download (roughly 350 MB — show the progress string from the hook prominently). After that, the model loads from cache in a couple of seconds, and on WebGPU a half-billion-parameter model generates at a very usable pace on ordinary laptops.
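A model is usually split across several files, so a per-file percentage can jump around. One hedged way to show a single overall figure is to track bytes per file and sum them. The hook above only uses the file and progress fields, so the loaded and total fields in this sketch are assumptions about the progress event that you should verify against your library version. DownloadTracker is an illustrative name.

```typescript
// Assumed per-file progress shape; `loaded` and `total` are assumptions.
interface FileProgress {
  file: string;
  loaded: number; // bytes received so far
  total: number;  // bytes expected for this file
}

class DownloadTracker {
  private files = new Map<string, FileProgress>();

  // Remember the latest event for each file.
  update(event: FileProgress): void {
    this.files.set(event.file, event);
  }

  // Overall percentage across every file seen so far, weighted by size.
  percent(): number {
    let loaded = 0;
    let total = 0;
    this.files.forEach((f) => {
      loaded += f.loaded;
      total += f.total;
    });
    return total === 0 ? 0 : Math.round((loaded / total) * 100);
  }
}

const tracker = new DownloadTracker();
tracker.update({ file: "model.onnx", loaded: 100, total: 300 });
tracker.update({ file: "tokenizer.json", loaded: 100, total: 100 });
console.log(tracker.percent()); // 50
```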

Qwen2.5-0.5B is a small model — expect competent summaries, rewrites, and Q&A, not deep reasoning. For stronger quality, swap in a larger ONNX community model and keep the same code; only the model ID and download size change.

Step 7: Voice Transcription with Whisper

The final tab records microphone audio and feeds it to Whisper. The key detail: Whisper expects 16 kHz mono Float32 audio, so we decode the recording with an AudioContext pinned to 16000 Hz. Create app/components/Transcriber.tsx:

"use client";
 
import { useRef, useState } from "react";
 
export function Transcriber({ run }: { run: Function }) {
  const recorder = useRef<MediaRecorder | null>(null);
  const chunks = useRef<Blob[]>([]);
  const [recording, setRecording] = useState(false);
  const [text, setText] = useState("");
 
  async function start() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    recorder.current = new MediaRecorder(stream);
    chunks.current = [];
    recorder.current.ondataavailable = (e) => chunks.current.push(e.data);
    recorder.current.onstop = async () => {
      // Release the microphone right away instead of after transcription
      stream.getTracks().forEach((t) => t.stop());
      const blob = new Blob(chunks.current);
      const ctx = new AudioContext({ sampleRate: 16000 });
      const buffer = await ctx.decodeAudioData(await blob.arrayBuffer());
      await ctx.close(); // free the audio context once decoding is done
      const audio = buffer.getChannelData(0); // first channel as Float32Array
      const result: string = await run("transcribe", { audio });
      setText(result);
    };
    recorder.current.start();
    setRecording(true);
  }
 
  function stop() {
    recorder.current?.stop();
    setRecording(false);
  }
 
  return (
    <div className="space-y-4">
      <button
        onClick={recording ? stop : start}
        className={`rounded px-4 py-2 text-white ${
          recording ? "bg-red-600" : "bg-blue-600"
        }`}
      >
        {recording ? "Stop" : "Record"}
      </button>
      {text && <p className="rounded bg-gray-100 p-4">{text}</p>}
    </div>
  );
}

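whisper-tiny.en is a small model (about 39 million parameters): roughly 40 MB at 8-bit quantization, and larger if the pipeline loads fp32 weights, which the table in Step 2 lists as the WebGPU default. On WebGPU it can transcribe short clips in well under a second, depending on hardware. Your users' voice never touches a server, a meaningful guarantee for medical dictation, legal notes, or any regulated workflow.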
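Taking channel 0 is fine for a mono microphone, but it silently drops the other channel of a stereo recording. A possible alternative, sketched here with illustrative helper names, is to average all channels into one Float32Array before sending it to the worker. In the browser, the channel list could come from buffer.getChannelData(c) for each c below buffer.numberOfChannels.

```typescript
// Average all channels sample by sample into one mono Float32Array.
function downmixToMono(channels: Float32Array[]): Float32Array {
  if (channels.length === 0) return new Float32Array(0);
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Clip length in seconds at the 16 kHz rate the tutorial decodes to.
function durationSeconds(samples: Float32Array, sampleRate = 16000): number {
  return samples.length / sampleRate;
}

const left = new Float32Array([1, 0, -1, 0.5]);
const right = new Float32Array([0, 0, -1, 0.5]);
console.log(Array.from(downmixToMono([left, right]))); // [ 0.5, 0, -1, 0.5 ]
console.log(durationSeconds(new Float32Array(32000))); // 2
```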

Finally, wire the three components into app/page.tsx with simple tab state: pass run to each component (plus streamText to the chat tab), and render the device and progress indicators in a header. The hook handles everything else.

Testing Your Implementation

Run npm run dev and open Chrome 113+.
The header should read webgpu — if it says wasm, check chrome://gpu for WebGPU status.
In the search tab, the embedding model (about 30 MB) downloads with visible progress; queries return ranked results.
In DevTools, go to Application, then Cache storage — you will see transformers-cache holding the ONNX files. Reload the page: models now initialize from cache with no network traffic.
Toggle DevTools network throttling to Offline after the first load — everything still works. That is the whole point.

Troubleshooting

Build fails mentioning sharp or onnxruntime-node. You skipped the webpack aliases in Step 1. They are mandatory for client-side use.

navigator.gpu is undefined. The browser lacks WebGPU, or the page is served over plain HTTP. WebGPU requires a secure context — localhost counts, but a LAN IP does not.

First chat reply takes minutes. That is the one-time q4 download. Surface the progress callback in the UI so users see download percentages instead of a dead button.

Out-of-memory crash on mobile. A 350 MB LLM is too much for many phones. Detect memory with navigator.deviceMemory (implemented in Chromium-based browsers only, so treat undefined as unknown) and gate the chat tab, or offer a smaller model.
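A minimal sketch of such a gate follows. The function name, the 4 GiB threshold, and the policy for browsers that do not report memory are all assumptions to tune for your own audience, not values from Transformers.js.

```typescript
// Decide whether to offer the local chat tab.
// deviceMemoryGiB is the value of navigator.deviceMemory, or undefined
// in browsers that do not implement it.
function canOfferLocalChat(
  deviceMemoryGiB: number | undefined,
  minimumGiB = 4,       // assumed threshold, adjust after testing
  allowUnknown = true,  // what to do when the browser reports nothing
): boolean {
  if (deviceMemoryGiB === undefined) return allowUnknown;
  return deviceMemoryGiB >= minimumGiB;
}

console.log(canOfferLocalChat(2));                    // false
console.log(canOfferLocalChat(8));                    // true
console.log(canOfferLocalChat(undefined, 4, false));  // false
```

In the browser, the call would look like canOfferLocalChat((navigator as any).deviceMemory).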

Garbled transcription. Your audio is not 16 kHz mono. Always decode through new AudioContext({ sampleRate: 16000 }) and pass channel 0.

Next Steps

Add on-device RAG: chunk documents, embed them, store vectors in IndexedDB, and feed top matches into the chat prompt — a fully private retrieval pipeline.
Try multilingual Whisper (onnx-community/whisper-small) to transcribe Arabic and French audio for MENA users.
Explore the WebGPU guide and the onnx-community organization on Hugging Face for hundreds of ready-converted models.
Pair this with our Ollama local AI chatbot tutorial to compare browser inference with self-hosted server inference, or our Vercel AI Gateway guide for the hybrid cloud route.
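For the on-device RAG idea in the first item, the two pure pieces are chunking and top-k selection. The sketch below shows one possible shape for them; chunk sizes, the word-based splitting, and the function names are illustrative choices. It assumes the vectors are already normalized, as they are when they come from the embed task in Step 3.

```typescript
// Split text into overlapping windows of words.
function chunkWords(text: string, size = 50, overlap = 10): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const step = Math.max(1, size - overlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// Return the indices of the k chunks most similar to the query.
function topK(queryVec: number[], chunkVecs: number[][], k: number): number[] {
  const dot = (a: number[], b: number[]) =>
    a.reduce((sum, v, i) => sum + v * b[i], 0);
  return chunkVecs
    .map((vec, index) => ({ index, score: dot(queryVec, vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.index);
}

console.log(chunkWords("a b c d e", 3, 1)); // [ 'a b c', 'c d e' ]
console.log(topK([1, 0], [[0, 1], [1, 0], [0.6, 0.8]], 2)); // [ 1, 2 ]
```

The selected chunks would then be prepended to the user message before calling the chat task, and the vectors stored in IndexedDB so documents are embedded only once.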

Conclusion

You built a Next.js app that embeds, chats, and transcribes without a single server-side inference call. The architecture is small and repeatable: a Web Worker owning singleton pipelines, WebGPU with WASM fallback, quantized models cached by the browser, and a promise-based hook bridging worker and UI.

Browser-side AI will not replace frontier cloud models for hard reasoning. But for embeddings, transcription, classification, and light generation — the tasks that make up most production AI features — it offers something no API can: zero marginal cost, total privacy, and offline operation. For developers serving markets where data sovereignty and bandwidth costs are daily constraints, that is not a demo trick. It is an architecture worth shipping.