Most browser automation tools work like a tourist with a camera — they take a screenshot, send it to a vision model, guess pixel coordinates, and click. One CSS change breaks the guess. Alibaba's open-source Page Agent takes the opposite approach: it reads the DOM directly, the way a developer using the Elements panel would. The result is a TypeScript library that turns any web interface into something a language model can drive with plain sentences — no headless Chrome, no backend process, no image tokens.

This tutorial walks you through embedding Page Agent into a Next.js 15 application to build a floating AI copilot panel. By the end, a user types "Add a task called Deploy backend due Friday, high priority" and the agent fills the form and submits it — automatically.

Prerequisites

Before starting, ensure you have:

Node.js 20 or higher installed
Basic familiarity with React and TypeScript
An API key from an OpenAI-compatible provider (OpenAI, Anthropic, DeepSeek, or a local model via Ollama)
npm 10+ or pnpm 9+

What You'll Build

A Next.js 15 task management dashboard with a floating copilot panel powered by Page Agent. The copilot reads the live DOM of your app and interprets natural-language commands — filling form fields, clicking buttons, checking checkboxes — without any screenshots or coordinate guessing.

How Page Agent Works Under the Hood

Before writing code, it is worth understanding the architecture. When you call agent.execute(task), the Page Agent runtime:

Dehydrates the DOM into a compact text structure called a FlatDomTree — a flattened map of every interactive element and its semantic role.
Sends the tree plus the natural-language task to your configured LLM endpoint.
Receives a structured action from the model (click, type, scroll, select).
Applies the action to the real DOM node and loops until the task is complete.

No vision model is required because the model receives text rather than images, and text inference on a compressed DOM is typically faster and cheaper than image inference on a screenshot. Element targeting uses real DOM references rather than pixel coordinates, so it remains precise even when layouts reflow.
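The loop described above can be pictured with a toy, in-memory model. This is a sketch for intuition only, not Page Agent's implementation: FlatNode, Action, Planner, and runLoop are hypothetical names, the "page" is a plain array rather than a real DOM, and the planner is a hand-written stub standing in for the LLM call (synchronous here for brevity, where a real model call would be asynchronous).

```typescript
// Toy model of the observe -> plan -> act loop. All names are hypothetical.
type FlatNode = { ref: number; role: "textbox" | "button"; label: string; value: string };

type Action =
  | { kind: "type"; ref: number; text: string }
  | { kind: "click"; ref: number }
  | { kind: "done" };

// Stands in for the LLM: receives the serialized page plus the task, returns one action.
type Planner = (tree: string, task: string) => Action;

// Flatten the page into compact text the model can read.
function serialize(nodes: FlatNode[]): string {
  return nodes.map((n) => `[${n.ref}] ${n.role} "${n.label}" value="${n.value}"`).join("\n");
}

// Apply one action to the node it references and return a log entry.
function apply(nodes: FlatNode[], action: Action): string {
  if (action.kind === "done") return "done";
  const node = nodes.find((n) => n.ref === action.ref);
  if (!node) throw new Error(`No node with ref ${action.ref}`);
  if (action.kind === "type") {
    node.value = action.text;
    return `type ${node.label}`;
  }
  return `click ${node.label}`;
}

// Re-serialize before every step; the step cap keeps a confused planner from looping forever.
function runLoop(nodes: FlatNode[], task: string, plan: Planner, maxSteps = 10): string[] {
  const log: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = plan(serialize(nodes), task);
    log.push(apply(nodes, action));
    if (action.kind === "done") break;
  }
  return log;
}

const page: FlatNode[] = [
  { ref: 0, role: "textbox", label: "Task title", value: "" },
  { ref: 1, role: "button", label: "Add Task", value: "" },
];

// Stub planner: fill the title while it is empty, click the button once, then stop.
let clicked = false;
const stubPlanner: Planner = (tree) => {
  if (tree.includes('[0] textbox "Task title" value=""')) {
    return { kind: "type", ref: 0, text: "Deploy backend" };
  }
  if (!clicked) {
    clicked = true;
    return { kind: "click", ref: 1 };
  }
  return { kind: "done" };
};

const loopLog = runLoop(page, "Add a task called Deploy backend", stubPlanner);
```

The point of the sketch is that the planner only ever sees text and only ever answers with a reference to a node, which is why a layout change that moves pixels around does not affect targeting.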

Step 1: Create the Next.js Project

Scaffold a new Next.js 15 app with TypeScript and Tailwind:

npx create-next-app@latest page-agent-demo --typescript --app --tailwind
cd page-agent-demo

Install Page Agent:

npm install page-agent

Step 2: Configure Environment Variables

Create a .env.local file at the project root:

NEXT_PUBLIC_LLM_API_KEY=your_api_key_here
NEXT_PUBLIC_LLM_BASE_URL=https://api.openai.com/v1
NEXT_PUBLIC_LLM_MODEL=gpt-4o-mini

The NEXT_PUBLIC_ prefix exposes these values to the browser, which is required because Page Agent runs entirely client-side. Because the key ships in the client bundle, use a restricted or throwaway key during development. Step 10 covers how to move the API key server-side behind a proxy.
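A missing or empty variable tends to surface later as an opaque network error, so it can help to validate the three settings once at startup. The helper below is a hypothetical sketch (readLlmConfig, LlmConfig, and RawConfig are illustrative names, not part of Page Agent) that trims the values, strips trailing slashes from the base URL, and reports every missing setting in one message.

```typescript
// Hypothetical helper: validate the three LLM settings in one place.
interface LlmConfig {
  apiKey: string;
  baseURL: string;
  model: string;
}

// Same keys as LlmConfig, but every value may still be undefined.
type RawConfig = { [K in keyof LlmConfig]: string | undefined };

const clean = (value: string | undefined): string => (value ?? "").trim();

function readLlmConfig(raw: RawConfig): LlmConfig {
  const config: LlmConfig = {
    apiKey: clean(raw.apiKey),
    baseURL: clean(raw.baseURL).replace(/\/+$/, ""), // drop trailing slashes
    model: clean(raw.model),
  };
  // Collect every empty setting so the error names all of them at once.
  const missing = (Object.keys(config) as (keyof LlmConfig)[]).filter(
    (key) => config[key] === ""
  );
  if (missing.length > 0) {
    throw new Error(`Missing LLM settings: ${missing.join(", ")}`);
  }
  return config;
}

const exampleConfig = readLlmConfig({
  apiKey: "your_api_key_here",
  baseURL: "https://api.openai.com/v1/",
  model: "gpt-4o-mini",
});
```

In the app you would call it as readLlmConfig({ apiKey: process.env.NEXT_PUBLIC_LLM_API_KEY, baseURL: process.env.NEXT_PUBLIC_LLM_BASE_URL, model: process.env.NEXT_PUBLIC_LLM_MODEL }). The variables are spelled out literally because Next.js inlines NEXT_PUBLIC_ values at build time only for direct process.env.NAME references; a dynamic lookup such as process.env[name] is not replaced and comes back undefined in the browser.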

Step 3: Build the Task Dashboard

Create a task management page at app/dashboard/page.tsx. Note the explicit id attributes on each form element — Page Agent's DOM dehydration picks these up as semantic anchors, making targeting more reliable on complex forms.

"use client";
 
import { useState } from "react";
 
type Priority = "low" | "medium" | "high";
 
interface Task {
  id: string;
  title: string;
  dueDate: string;
  priority: Priority;
  done: boolean;
}
 
export default function Dashboard() {
  const [tasks, setTasks] = useState<Task[]>([]);
  const [title, setTitle] = useState("");
  const [dueDate, setDueDate] = useState("");
  const [priority, setPriority] = useState<Priority>("medium");
 
  function addTask() {
    if (!title.trim()) return;
    setTasks((prev) => [
      ...prev,
      { id: crypto.randomUUID(), title, dueDate, priority, done: false },
    ]);
    setTitle("");
    setDueDate("");
    setPriority("medium");
  }
 
  function toggleDone(id: string) {
    setTasks((prev) =>
      prev.map((t) => (t.id === id ? { ...t, done: !t.done } : t))
    );
  }
 
  return (
    <main className="max-w-2xl mx-auto p-8">
      <h1 className="text-2xl font-bold mb-6">Task Manager</h1>
 
      <section id="task-form" className="bg-gray-50 p-4 rounded-lg mb-8">
        <input
          id="task-title"
          placeholder="Task title"
          value={title}
          onChange={(e) => setTitle(e.target.value)}
          className="w-full border rounded p-2 mb-2"
        />
        <input
          id="task-due-date"
          type="date"
          value={dueDate}
          onChange={(e) => setDueDate(e.target.value)}
          className="w-full border rounded p-2 mb-2"
        />
        <select
          id="task-priority"
          value={priority}
          onChange={(e) => setPriority(e.target.value as Priority)}
          className="w-full border rounded p-2 mb-2"
        >
          <option value="low">Low</option>
          <option value="medium">Medium</option>
          <option value="high">High</option>
        </select>
        <button
          id="add-task-btn"
          onClick={addTask}
          className="w-full bg-blue-600 text-white rounded p-2"
        >
          Add Task
        </button>
      </section>
 
      <ul className="space-y-2">
        {tasks.map((task) => (
          <li key={task.id} className="border rounded p-3 flex items-center gap-3">
            <input
              type="checkbox"
              checked={task.done}
              onChange={() => toggleDone(task.id)}
            />
            <div className="flex-1">
              <p className={task.done ? "line-through text-gray-400" : ""}>
                {task.title}
              </p>
              <p className="text-xs text-gray-500">
                {task.dueDate} · {task.priority}
              </p>
            </div>
          </li>
        ))}
      </ul>
    </main>
  );
}

You do not have to add id attributes for Page Agent to work — it can infer elements from placeholder text, ARIA labels, and visible text — but explicit ids help on dense forms.
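One way to act on this is a small check, run in a unit test, that flags controls with nothing for the agent to read. The sketch below is an assumption-laden illustration: FieldInfo, labelFor, and findUnlabeled are hypothetical names, the field descriptions are written by hand rather than read from the DOM, and the order in which hints are tried is a guess, not Page Agent's documented rule.

```typescript
// Hypothetical description of a form control, written by hand for a unit test.
interface FieldInfo {
  tag: string;
  id?: string;
  ariaLabel?: string;
  placeholder?: string;
  text?: string; // visible text, such as a button caption
}

// Return the first non-empty hint. The priority order is an assumption.
function labelFor(field: FieldInfo): string | null {
  const candidates = [field.ariaLabel, field.placeholder, field.text, field.id];
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return null;
}

// Controls with no hint at all are the ones an agent is most likely to miss.
function findUnlabeled(fields: FieldInfo[]): FieldInfo[] {
  return fields.filter((field) => labelFor(field) === null);
}

const formFields: FieldInfo[] = [
  { tag: "input", id: "task-title", placeholder: "Task title" },
  { tag: "input", id: "task-due-date" },
  { tag: "select" }, // deliberately unlabeled to show the check firing
  { tag: "button", id: "add-task-btn", text: "Add Task" },
];

const unlabeled = findUnlabeled(formFields);
```

Here only the bare select is reported, which matches the advice above: any one of an id, a placeholder, an ARIA label, or visible text is enough to give the control a handle.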

Step 4: Create the Copilot Component

Create components/Copilot.tsx. The singleton pattern for agentInstance ensures a single Page Agent object is reused across re-renders, preserving internal page state between consecutive commands.

"use client";
 
import { useState, useCallback } from "react";
import { PageAgent } from "page-agent";
 
let agentInstance: PageAgent | null = null;
 
function getAgent(): PageAgent {
  if (!agentInstance) {
    agentInstance = new PageAgent({
      model: process.env.NEXT_PUBLIC_LLM_MODEL ?? "gpt-4o-mini",
      baseURL: process.env.NEXT_PUBLIC_LLM_BASE_URL ?? "https://api.openai.com/v1",
      apiKey: process.env.NEXT_PUBLIC_LLM_API_KEY ?? "",
      language: "en-US",
    });
  }
  return agentInstance;
}
 
type Status = "idle" | "running" | "done" | "error";
 
export function Copilot() {
  const [input, setInput] = useState("");
  const [status, setStatus] = useState<Status>("idle");
  const [lastTask, setLastTask] = useState("");
 
  const runTask = useCallback(async () => {
    const task = input.trim();
    if (!task || status === "running") return;
 
    setStatus("running");
    setLastTask(task);
    setInput("");
 
    try {
      const agent = getAgent();
      await agent.execute(task);
      setStatus("done");
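    } catch (err) {
      // Log the underlying error so failures can be diagnosed in the console.
      console.error("Page Agent task failed:", err);
      setStatus("error");
    }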
  }, [input, status]);
 
  const statusText =
    status === "idle"
      ? "Ready"
      : status === "running"
      ? "Working…"
      : status === "done"
      ? `Done: ${lastTask}`
      : "Something went wrong — try rephrasing the task";
 
  return (
    <div className="fixed bottom-4 right-4 w-80 bg-white shadow-xl rounded-xl p-4 border">
      <p className="text-xs font-semibold text-gray-500 mb-2">AI Copilot</p>
      <textarea
        rows={2}
        value={input}
        onChange={(e) => setInput(e.target.value)}
        placeholder="Tell the agent what to do…"
        className="w-full border rounded p-2 text-sm resize-none mb-2"
        onKeyDown={(e) => {
          if (e.key === "Enter" && !e.shiftKey) {
            e.preventDefault();
            runTask();
          }
        }}
      />
      <button
        onClick={runTask}
        disabled={status === "running"}
        className="w-full bg-indigo-600 text-white rounded p-2 text-sm disabled:opacity-50"
      >
        {status === "running" ? "Running…" : "Run"}
      </button>
      <p className="text-xs text-gray-400 mt-2">{statusText}</p>
    </div>
  );
}

Step 5: Mount the Copilot in the Dashboard

Update app/dashboard/page.tsx to import and render the Copilot. Add the import at the top of the file:

import { Copilot } from "@/components/Copilot";

Then add the component after the closing </main> tag inside the return statement:

  return (
    <>
      <main className="max-w-2xl mx-auto p-8">
        {/* existing dashboard markup */}
      </main>
      <Copilot />
    </>
  );

The copilot floats in the bottom-right corner over the full page.

Step 6: Use a Local Model with Ollama

If you prefer a fully local setup with no cloud API costs, Page Agent is compatible with any OpenAI-compatible server, including Ollama. Install Ollama and pull Qwen 2.5 7B — the model Alibaba's own documentation recommends for Page Agent tasks:

ollama pull qwen2.5:7b

Update your .env.local:

NEXT_PUBLIC_LLM_BASE_URL=http://localhost:11434/v1
NEXT_PUBLIC_LLM_API_KEY=ollama
NEXT_PUBLIC_LLM_MODEL=qwen2.5:7b

Qwen 2.5 7B runs comfortably on 8 GB of RAM, handles form-filling tasks accurately, and costs nothing per token. A frontier model is not required for most Page Agent use cases.
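Before wiring the local model into the copilot, it can be worth sending one request to Ollama's OpenAI-compatible endpoint by hand. The sketch below only builds the request so it can be inspected or passed to fetch; buildChatRequest and ChatRequest are hypothetical names, and the prompt is arbitrary.

```typescript
// Hypothetical helper: build a minimal OpenAI-style chat completion request.
interface ChatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  baseURL: string,
  apiKey: string,
  model: string,
  prompt: string
): ChatRequest {
  return {
    // OpenAI-compatible servers expose chat completions under <base>/chat/completions.
    url: `${baseURL.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

const smokeTest = buildChatRequest(
  "http://localhost:11434/v1",
  "ollama",
  "qwen2.5:7b",
  "Reply with OK"
);
// To send it from the browser console or a script:
//   const res = await fetch(smokeTest.url, smokeTest.init);
```

If that request fails, the problem lies with the Ollama server or the model name rather than with Page Agent, which narrows down where to look.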

Step 7: Test the Copilot

Start the development server:

npm run dev

Open http://localhost:3000/dashboard. In the copilot panel, try these example commands:

"Add a task called Fix login bug due 2026-08-01 with high priority" — the agent fills all three form fields and clicks Add Task.
"Mark all tasks done" — the agent checks every checkbox in sequence.
"Add three tasks: Deploy API, Write tests, and Update docs, all medium priority" — multi-step execution across three form submissions.

Enable verbose logging by passing verbose: true to the PageAgent constructor to see each step printed in the browser console, including the compressed FlatDomTree sent to the model.

Step 8: Restrict the Agent's Scope

In production you may want to confine the agent to a specific section of the page to prevent accidental interactions with navigation or account settings. Pass a context DOM element:

const formEl = document.getElementById("task-form");
 
await agent.execute("Fill in the form with the task details", {
  context: formEl ?? undefined,
});

When a context element is provided, the FlatDomTree covers only that element's subtree. This reduces token usage and narrows the agent's field of action.
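The effect of scoping is easy to see on a toy tree. The sketch below does not use Page Agent or a real DOM; ToyNode, collectInteractive, and findById are hypothetical names, and the set of tags treated as interactive is a simplification chosen for the example.

```typescript
// Toy page tree used to compare full-page and scoped traversal. Names are hypothetical.
interface ToyNode {
  tag: string;
  id?: string;
  children?: ToyNode[];
}

// Simplified notion of "interactive" for the purpose of the example.
const INTERACTIVE = new Set(["a", "button", "input", "select", "textarea"]);

// Depth-first collection of interactive nodes in document order.
function collectInteractive(node: ToyNode): ToyNode[] {
  const own: ToyNode[] = INTERACTIVE.has(node.tag) ? [node] : [];
  const nested = (node.children ?? []).map((child) => collectInteractive(child));
  return own.concat(...nested);
}

// Find the subtree root that plays the role of the context element.
function findById(node: ToyNode, id: string): ToyNode | null {
  if (node.id === id) return node;
  for (const child of node.children ?? []) {
    const hit = findById(child, id);
    if (hit) return hit;
  }
  return null;
}

const toyPage: ToyNode = {
  tag: "body",
  children: [
    { tag: "nav", children: [{ tag: "a" }, { tag: "a" }, { tag: "button" }] },
    {
      tag: "section",
      id: "task-form",
      children: [{ tag: "input" }, { tag: "input" }, { tag: "select" }, { tag: "button" }],
    },
  ],
};

const fullCount = collectInteractive(toyPage).length;
const scope = findById(toyPage, "task-form");
const scopedCount = scope ? collectInteractive(scope).length : 0;
```

On this toy page the full traversal yields seven interactive nodes and the scoped one yields four, and the navigation links never reach the model at all. On a real application with headers, sidebars, and footers the difference is usually much larger.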

Step 9: Chain Multi-Step Workflows

Page Agent supports multi-step reasoning out of the box — the agent loops (act, observe, act) until it judges the task complete. For longer workflows you can also chain execute calls programmatically in your application code:

async function onboardNewUser(name: string, role: string) {
  // Resolve the agent inside the function so it is only created in the browser.
  const agent = getAgent();
  await agent.execute(`Navigate to the Users section`);
  await agent.execute(`Click the Invite New User button`);
  await agent.execute(`Fill in the name field with "${name}" and set role to "${role}"`);
  await agent.execute(`Click Send Invitation`);
}

Each execute call inherits the agent's current view of the page state. You do not re-describe the DOM between steps — the agent re-reads it automatically at the start of each action.

You can also expose a sequence of tasks through the copilot UI by parsing a simple numbered list:

async function runSequence(tasks: string[]) {
  const agent = getAgent();
  for (const task of tasks) {
    await agent.execute(task);
  }
}
 
// Usage: pass ["Step 1: ...", "Step 2: ..."] from the textarea

This pattern is useful for wizard-style flows where each step must complete before the next one starts.
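Turning the textarea contents into that array takes only a few lines. The parser below is a sketch: parseSteps is a hypothetical name, and it assumes one step per line with an optional prefix such as "1.", "2)", or "Step 3:", which it strips before the text is handed to the agent.

```typescript
// Hypothetical parser: one step per line, optional numeric prefix, blank lines ignored.
function parseSteps(text: string): string[] {
  return text
    .split(/\r?\n/)
    // Strip prefixes like "1.", "2)", "3:" or "Step 4:"; a bare leading number is left alone.
    .map((line) => line.replace(/^\s*(?:step\s*)?\d+\s*[.):]\s*/i, "").trim())
    .filter((line) => line.length > 0);
}

const parsedSteps = parseSteps(
  [
    "1. Navigate to the Users section",
    "2) Click Invite New User",
    "",
    "Step 3: Click Send Invitation",
  ].join("\n")
);
```

The result can be passed straight to runSequence. Because the prefix pattern requires a separator after the number, a line that merely starts with a date or a quantity is passed through unchanged.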

Step 10: Add a Server-Side CORS Proxy

Exposing your API key in NEXT_PUBLIC_ environment variables is acceptable in development but is a security risk in production, because anyone who loads the page can read the key from the bundled JavaScript. Some cloud APIs also reject direct browser requests with CORS errors. The solution is a Next.js route handler that proxies the call server-side.

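Create app/api/llm/chat/completions/route.ts. OpenAI-compatible clients normally append /chat/completions to the configured base URL, so with a base URL of /api/llm the request should arrive at /api/llm/chat/completions. Confirm the exact path in the browser's network tab and move the file if Page Agent requests a different one: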

import { NextRequest, NextResponse } from "next/server";
 
export async function POST(req: NextRequest) {
  const body = await req.json();
  const res = await fetch(
    `${process.env.LLM_BASE_URL}/chat/completions`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.LLM_API_KEY}`,
      },
      body: JSON.stringify(body),
    }
  );
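  const data = await res.json();
  // Forward the upstream status so the client can see rate limits and auth failures.
  return NextResponse.json(data, { status: res.status });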
}

Move the real credentials to server-only variables (no NEXT_PUBLIC_ prefix) in .env.local:

LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=your_real_key_here

Then point the Page Agent client at the proxy:

NEXT_PUBLIC_LLM_BASE_URL=/api/llm
NEXT_PUBLIC_LLM_API_KEY=proxy
NEXT_PUBLIC_LLM_MODEL=gpt-4o-mini
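A proxy that forwards anything it receives hides the key but still lets any visitor spend it, so it is worth rejecting requests the copilot would never send. The validator below is a sketch under stated assumptions: validateProxyBody, ALLOWED_MODELS, and MAX_MESSAGES are hypothetical names, the allowlist contains only the model used in this tutorial, and the message cap is an arbitrary number to tune for your app.

```typescript
// Hypothetical guard for the proxy route: check the body before forwarding it upstream.
type Validation = { ok: true } | { ok: false; error: string };

const ALLOWED_MODELS = new Set(["gpt-4o-mini"]); // assumption: the only model the app uses
const MAX_MESSAGES = 50; // arbitrary cap, adjust to your workload

function validateProxyBody(input: unknown): Validation {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "Body must be a JSON object" };
  }
  const { model, messages } = input as { model?: unknown; messages?: unknown };
  if (typeof model !== "string" || !ALLOWED_MODELS.has(model)) {
    return { ok: false, error: "Model not allowed" };
  }
  if (!Array.isArray(messages) || messages.length === 0) {
    return { ok: false, error: "messages must be a non-empty array" };
  }
  if (messages.length > MAX_MESSAGES) {
    return { ok: false, error: `messages must contain at most ${MAX_MESSAGES} items` };
  }
  // The body itself is left untouched so other fields the client sends still reach the model.
  return { ok: true };
}

const accepted = validateProxyBody({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "hi" }],
});
const rejectedModel = validateProxyBody({
  model: "some-other-model",
  messages: [{ role: "user", content: "hi" }],
});
```

In the route handler you would call validateProxyBody(body) right after req.json() and, when ok is false, return NextResponse.json({ error: result.error }, { status: 400 }) instead of calling the upstream API. Authentication and rate limiting are still needed on top of this in a real deployment.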

Troubleshooting

The agent does not find my input. Add a visible label, placeholder, or id to the element. Page Agent reads semantic text; an unlabeled input with no placeholder is invisible to it.

Actions fire but React state does not update. Page Agent dispatches native browser events (click, input, change). If your components rely on non-standard event handling, ensure they respond to native DOM events.

Token costs are high. Switch to Qwen 2.5 7B via Ollama for free local inference, or use the context option to reduce FlatDomTree size by scoping the agent to a single section.

"Could not complete the task" error. Rephrase the command to be more specific. "Click the blue Add Task button" is more reliable than "submit the form" if the page has multiple submit elements.

Next Steps

Cross-tab automation — install the optional Page Agent Chrome extension to coordinate actions across multiple browser tabs.
Voice input — pipe Web Speech API transcriptions directly to agent.execute for a hands-free copilot.
Chain with Vercel AI SDK — use a chat interface for conversational responses and fall back to agent.execute for UI manipulation.
Explore @page-agent/core for lower-level control over FlatDomTree generation and action dispatch without the built-in UI panel.

Conclusion

Alibaba's Page Agent eliminates the infrastructure overhead that has historically made in-app AI copilots expensive to build. By reading the DOM instead of taking screenshots, it achieves precise, model-agnostic control in a package weighing under 50 kB with no backend process required. In under 30 minutes, you have a natural-language copilot running inside a real Next.js app, one that works with any OpenAI-compatible model, from GPT-4o-mini to a locally hosted Qwen 2.5.