Most browser automation tools work like a tourist with a camera — they take a screenshot, send it to a vision model, guess pixel coordinates, and click. One CSS change breaks the guess. Alibaba's open-source Page Agent takes the opposite approach: it reads the DOM directly, the way a developer using the Elements panel would. The result is a TypeScript library that turns any web interface into something a language model can drive with plain sentences — no headless Chrome, no backend process, no image tokens.
This tutorial walks you through embedding Page Agent into a Next.js 15 application to build a floating AI copilot panel. By the end, a user types "Add a task called Deploy backend due Friday, high priority" and the agent fills the form and submits it — automatically.
Prerequisites
Before starting, ensure you have:
- Node.js 20 or higher installed
- Basic familiarity with React and TypeScript
- An API key from an OpenAI-compatible provider (OpenAI, Anthropic, DeepSeek, or a local model via Ollama)
- npm 10+ or pnpm 9+
What You'll Build
A Next.js 15 task management dashboard with a floating copilot panel powered by Page Agent. The copilot reads the live DOM of your app and interprets natural-language commands — filling form fields, clicking buttons, checking checkboxes — without any screenshots or coordinate guessing.
How Page Agent Works Under the Hood
Before writing code, it is worth understanding the architecture. When you call agent.execute(task), the Page Agent runtime:
- Dehydrates the DOM into a compact text structure called a FlatDomTree — a flattened map of every interactive element and its semantic role.
- Sends the tree plus the natural-language task to your configured LLM endpoint.
- Receives a structured action from the model (click, type, scroll, select).
- Applies the action to the real DOM node and loops until the task is complete.
No vision model is required because the model reasons over text rather than pixels, and text inference on a compressed DOM is typically faster and cheaper than image inference on a screenshot. Element targeting uses real DOM references rather than pixel coordinates, so it remains precise even when layouts reflow.
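To make the dehydration idea concrete, here is a minimal sketch of flattening a tree into indexed text lines while keeping a reference back to each node. It is an illustration only: `SketchNode`, `FlatEntry`, `flatten`, and the line format are hypothetical names and shapes, not Page Agent's actual FlatDomTree format or internals.

```typescript
// Hypothetical, simplified stand-in for a DOM node (not Page Agent's real types).
interface SketchNode {
  tag: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: SketchNode[];
}

// One flattened entry: the index the model can refer to, the original node,
// and the compact text line that would be sent to the model.
interface FlatEntry {
  index: number;
  node: SketchNode;
  line: string;
}

// Assumption: only these tags count as interactive in this sketch.
const INTERACTIVE_TAGS = new Set(["input", "select", "textarea", "button", "a"]);

function flatten(root: SketchNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: SketchNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const attrs = Object.entries(node.attrs ?? {})
        .map(([key, value]) => `${key}="${value}"`)
        .join(" ");
      const index = entries.length;
      const open = attrs ? `<${node.tag} ${attrs}>` : `<${node.tag}>`;
      entries.push({ index, node, line: `[${index}] ${open}${node.text ?? ""}` });
    }
    // Depth-first walk, so entries follow document order.
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return entries;
}

// A tiny tree shaped like a task form.
const form: SketchNode = {
  tag: "section",
  attrs: { id: "task-form" },
  children: [
    { tag: "input", attrs: { id: "task-title", placeholder: "Task title" } },
    { tag: "button", attrs: { id: "add-task-btn" }, text: "Add Task" },
  ],
};

const flat = flatten(form);
const prompt = flat.map((entry) => entry.line).join("\n");
```

Because each entry keeps a direct reference to its node, an action such as "click index 1" can be applied to the node itself rather than to a screen position, which is why a layout reflow does not affect targeting in this scheme.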
Step 1: Create the Next.js Project
Scaffold a new Next.js 15 app with TypeScript and Tailwind:
npx create-next-app@latest page-agent-demo --typescript --app --tailwind
cd page-agent-demo
Install Page Agent:
npm install page-agent
Step 2: Configure Environment Variables
Create a .env.local file at the project root:
NEXT_PUBLIC_LLM_API_KEY=your_api_key_here
NEXT_PUBLIC_LLM_BASE_URL=https://api.openai.com/v1
NEXT_PUBLIC_LLM_MODEL=gpt-4o-mini
The NEXT_PUBLIC_ prefix exposes these values to the browser, which is required because Page Agent runs entirely client-side. We will cover how to move the API key server-side later in the CORS proxy section.
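A missing variable would otherwise only show up as a failed request at run time, so it can help to validate the three values once at startup. The helper below is a minimal sketch; `readLlmConfig`, `LlmConfig`, and `PublicEnv` are hypothetical names introduced for illustration and are not part of Page Agent or Next.js.

```typescript
interface LlmConfig {
  model: string;
  baseURL: string;
  apiKey: string;
}

const REQUIRED_VARS = [
  "NEXT_PUBLIC_LLM_API_KEY",
  "NEXT_PUBLIC_LLM_BASE_URL",
  "NEXT_PUBLIC_LLM_MODEL",
] as const;

type EnvName = (typeof REQUIRED_VARS)[number];
type PublicEnv = Partial<Record<EnvName, string | undefined>>;

// Fails fast with the names of every missing variable instead of
// silently falling back to an empty API key.
function readLlmConfig(env: PublicEnv): LlmConfig {
  const value = (name: EnvName): string => (env[name] ?? "").trim();
  const missing = REQUIRED_VARS.filter((name) => value(name) === "");
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    apiKey: value("NEXT_PUBLIC_LLM_API_KEY"),
    // Drop trailing slashes so a path appended later does not contain "//".
    baseURL: value("NEXT_PUBLIC_LLM_BASE_URL").replace(/\/+$/, ""),
    model: value("NEXT_PUBLIC_LLM_MODEL"),
  };
}

// Usage in a client component (each variable referenced explicitly):
// const config = readLlmConfig({
//   NEXT_PUBLIC_LLM_API_KEY: process.env.NEXT_PUBLIC_LLM_API_KEY,
//   NEXT_PUBLIC_LLM_BASE_URL: process.env.NEXT_PUBLIC_LLM_BASE_URL,
//   NEXT_PUBLIC_LLM_MODEL: process.env.NEXT_PUBLIC_LLM_MODEL,
// });
```

The helper takes a plain object rather than reading process.env itself because Next.js inlines NEXT_PUBLIC_ values at build time only where they are referenced statically; a dynamic lookup such as process.env[name] is not replaced in the browser bundle.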
Step 3: Build the Task Dashboard
Create a task management page at app/dashboard/page.tsx. Note the explicit id attributes on each form element — Page Agent's DOM dehydration picks these up as semantic anchors, making targeting more reliable on complex forms.
"use client";
import { useState } from "react";
type Priority = "low" | "medium" | "high";
interface Task {
  id: string;
  title: string;
  dueDate: string;
  priority: Priority;
  done: boolean;
}
export default function Dashboard() {
  const [tasks, setTasks] = useState<Task[]>([]);
  const [title, setTitle] = useState("");
  const [dueDate, setDueDate] = useState("");
  const [priority, setPriority] = useState<Priority>("medium");
  function addTask() {
    if (!title.trim()) return;
    setTasks((prev) => [
      ...prev,
      { id: crypto.randomUUID(), title, dueDate, priority, done: false },
    ]);
    setTitle("");
    setDueDate("");
    setPriority("medium");
  }
  function toggleDone(id: string) {
    setTasks((prev) =>
      prev.map((t) => (t.id === id ? { ...t, done: !t.done } : t))
    );
  }
  return (
    <main className="max-w-2xl mx-auto p-8">
      <h1 className="text-2xl font-bold mb-6">Task Manager</h1>
      <section id="task-form" className="bg-gray-50 p-4 rounded-lg mb-8">
        <input
          id="task-title"
          placeholder="Task title"
          value={title}
          onChange={(e) => setTitle(e.target.value)}
          className="w-full border rounded p-2 mb-2"
        />
        <input
          id="task-due-date"
          type="date"
          value={dueDate}
          onChange={(e) => setDueDate(e.target.value)}
          className="w-full border rounded p-2 mb-2"
        />
        <select
          id="task-priority"
          value={priority}
          onChange={(e) => setPriority(e.target.value as Priority)}
          className="w-full border rounded p-2 mb-2"
        >
          <option value="low">Low</option>
          <option value="medium">Medium</option>
          <option value="high">High</option>
        </select>
        <button
          id="add-task-btn"
          onClick={addTask}
          className="w-full bg-blue-600 text-white rounded p-2"
        >
          Add Task
        </button>
      </section>
      <ul className="space-y-2">
        {tasks.map((task) => (
          <li key={task.id} className="border rounded p-3 flex items-center gap-3">
            <input
              type="checkbox"
              checked={task.done}
              onChange={() => toggleDone(task.id)}
            />
            <div className="flex-1">
              <p className={task.done ? "line-through text-gray-400" : ""}>
                {task.title}
              </p>
              <p className="text-xs text-gray-500">
                {task.dueDate} · {task.priority}
              </p>
            </div>
          </li>
        ))}
      </ul>
    </main>
  );
}
You do not have to add id attributes for Page Agent to work (it can infer elements from placeholder text, ARIA labels, and visible text), but explicit ids help on dense forms.
Step 4: Create the Copilot Component
Create components/Copilot.tsx. The singleton pattern for agentInstance ensures a single Page Agent object is reused across re-renders, preserving internal page state between consecutive commands.
"use client";
import { useState, useCallback } from "react";
import { PageAgent } from "page-agent";
let agentInstance: PageAgent | null = null;
function getAgent(): PageAgent {
  if (!agentInstance) {
    agentInstance = new PageAgent({
      model: process.env.NEXT_PUBLIC_LLM_MODEL ?? "gpt-4o-mini",
      baseURL: process.env.NEXT_PUBLIC_LLM_BASE_URL ?? "https://api.openai.com/v1",
      apiKey: process.env.NEXT_PUBLIC_LLM_API_KEY ?? "",
      language: "en-US",
    });
  }
  return agentInstance;
}
type Status = "idle" | "running" | "done" | "error";
export function Copilot() {
  const [input, setInput] = useState("");
  const [status, setStatus] = useState<Status>("idle");
  const [lastTask, setLastTask] = useState("");
  const runTask = useCallback(async () => {
    const task = input.trim();
    if (!task || status === "running") return;
    setStatus("running");
    setLastTask(task);
    setInput("");
    try {
      const agent = getAgent();
      await agent.execute(task);
      setStatus("done");
    } catch {
      setStatus("error");
    }
  }, [input, status]);
  const statusText =
    status === "idle"
      ? "Ready"
      : status === "running"
      ? "Working…"
      : status === "done"
      ? `Done: ${lastTask}`
      : "Something went wrong — try rephrasing the task";
  return (
    <div className="fixed bottom-4 right-4 w-80 bg-white shadow-xl rounded-xl p-4 border">
      <p className="text-xs font-semibold text-gray-500 mb-2">AI Copilot</p>
      <textarea
        rows={2}
        value={input}
        onChange={(e) => setInput(e.target.value)}
        placeholder="Tell the agent what to do…"
        className="w-full border rounded p-2 text-sm resize-none mb-2"
        onKeyDown={(e) => {
          if (e.key === "Enter" && !e.shiftKey) {
            e.preventDefault();
            runTask();
          }
        }}
      />
      <button
        onClick={runTask}
        disabled={status === "running"}
        className="w-full bg-indigo-600 text-white rounded p-2 text-sm disabled:opacity-50"
      >
        {status === "running" ? "Running…" : "Run"}
      </button>
      <p className="text-xs text-gray-400 mt-2">{statusText}</p>
    </div>
  );
}
Step 5: Mount the Copilot in the Dashboard
Update app/dashboard/page.tsx to import and render the Copilot. Add the import at the top of the file:
import { Copilot } from "@/components/Copilot";
Then add the component after the closing </main> tag inside the return statement:
return (
  <>
    <main className="max-w-2xl mx-auto p-8">
      {/* existing dashboard markup */}
    </main>
    <Copilot />
  </>
);
The copilot floats in the bottom-right corner over the full page.
Step 6: Use a Local Model with Ollama
If you prefer a fully local setup with no cloud API costs, Page Agent is compatible with any OpenAI-compatible server, including Ollama. Install Ollama and pull Qwen 2.5 7B — the model Alibaba's own documentation recommends for Page Agent tasks:
ollama pull qwen2.5:7b
Update your .env.local:
NEXT_PUBLIC_LLM_BASE_URL=http://localhost:11434/v1
NEXT_PUBLIC_LLM_API_KEY=ollama
NEXT_PUBLIC_LLM_MODEL=qwen2.5:7b
Qwen 2.5 7B runs comfortably on 8 GB of RAM, handles form-filling tasks accurately, and costs nothing per token. A frontier model is not required for most Page Agent use cases.
Step 7: Test the Copilot
Start the development server:
npm run dev
Open http://localhost:3000/dashboard. In the copilot panel, try these example commands:
- "Add a task called Fix login bug due 2026-08-01 with high priority" — the agent fills all three form fields and clicks Add Task.
- "Mark all tasks done" — the agent checks every checkbox in sequence.
- "Add three tasks: Deploy API, Write tests, and Update docs, all medium priority" — multi-step execution across three form submissions.
Enable verbose logging by passing verbose: true to the PageAgent constructor to see each step printed in the browser console, including the compressed FlatDomTree sent to the model.
Step 8: Restrict the Agent's Scope
In production you may want to confine the agent to a specific section of the page to prevent accidental interactions with navigation or account settings. Pass a context DOM element:
const formEl = document.getElementById("task-form");
await agent.execute("Fill in the form with the task details", {
  context: formEl ?? undefined,
});
When a context element is provided, the FlatDomTree covers only that element's subtree. This reduces token usage and narrows the agent's field of action.
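One detail worth guarding against: if the scope element is not found and `undefined` is passed as the context, the agent presumably falls back to the whole page, which defeats the restriction. A minimal sketch of a stricter lookup follows; `requireScope` and `ElementLookup` are hypothetical names, and the `context` option is used only as the tutorial describes it.

```typescript
// Minimal structural type so the helper works with the browser's `document`
// and with a fake lookup in tests. Hypothetical name, for illustration only.
interface ElementLookup<T> {
  getElementById(id: string): T | null;
}

// Returns the scope element or throws, so a missing element never turns
// into an unscoped, whole-page run.
function requireScope<T>(doc: ElementLookup<T>, id: string): T {
  const el = doc.getElementById(id);
  if (el === null) {
    throw new Error(`Scope element #${id} not found; refusing to run unscoped`);
  }
  return el;
}

// Usage in the browser (assumes the `context` option shown in this step):
// await agent.execute("Fill in the form with the task details", {
//   context: requireScope(document, "task-form"),
// });
```

Throwing here is a design choice for production use; in development, logging a warning and continuing may be more convenient.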
Step 9: Chain Multi-Step Workflows
Page Agent supports multi-step reasoning out of the box — the agent loops (act, observe, act) until it judges the task complete. For longer workflows you can also chain execute calls programmatically in your application code:
const agent = getAgent();
async function onboardNewUser(name: string, role: string) {
  await agent.execute(`Navigate to the Users section`);
  await agent.execute(`Click the Invite New User button`);
  await agent.execute(`Fill in the name field with "${name}" and set role to "${role}"`);
  await agent.execute(`Click Send Invitation`);
}
Each execute call inherits the agent's current view of the page state. You do not re-describe the DOM between steps; the agent re-reads it automatically at the start of each action.
You can also expose a sequence of tasks through the copilot UI by parsing a simple numbered list:
async function runSequence(tasks: string[]) {
  const agent = getAgent();
  for (const task of tasks) {
    await agent.execute(task);
  }
}
// Usage: pass ["Step 1: ...", "Step 2: ..."] from the textarea
This pattern is useful for wizard-style flows where each step must complete before the next one starts.
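The parsing half of that idea is not shown above, so here is one possible sketch: split the textarea content into steps, strip list markers, and stop at the first step that fails. `parseSteps`, `runSteps`, `AgentLike`, and `SequenceResult` are hypothetical names; the only assumption about the agent is that `execute` returns a promise that rejects on failure, as the Copilot component's try/catch implies.

```typescript
// Anything with an execute(task) method that returns a promise.
interface AgentLike {
  execute(task: string): Promise<unknown>;
}

interface SequenceResult {
  completed: number;   // number of steps that finished
  failedStep?: string; // the step that rejected, if any
}

// Turns "1. Do this\n2) Do that\n- And this" into clean step strings.
function parseSteps(text: string): string[] {
  return text
    .split("\n")
    .map((line) => line.replace(/^\s*(?:\d+[.)]|[-*])\s+/, "").trim())
    .filter((line) => line.length > 0);
}

// Runs steps strictly in order and stops at the first failure,
// since later steps in a wizard usually depend on earlier ones.
async function runSteps(agent: AgentLike, steps: string[]): Promise<SequenceResult> {
  let completed = 0;
  for (const step of steps) {
    try {
      await agent.execute(step);
    } catch {
      return { completed, failedStep: step };
    }
    completed += 1;
  }
  return { completed };
}

// Usage: const result = await runSteps(getAgent(), parseSteps(input));
```

Returning the failed step, rather than rethrowing, lets the copilot panel tell the user exactly where a sequence stopped.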
Step 10: Add a Server-Side CORS Proxy
Exposing your API key in NEXT_PUBLIC_ environment variables is acceptable in development but is a security risk in production, because anyone can read the key from the shipped JavaScript bundle. Some cloud APIs also reject direct browser requests with CORS errors. The solution is a Next.js route handler that proxies the call server-side.
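A proxy that forwards any request body effectively lends your API key to anyone who can reach the route, so it can be worth restricting what gets through. The sketch below is one possible check; `validateChatRequest`, `ChatRequest`, and `ALLOWED_MODELS` are hypothetical names, and the body shape assumed is the usual OpenAI-style chat completion request.

```typescript
// Assumed request shape: OpenAI-style chat completion body.
interface ChatRequest {
  model: string;
  messages: unknown[];
  [key: string]: unknown;
}

// Assumption: only the model configured for the copilot is allowed.
const ALLOWED_MODELS = new Set(["gpt-4o-mini"]);

// Throws on anything that does not look like an allowed chat request.
function validateChatRequest(body: unknown): ChatRequest {
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    throw new Error("Body must be a JSON object");
  }
  const candidate = body as Record<string, unknown>;
  const { model, messages } = candidate;
  if (typeof model !== "string" || !ALLOWED_MODELS.has(model)) {
    throw new Error("Model not allowed");
  }
  if (!Array.isArray(messages) || messages.length === 0) {
    throw new Error("messages must be a non-empty array");
  }
  return { ...candidate, model, messages };
}

// Usage inside a route handler, before forwarding upstream:
// const body = validateChatRequest(await req.json());
```

Separately, if the client appends /chat/completions to its baseURL, as OpenAI-compatible clients commonly do, the handler may need to live at app/api/llm/chat/completions/route.ts rather than app/api/llm/route.ts. That depends on how Page Agent builds its request URL, which is worth confirming in the browser's network tab.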
Create app/api/llm/route.ts:
import { NextRequest, NextResponse } from "next/server";
export async function POST(req: NextRequest) {
  const body = await req.json();
  const res = await fetch(
    `${process.env.LLM_BASE_URL}/chat/completions`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.LLM_API_KEY}`,
      },
      body: JSON.stringify(body),
    }
  );
  const data = await res.json();
  // Pass the upstream status through so errors (401, 429, ...) are not reported as 200.
  return NextResponse.json(data, { status: res.status });
}
Move the real credentials to server-only variables (no NEXT_PUBLIC_ prefix) in .env.local:
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=your_real_key_here
Then point the Page Agent client at the proxy:
NEXT_PUBLIC_LLM_BASE_URL=/api/llm
NEXT_PUBLIC_LLM_API_KEY=proxy
NEXT_PUBLIC_LLM_MODEL=gpt-4o-mini
Troubleshooting
The agent does not find my input. Add a visible label, placeholder, or id to the element. Page Agent reads semantic text; an unlabeled input with no placeholder is invisible to it.
Actions fire but React state does not update. Page Agent dispatches native browser events (click, input, change). If your components rely on non-standard event handling, ensure they respond to native DOM events.
Token costs are high. Switch to Qwen 2.5 7B via Ollama for free local inference, or use the context option to reduce FlatDomTree size by scoping the agent to a single section.
"Could not complete the task" error. Rephrase the command to be more specific. "Click the blue Add Task button" is more reliable than "submit the form" if the page has multiple submit elements.
Next Steps
- Cross-tab automation: install the optional Page Agent Chrome extension to coordinate actions across multiple browser tabs.
- Voice input: pipe Web Speech API transcriptions directly to agent.execute for a hands-free copilot.
- Chain with Vercel AI SDK: use a chat interface for conversational responses and fall back to agent.execute for UI manipulation.
- Explore @page-agent/core for lower-level control over FlatDomTree generation and action dispatch without the built-in UI panel.
Conclusion
Alibaba's Page Agent eliminates the infrastructure overhead that has historically made in-app AI copilots expensive to build. By reading the DOM instead of taking screenshots, it achieves precise, model-agnostic control in a package weighing under 50 kB with no backend process required. In under 30 minutes, you have a natural-language copilot running inside a real Next.js app — one that works with any OpenAI-compatible model, from GPT-4o-mini to a locally-hosted Qwen 2.5.