Build an AI-Powered Web Scraper with Playwright and Claude API in TypeScript

Traditional web scrapers break every time a website updates its HTML. You write CSS selectors, the site changes a class name, and your pipeline is dead. Sound familiar?
There's a better way. Instead of telling your scraper exactly where the data lives, you let an AI model understand the page and extract what you need — structured, clean, and resilient to layout changes.
In this tutorial, you'll build a production-ready web scraper that combines Playwright for headless browser automation with Anthropic's Claude API for intelligent data extraction. By the end, you'll have a TypeScript CLI tool that can scrape any website and return structured JSON — no matter how the HTML is laid out.
What You'll Build
A CLI tool called ai-scraper that:
- Navigates to any URL using a real headless browser (handles JavaScript-rendered content)
- Captures the page content and optionally takes screenshots
- Sends the content to Claude with a structured extraction prompt
- Returns clean, typed JSON data
- Supports pagination and multi-page scraping
- Includes retry logic and rate limiting
Prerequisites
Before starting, make sure you have:
- Node.js 20+ installed (check with `node --version`)
- TypeScript basics (types, interfaces, async/await)
- Anthropic API key — get one at console.anthropic.com
- Basic familiarity with the command line
- About 30 minutes of focused time
Step 1: Initialize the Project
Create a new directory and initialize the project:
mkdir ai-scraper && cd ai-scraper
npm init -y

Install the dependencies:
npm install playwright @anthropic-ai/sdk zod commander dotenv
npm install -D typescript @types/node tsx

Here's what each package does:
| Package | Purpose |
|---|---|
| `playwright` | Headless browser automation |
| `@anthropic-ai/sdk` | Claude API client |
| `zod` | Runtime schema validation |
| `commander` | CLI argument parsing |
| `dotenv` | Environment variable loading |
| `tsx` | TypeScript execution without compilation |
Install the Playwright browsers:
npx playwright install chromium

Initialize TypeScript:
npx tsc --init

Update your tsconfig.json:
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"strict": true,
"esModuleInterop": true,
"outDir": "./dist",
"rootDir": "./src",
"declaration": true,
"resolveJsonModule": true,
"skipLibCheck": true
},
"include": ["src/**/*"]
}

Create the project structure:
mkdir -p src/{extractors,utils}
touch .env src/index.ts src/scraper.ts src/ai-extractor.ts

Step 2: Configure Environment Variables
Add your Anthropic API key to .env:
ANTHROPIC_API_KEY=sk-ant-your-key-here
MAX_RETRIES=3
RATE_LIMIT_MS=1000

⚠️ Never commit your `.env` file. Add it to `.gitignore` immediately.
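If you haven't already, a one-liner takes care of that (assuming you're running it from the project root):

```shell
echo ".env" >> .gitignore
```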
Step 3: Define the Types
Create src/types.ts — the backbone of your type-safe scraper:
import { z } from "zod";
// Schema for a scraped item — customize per use case
export const ScrapedItemSchema = z.object({
title: z.string(),
description: z.string().optional(),
price: z.string().optional(),
url: z.string().url().optional(),
imageUrl: z.string().url().optional(),
metadata: z.record(z.string()).optional(),
});
export type ScrapedItem = z.infer<typeof ScrapedItemSchema>;
// Schema for the full extraction result
export const ExtractionResultSchema = z.object({
items: z.array(ScrapedItemSchema),
totalFound: z.number(),
pageInfo: z.object({
title: z.string(),
url: z.string(),
scrapedAt: z.string(),
}),
});
export type ExtractionResult = z.infer<typeof ExtractionResultSchema>;
// Configuration for the scraper
export interface ScraperConfig {
url: string;
prompt: string;
schema?: z.ZodSchema;
waitForSelector?: string;
maxPages?: number;
screenshot?: boolean;
timeout?: number;
}
// Browser page content
export interface PageContent {
html: string;
text: string;
url: string;
title: string;
screenshot?: Buffer;
}

Zod gives you runtime validation. This is critical — Claude's output is a string that should be JSON, but you need to verify the structure before trusting it.
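To see why, here's a dependency-free sketch of the kind of structural check Zod automates for you: a payload that parses cleanly as JSON but still fails the schema.

```typescript
// JSON.parse returns `any`, so a wrong-typed field sails past the compiler.
const raw = '{"title": 42}'; // the model returned a number where a string belongs

const parsed: unknown = JSON.parse(raw);

// A hand-rolled structural check, the kind Zod generates from a schema.
function hasStringTitle(value: unknown): value is { title: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as Record<string, unknown>).title === "string"
  );
}

console.log(hasStringTitle(parsed)); // false: the bad payload is rejected
```

With Zod you get the same guarantee, plus precise per-field error messages, without writing these guards by hand.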
Step 4: Build the Browser Automation Layer
Create src/scraper.ts:
import { chromium, Browser, Page } from "playwright";
import type { PageContent, ScraperConfig } from "./types";
export class BrowserScraper {
private browser: Browser | null = null;
async launch(): Promise<void> {
this.browser = await chromium.launch({
headless: true,
args: [
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
],
});
}
async scrape(config: ScraperConfig): Promise<PageContent> {
if (!this.browser) {
throw new Error("Browser not launched. Call launch() first.");
}
const context = await this.browser.newContext({
userAgent:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/122.0.0.0 Safari/537.36",
viewport: { width: 1280, height: 720 },
});
const page = await context.newPage();
try {
// Navigate with timeout
await page.goto(config.url, {
waitUntil: "networkidle",
timeout: config.timeout || 30000,
});
// Wait for a specific selector if provided
if (config.waitForSelector) {
await page.waitForSelector(config.waitForSelector, {
timeout: 10000,
});
}
// Auto-scroll to trigger lazy-loaded content
await this.autoScroll(page);
// Extract content
const content = await this.extractContent(page);
// Optional screenshot
let screenshot: Buffer | undefined;
if (config.screenshot) {
screenshot = await page.screenshot({
fullPage: true,
type: "png",
});
}
return {
...content,
screenshot,
};
} finally {
await context.close();
}
}
private async autoScroll(page: Page): Promise<void> {
await page.evaluate(async () => {
await new Promise<void>((resolve) => {
let totalHeight = 0;
const distance = 300;
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
// Safety timeout: stop scrolling after 10 seconds
setTimeout(() => {
clearInterval(timer);
resolve();
}, 10000);
});
});
}
private async extractContent(page: Page): Promise<Omit<PageContent, "screenshot">> {
const title = await page.title();
const url = page.url();
// Get clean text content (strips scripts, styles, hidden elements)
const text = await page.evaluate(() => {
const scripts = document.querySelectorAll(
"script, style, noscript, iframe"
);
scripts.forEach((el) => el.remove());
return document.body.innerText || "";
});
// Get the raw HTML for structure analysis
const html = await page.evaluate(() => {
return document.body.innerHTML;
});
return { html, text, url, title };
}
async close(): Promise<void> {
if (this.browser) {
await this.browser.close();
this.browser = null;
}
}
}

Key design decisions:
- `networkidle` wait strategy — ensures JavaScript-rendered content is loaded
- Auto-scroll — triggers lazy-loaded content (common on modern sites)
- Clean text extraction — removes scripts and styles before sending to Claude
- Context isolation — each scrape gets a fresh browser context (clean cookies, no cross-contamination)
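As a sanity check on the auto-scroll timing, here's pure arithmetic mirroring the in-page loop (one 300 px step per 100 ms tick, capped by the 10 s safety timeout); the page heights are hypothetical examples:

```typescript
// How long the auto-scroll loop runs for a given page height:
// one 300 px step per 100 ms tick, hard-capped at 10 s by the safety timeout.
function scrollDurationMs(scrollHeight: number, distance = 300, tickMs = 100): number {
  const steps = Math.ceil(scrollHeight / distance);
  return Math.min(steps * tickMs, 10_000);
}

console.log(scrollDurationMs(2400)); // 800
console.log(scrollDurationMs(60_000)); // 10000: the safety timeout kicks in
```

If your target pages are extremely tall, raise the safety timeout or the scroll distance accordingly.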
Step 5: Build the AI Extraction Layer
This is where the magic happens. Create src/ai-extractor.ts:
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import type { PageContent, ExtractionResult } from "./types";
import { ExtractionResultSchema } from "./types";
export class AIExtractor {
private client: Anthropic;
private model = "claude-sonnet-4-20250514";
constructor(apiKey: string) {
this.client = new Anthropic({ apiKey });
}
async extract(
content: PageContent,
userPrompt: string,
schema?: z.ZodSchema
): Promise<ExtractionResult> {
const systemPrompt = `You are a precise data extraction assistant. Your job is to analyze web page content and extract structured data based on the user's request.
Rules:
1. Return ONLY valid JSON — no markdown, no explanations, no code fences.
2. Extract ALL matching items from the page content.
3. If a field is not found, use null instead of making up data.
4. URLs should be absolute (resolve relative URLs using the page URL).
5. Clean up text: remove extra whitespace, fix encoding issues.
6. Be thorough — scan the entire content, not just the visible portion.`;
const userMessage = `
Page URL: ${content.url}
Page Title: ${content.title}
--- PAGE CONTENT START ---
${this.truncateContent(content.text, 80000)}
--- PAGE CONTENT END ---
--- HTML STRUCTURE (first 20000 chars) ---
${this.truncateContent(content.html, 20000)}
--- HTML STRUCTURE END ---
EXTRACTION REQUEST: ${userPrompt}
Return the data as JSON matching this exact structure:
{
"items": [
{
"title": "string",
"description": "string or null",
"price": "string or null",
"url": "absolute URL or null",
"imageUrl": "absolute URL or null",
"metadata": { "key": "value" }
}
],
"totalFound": number,
"pageInfo": {
"title": "page title",
"url": "page url",
"scrapedAt": "ISO date string"
}
}`;
const response = await this.client.messages.create({
model: this.model,
max_tokens: 4096,
system: systemPrompt,
messages: [{ role: "user", content: userMessage }],
});
const responseText = response.content
.filter((block) => block.type === "text")
.map((block) => block.text)
.join("");
return this.parseResponse(responseText, schema);
}
private parseResponse(
text: string,
schema?: z.ZodSchema
): ExtractionResult {
let cleaned = text.trim();
// Remove markdown code fences if present
if (cleaned.startsWith("```")) {
cleaned = cleaned.replace(/^```(?:json)?\n?/, "").replace(/\n?```$/, "");
}
let parsed: unknown;
try {
parsed = JSON.parse(cleaned);
} catch (error) {
throw new Error(
`Failed to parse AI response as JSON: ${(error as Error).message}\n` +
`Response preview: ${cleaned.substring(0, 200)}`
);
}
const validationSchema = schema || ExtractionResultSchema;
const result = validationSchema.safeParse(parsed);
if (!result.success) {
throw new Error(
`AI response validation failed:\n${result.error.issues
.map((i) => ` - ${i.path.join(".")}: ${i.message}`)
.join("\n")}`
);
}
return result.data as ExtractionResult;
}
private truncateContent(text: string, maxChars: number): string {
if (text.length <= maxChars) return text;
return (
text.substring(0, maxChars) +
`\n\n[... truncated ${text.length - maxChars} characters ...]`
);
}
}

🚀 Need help implementing AI-powered automation for your business? Noqta builds intelligent solutions for teams who want results, not experiments.
A few things worth noting:
- Content truncation — Claude has a large context window, but we're still smart about token usage. Text gets 80K chars, HTML gets 20K for structural hints.
- Zod validation — the AI's response is validated against a schema. If Claude returns unexpected structure, you get a clear error instead of silent corruption.
- Response cleaning — sometimes Claude wraps JSON in markdown code fences despite being told not to. We handle that gracefully.
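The cleaning step is easy to verify in isolation. This standalone version mirrors the fence-stripping logic in `parseResponse`:

````typescript
// Strip optional markdown code fences before handing the text to JSON.parse.
function stripFences(text: string): string {
  let cleaned = text.trim();
  if (cleaned.startsWith("```")) {
    cleaned = cleaned.replace(/^```(?:json)?\n?/, "").replace(/\n?```$/, "");
  }
  return cleaned;
}

const fenced = '```json\n{"items": []}\n```';
console.log(stripFences(fenced)); // {"items": []}
console.log(stripFences('{"a":1}')); // {"a":1} (unfenced input passes through)
````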
Step 6: Add Retry Logic and Rate Limiting
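Transient failures (network hiccups, API rate limits) call for exponential backoff. Ignoring jitter for a moment, the schedule we want looks like this:

```typescript
// Exponential backoff schedule: base * 2^attempt, capped at a maximum.
// Parameters match the defaults used in retry.ts below: 1 s base, 10 s cap.
const baseDelayMs = 1000;
const maxDelayMs = 10000;

const schedule = [0, 1, 2, 3, 4].map((attempt) =>
  Math.min(baseDelayMs * Math.pow(2, attempt), maxDelayMs)
);

console.log(schedule); // [ 1000, 2000, 4000, 8000, 10000 ]
```

The real implementation also adds random jitter so that concurrent scrapers don't retry in lockstep.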
Create src/utils/retry.ts:
export interface RetryConfig {
maxRetries: number;
baseDelayMs: number;
maxDelayMs: number;
}
const DEFAULT_CONFIG: RetryConfig = {
maxRetries: 3,
baseDelayMs: 1000,
maxDelayMs: 10000,
};
export async function withRetry<T>(
fn: () => Promise<T>,
config: Partial<RetryConfig> = {}
): Promise<T> {
const { maxRetries, baseDelayMs, maxDelayMs } = {
...DEFAULT_CONFIG,
...config,
};
let lastError: Error | undefined;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (attempt === maxRetries) break;
const delay = Math.min(
baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000,
maxDelayMs
);
console.warn(
`Attempt ${attempt + 1} failed: ${lastError.message}. ` +
`Retrying in ${Math.round(delay)}ms...`
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}

Create src/utils/rate-limiter.ts:
export class RateLimiter {
private lastCall = 0;
constructor(private minIntervalMs: number) {}
async wait(): Promise<void> {
const now = Date.now();
const elapsed = now - this.lastCall;
if (elapsed < this.minIntervalMs) {
const waitTime = this.minIntervalMs - elapsed;
await new Promise((resolve) => setTimeout(resolve, waitTime));
}
this.lastCall = Date.now();
}
}

Step 7: Build the Multi-Page Scraper
Create src/multi-page-scraper.ts — handles pagination:
import { BrowserScraper } from "./scraper";
import { AIExtractor } from "./ai-extractor";
import { RateLimiter } from "./utils/rate-limiter";
import { withRetry } from "./utils/retry";
import type { ScraperConfig, ExtractionResult, ScrapedItem } from "./types";
export class MultiPageScraper {
private browserScraper: BrowserScraper;
private aiExtractor: AIExtractor;
private rateLimiter: RateLimiter;
constructor(apiKey: string, rateLimitMs = 1000) {
this.browserScraper = new BrowserScraper();
this.aiExtractor = new AIExtractor(apiKey);
this.rateLimiter = new RateLimiter(rateLimitMs);
}
async scrapeMultiplePages(
configs: ScraperConfig[]
): Promise<ExtractionResult> {
await this.browserScraper.launch();
const allItems: ScrapedItem[] = [];
let totalFound = 0;
try {
for (const [index, config] of configs.entries()) {
console.log(
`\n📄 Scraping page ${index + 1}/${configs.length}: ${config.url}`
);
await this.rateLimiter.wait();
const result = await withRetry(async () => {
const content = await this.browserScraper.scrape(config);
console.log(
` ✓ Page loaded (${content.text.length} chars text)`
);
const extraction = await this.aiExtractor.extract(
content,
config.prompt,
config.schema
);
console.log(
` ✓ Extracted ${extraction.items.length} items`
);
return extraction;
});
allItems.push(...result.items);
totalFound += result.totalFound;
}
} finally {
await this.browserScraper.close();
}
return {
items: allItems,
totalFound,
pageInfo: {
title: `Multi-page scrape (${configs.length} pages)`,
url: configs[0]?.url || "",
scrapedAt: new Date().toISOString(),
},
};
}
}

Step 8: Create the CLI Interface
Create src/index.ts:
import "dotenv/config";
import { Command } from "commander";
import { BrowserScraper } from "./scraper";
import { AIExtractor } from "./ai-extractor";
import { MultiPageScraper } from "./multi-page-scraper";
import { withRetry } from "./utils/retry";
import { writeFileSync } from "fs";
const program = new Command();
program
.name("ai-scraper")
.description("AI-powered web scraper using Playwright and Claude")
.version("1.0.0");
program
.command("scrape")
.description("Scrape a single URL")
.requiredOption("-u, --url <url>", "URL to scrape")
.requiredOption("-p, --prompt <prompt>", "What to extract")
.option("-o, --output <file>", "Output JSON file")
.option("-s, --screenshot", "Take a full-page screenshot")
.option("--wait <selector>", "CSS selector to wait for")
.option("--timeout <ms>", "Navigation timeout in ms", "30000")
.action(async (options) => {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
console.error("❌ ANTHROPIC_API_KEY not set in .env");
process.exit(1);
}
const scraper = new BrowserScraper();
const extractor = new AIExtractor(apiKey);
try {
console.log(`🌐 Navigating to ${options.url}...`);
await scraper.launch();
const content = await withRetry(() =>
scraper.scrape({
url: options.url,
prompt: options.prompt,
waitForSelector: options.wait,
screenshot: options.screenshot,
timeout: parseInt(options.timeout),
})
);
console.log(`📝 Page loaded: "${content.title}"`);
console.log(` Text: ${content.text.length} chars`);
console.log(` HTML: ${content.html.length} chars`);
if (content.screenshot) {
const screenshotPath = "screenshot.png";
writeFileSync(screenshotPath, content.screenshot);
console.log(`📸 Screenshot saved: ${screenshotPath}`);
}
console.log(`\n🤖 Sending to Claude for extraction...`);
const result = await withRetry(() =>
extractor.extract(content, options.prompt)
);
console.log(`✅ Extracted ${result.items.length} items\n`);
const output = JSON.stringify(result, null, 2);
if (options.output) {
writeFileSync(options.output, output);
console.log(`💾 Saved to ${options.output}`);
} else {
console.log(output);
}
} catch (error) {
console.error(`❌ Error: ${(error as Error).message}`);
process.exit(1);
} finally {
await scraper.close();
}
});
program
.command("multi")
.description("Scrape multiple URLs")
.requiredOption("-u, --urls <urls...>", "URLs to scrape (space-separated)")
.requiredOption("-p, --prompt <prompt>", "What to extract")
.option("-o, --output <file>", "Output JSON file")
.action(async (options) => {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
console.error("❌ ANTHROPIC_API_KEY not set in .env");
process.exit(1);
}
const multiScraper = new MultiPageScraper(
apiKey,
parseInt(process.env.RATE_LIMIT_MS || "1000")
);
try {
const configs = options.urls.map((url: string) => ({
url,
prompt: options.prompt,
}));
const result = await multiScraper.scrapeMultiplePages(configs);
const output = JSON.stringify(result, null, 2);
if (options.output) {
writeFileSync(options.output, output);
console.log(`\n💾 Saved ${result.items.length} items to ${options.output}`);
} else {
console.log(output);
}
} catch (error) {
console.error(`❌ Error: ${(error as Error).message}`);
process.exit(1);
}
});
program.parse();

Step 9: Add the Package Scripts
Update your package.json:
{
"type": "module",
"scripts": {
"scrape": "tsx src/index.ts scrape",
"multi": "tsx src/index.ts multi",
"build": "tsc",
"start": "node dist/index.js"
}
}

Step 10: Test It — Real-World Examples
Example 1: Scrape product listings
npx tsx src/index.ts scrape \
-u "https://books.toscrape.com" \
-p "Extract all book titles, prices, ratings, and availability" \
-o books.json

Expected output:
{
"items": [
{
"title": "A Light in the Attic",
"price": "£51.77",
"metadata": {
"rating": "Three",
"availability": "In stock"
}
},
{
"title": "Tipping the Velvet",
"price": "£53.74",
"metadata": {
"rating": "One",
"availability": "In stock"
}
}
],
"totalFound": 20,
"pageInfo": {
"title": "All products | Books to Scrape",
"url": "https://books.toscrape.com",
"scrapedAt": "2026-03-13T12:00:00.000Z"
}
}

Example 2: Scrape job listings
npx tsx src/index.ts scrape \
-u "https://news.ycombinator.com/jobs" \
-p "Extract all job postings: company name, job title, posting date, and link" \
-o hn-jobs.json

Example 3: Multi-page scrape
npx tsx src/index.ts multi \
-u "https://books.toscrape.com/catalogue/page-1.html" \
"https://books.toscrape.com/catalogue/page-2.html" \
"https://books.toscrape.com/catalogue/page-3.html" \
-p "Extract all book titles and prices" \
-o all-books.json

Step 11: Add Custom Extraction Schemas
The real power comes from custom schemas. Create src/extractors/product-extractor.ts:
import { z } from "zod";
export const ProductSchema = z.object({
items: z.array(
z.object({
name: z.string(),
price: z.object({
amount: z.number(),
currency: z.string(),
}),
rating: z.number().min(0).max(5).optional(),
reviewCount: z.number().optional(),
availability: z.enum(["in_stock", "out_of_stock", "preorder"]),
sku: z.string().optional(),
brand: z.string().optional(),
category: z.string().optional(),
url: z.string().url(),
imageUrl: z.string().url().optional(),
})
),
totalFound: z.number(),
pageInfo: z.object({
title: z.string(),
url: z.string(),
scrapedAt: z.string(),
}),
});
export type ProductResult = z.infer<typeof ProductSchema>;

Use it in your scrape call:
import { ProductSchema } from "./extractors/product-extractor";
const result = await extractor.extract(content, userPrompt, ProductSchema);
// result is now fully typed with ProductResult shape

💡 Ready to go from reading to building? Talk to our team about implementing custom automation and API integrations for your business.
Tips and Best Practices
✅ Do
- Use `networkidle` for JavaScript-heavy sites — it waits for all network requests to settle
- Validate with Zod — never trust raw AI output in production
- Implement rate limiting — be a good citizen; don't hammer target servers
- Cache results — save raw HTML alongside extracted data for debugging
- Use screenshots for debugging extraction issues
⚠️ Watch Out For
- Robots.txt — always check and respect a site's crawling rules
- Terms of Service — ensure scraping is permitted
- Rate limits — both for target sites AND your Anthropic API
- Token costs — large pages mean more tokens. Monitor your API usage
- Dynamic content — some sites require specific interactions (clicks, scrolls) before content appears
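On the robots.txt point: in production, reach for a dedicated parser, but even a deliberately naive check beats no check at all. This sketch intentionally ignores user-agent groups, wildcards, and Allow directives:

```typescript
// Naive robots.txt check: collects every Disallow rule regardless of
// user-agent group. Ignores wildcards and Allow directives on purpose.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  const disallowed = robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((rule) => rule.length > 0);
  return !disallowed.some((rule) => path.startsWith(rule));
}

const robots = "User-agent: *\nDisallow: /admin/\nDisallow: /private/";
console.log(isPathAllowed(robots, "/catalogue/page-1.html")); // true
console.log(isPathAllowed(robots, "/admin/login")); // false
```

Because it treats every Disallow rule as applying to you, it can only be overly conservative, which is the safe direction to err in.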
❌ Don't
- Scrape personal data without consent
- Bypass authentication mechanisms
- Ignore rate limits or `robots.txt`
- Use this for spam or automated content generation without attribution
Architecture Overview
Here's how the components fit together:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ CLI Layer │────▶│ Playwright │────▶│ Target Site │
│ (Commander) │ │ (Browser) │ │ │
└──────┬───────┘ └──────┬───────┘ └──────────────┘
│ │
│ PageContent
│ (html + text)
│ │
│ ┌──────▼───────┐ ┌──────────────┐
│ │ AI Extractor │────▶│ Claude API │
│ │ (Anthropic) │ │ │
│ └──────┬───────┘ └──────────────┘
│ │
│ Structured JSON
│ (Zod-validated)
│ │
▼ ▼
┌──────────────────────────────┐
│ Output (stdout or .json) │
└──────────────────────────────┘
Going Further
Here are some ideas to extend this project:
- Add a caching layer — store raw HTML in SQLite so you can re-extract without re-scraping
- Build a web UI — wrap the CLI in a Next.js app with a form interface
- Schedule scrapes — use cron jobs or a task queue to scrape on a schedule
- Add proxy support — rotate proxies for large-scale scraping
- Stream results — use Claude's streaming API for real-time extraction feedback
- Vision extraction — send screenshots to Claude's vision capability for layout-dependent data
Summary
You've built a production-ready AI-powered web scraper that:
- Uses Playwright for reliable browser automation (handles JS-rendered content)
- Leverages Claude for intelligent, schema-free data extraction
- Validates output with Zod for type safety
- Supports multi-page scraping with rate limiting
- Includes retry logic with exponential backoff
The key insight: instead of writing fragile CSS selectors that break on every site update, you describe what you want in natural language and let AI figure out where it lives. This approach is more resilient, more flexible, and surprisingly more accurate than traditional scraping.
The full source code is available to adapt for your specific use cases. Happy scraping — and remember to be ethical about it.
Building automation tools is one thing. Building them right — resilient, type-safe, production-ready — is another. At Noqta, we help teams implement AI-powered solutions that actually work in production. Let's talk about your next project.