Build an AI-Powered Web Scraper with Playwright and Claude API in TypeScript

Traditional web scrapers break every time a website updates its HTML. You write CSS selectors, the site changes a class name, and your pipeline is dead. Sound familiar?
There's a better way. Instead of telling your scraper exactly where the data lives, you let an AI model understand the page and extract what you need — structured, clean, and resilient to layout changes.
In this tutorial, you'll build a production-ready web scraper that combines Playwright for headless browser automation with Anthropic's Claude API for intelligent data extraction. By the end, you'll have a TypeScript CLI tool that can scrape any website and return structured JSON — no matter how the HTML is laid out.
What You'll Build
A CLI tool called ai-scraper that:
- Navigates to any URL using a real headless browser (handles JavaScript-rendered content)
- Captures the page content and optionally takes screenshots
- Sends the content to Claude with a structured extraction prompt
- Returns clean, typed JSON data
- Supports pagination and multi-page scraping
- Includes retry logic and rate limiting
Prerequisites
Before starting, make sure you have:
- Node.js 20+ installed (check with `node --version`)
- TypeScript basics (types, interfaces, async/await)
- Anthropic API key — get one at console.anthropic.com
- Basic familiarity with the command line
- About 30 minutes of focused time
Step 1: Initialize the Project
Create a new directory and initialize the project:
mkdir ai-scraper && cd ai-scraper
npm init -y

Install the dependencies:
npm install playwright @anthropic-ai/sdk zod commander dotenv
npm install -D typescript @types/node tsx

Here's what each package does:
| Package | Purpose |
|---|---|
| `playwright` | Headless browser automation |
| `@anthropic-ai/sdk` | Claude API client |
| `zod` | Runtime schema validation |
| `commander` | CLI argument parsing |
| `dotenv` | Environment variable loading |
| `tsx` | TypeScript execution without compilation |
Install the Playwright browsers:
npx playwright install chromium

Initialize TypeScript:
npx tsc --init

Update your tsconfig.json:
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"strict": true,
"esModuleInterop": true,
"outDir": "./dist",
"rootDir": "./src",
"declaration": true,
"resolveJsonModule": true,
"skipLibCheck": true
},
"include": ["src/**/*"]
}

Create the project structure:
mkdir -p src/{extractors,utils}
touch .env src/index.ts src/scraper.ts src/ai-extractor.ts

Step 2: Configure Environment Variables
Add your Anthropic API key to .env:
ANTHROPIC_API_KEY=sk-ant-your-key-here
MAX_RETRIES=3
RATE_LIMIT_MS=1000

⚠️ Never commit your `.env` file. Add it to `.gitignore` immediately.
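If you haven't already, a one-liner takes care of that (assuming you're running it from the project root):

```shell
echo ".env" >> .gitignore
```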
Step 3: Define the Types
Create src/types.ts — the backbone of your type-safe scraper:
import { z } from "zod";
// Schema for a scraped item — customize per use case
export const ScrapedItemSchema = z.object({
title: z.string(),
description: z.string().optional(),
price: z.string().optional(),
url: z.string().url().optional(),
imageUrl: z.string().url().optional(),
metadata: z.record(z.string()).optional(),
});
export type ScrapedItem = z.infer<typeof ScrapedItemSchema>;
// Schema for the full extraction result
export const ExtractionResultSchema = z.object({
items: z.array(ScrapedItemSchema),
totalFound: z.number(),
pageInfo: z.object({
title: z.string(),
url: z.string(),
scrapedAt: z.string(),
}),
});
export type ExtractionResult = z.infer<typeof ExtractionResultSchema>;
// Configuration for the scraper
export interface ScraperConfig {
url: string;
prompt: string;
schema?: z.ZodSchema;
waitForSelector?: string;
maxPages?: number;
screenshot?: boolean;
timeout?: number;
}
// Browser page content
export interface PageContent {
html: string;
text: string;
url: string;
title: string;
screenshot?: Buffer;
}

Zod gives you runtime validation. This is critical — Claude's output is a string that should be JSON, but you need to verify the structure before trusting it.
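To see why, here's a dependency-free sketch of the kind of structural check Zod automates for you: a payload that parses cleanly as JSON but still fails the schema.

```typescript
// JSON.parse returns `any`, so a wrong-typed field sails past the compiler.
const raw = '{"title": 42}'; // the model returned a number where a string belongs

const parsed: unknown = JSON.parse(raw);

// A hand-rolled structural check, the kind Zod generates from a schema.
function hasStringTitle(value: unknown): value is { title: string } {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as Record<string, unknown>).title === "string"
  );
}

console.log(hasStringTitle(parsed)); // false: the bad payload is rejected
```

With Zod you get the same guarantee, plus precise per-field error messages, without writing these guards by hand.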
Step 4: Build the Browser Automation Layer
Create src/scraper.ts:
import { chromium, Browser, Page } from "playwright";
import type { PageContent, ScraperConfig } from "./types";
export class BrowserScraper {
private browser: Browser | null = null;
async launch(): Promise<void> {
this.browser = await chromium.launch({
headless: true,
args: [
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-dev-shm-usage",
],
});
}
async scrape(config: ScraperConfig): Promise<PageContent> {
if (!this.browser) {
throw new Error("Browser not launched. Call launch() first.");
}
const context = await this.browser.newContext({
userAgent:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " +
"AppleWebKit/537.36 (KHTML, like Gecko) " +
"Chrome/122.0.0.0 Safari/537.36",
viewport: { width: 1280, height: 720 },
});
const page = await context.newPage();
try {
// Navigate with timeout
await page.goto(config.url, {
waitUntil: "networkidle",
timeout: config.timeout || 30000,
});
// Wait for a specific selector if provided
if (config.waitForSelector) {
await page.waitForSelector(config.waitForSelector, {
timeout: 10000,
});
}
// Auto-scroll to trigger lazy-loaded content
await this.autoScroll(page);
// Extract content
const content = await this.extractContent(page);
// Optional screenshot
let screenshot: Buffer | undefined;
if (config.screenshot) {
screenshot = await page.screenshot({
fullPage: true,
type: "png",
});
}
return {
...content,
screenshot,
};
} finally {
await context.close();
}
}
private async autoScroll(page: Page): Promise<void> {
await page.evaluate(async () => {
await new Promise<void>((resolve) => {
let totalHeight = 0;
const distance = 300;
const timer = setInterval(() => {
const scrollHeight = document.body.scrollHeight;
window.scrollBy(0, distance);
totalHeight += distance;
if (totalHeight >= scrollHeight) {
clearInterval(timer);
resolve();
}
}, 100);
// Safety timeout: stop scrolling after 10 seconds
setTimeout(() => {
clearInterval(timer);
resolve();
}, 10000);
});
});
}
private async extractContent(page: Page): Promise<Omit<PageContent, "screenshot">> {
const title = await page.title();
const url = page.url();
// Get clean text content (strips scripts, styles, hidden elements)
const text = await page.evaluate(() => {
const scripts = document.querySelectorAll(
"script, style, noscript, iframe"
);
scripts.forEach((el) => el.remove());
return document.body.innerText || "";
});
// Get the raw HTML for structure analysis
const html = await page.evaluate(() => {
return document.body.innerHTML;
});
return { html, text, url, title };
}
async close(): Promise<void> {
if (this.browser) {
await this.browser.close();
this.browser = null;
}
}
}

Key design decisions:
- `networkidle` wait strategy — ensures JavaScript-rendered content is loaded
- Auto-scroll — triggers lazy-loaded content (common on modern sites)
- Clean text extraction — removes scripts and styles before sending to Claude
- Context isolation — each scrape gets a fresh browser context (clean cookies, no cross-contamination)
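As a sanity check on the auto-scroll timing, here's pure arithmetic mirroring the in-page loop (one 300 px step per 100 ms tick, capped by the 10 s safety timeout); the page heights are hypothetical examples:

```typescript
// How long the auto-scroll loop runs for a given page height:
// one 300 px step per 100 ms tick, hard-capped at 10 s by the safety timeout.
function scrollDurationMs(scrollHeight: number, distance = 300, tickMs = 100): number {
  const steps = Math.ceil(scrollHeight / distance);
  return Math.min(steps * tickMs, 10_000);
}

console.log(scrollDurationMs(2400)); // 800
console.log(scrollDurationMs(60_000)); // 10000: the safety timeout kicks in
```

If your target pages are extremely tall, raise the safety timeout or the scroll distance accordingly.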
Step 5: Build the AI Extraction Layer
This is where the magic happens. Create src/ai-extractor.ts:
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import type { PageContent, ExtractionResult } from "./types";
import { ExtractionResultSchema } from "./types";
export class AIExtractor {
private client: Anthropic;
private model = "claude-sonnet-4-20250514";
constructor(apiKey: string) {
this.client = new Anthropic({ apiKey });
}
async extract(
content: PageContent,
userPrompt: string,
schema?: z.ZodSchema
): Promise<ExtractionResult> {
const systemPrompt = `You are a precise data extraction assistant. Your job is to analyze web page content and extract structured data based on the user's request.
Rules:
1. Return ONLY valid JSON — no markdown, no explanations, no code fences.
2. Extract ALL matching items from the page content.
3. If a field is not found, use null instead of making up data.
4. URLs should be absolute (resolve relative URLs using the page URL).
5. Clean up text: remove extra whitespace, fix encoding issues.
6. Be thorough — scan the entire content, not just the visible portion.`;
const userMessage = `
Page URL: ${content.url}
Page Title: ${content.title}
--- PAGE CONTENT START ---
${this.truncateContent(content.text, 80000)}
--- PAGE CONTENT END ---
--- HTML STRUCTURE (first 20000 chars) ---
${this.truncateContent(content.html, 20000)}
--- HTML STRUCTURE END ---
EXTRACTION REQUEST: ${userPrompt}
Return the data as JSON matching this exact structure:
{
"items": [
{
"title": "string",
"description": "string or null",
"price": "string or null",
"url": "absolute URL or null",
"imageUrl": "absolute URL or null",
"metadata": { "key": "value" }
}
],
"totalFound": number,
"pageInfo": {
"title": "page title",
"url": "page url",
"scrapedAt": "ISO date string"
}
}`;
const response = await this.client.messages.create({
model: this.model,
max_tokens: 4096,
system: systemPrompt,
messages: [{ role: "user", content: userMessage }],
});
const responseText = response.content
.filter((block) => block.type === "text")
.map((block) => block.text)
.join("");
return this.parseResponse(responseText, schema);
}
private parseResponse(
text: string,
schema?: z.ZodSchema
): ExtractionResult {
let cleaned = text.trim();
// Remove markdown code fences if present
if (cleaned.startsWith("```")) {
cleaned = cleaned.replace(/^```(?:json)?\n?/, "").replace(/\n?```$/, "");
}
let parsed: unknown;
try {
parsed = JSON.parse(cleaned);
} catch (error) {
throw new Error(
`Failed to parse AI response as JSON: ${(error as Error).message}\n` +
`Response preview: ${cleaned.substring(0, 200)}`
);
}
const validationSchema = schema || ExtractionResultSchema;
const result = validationSchema.safeParse(parsed);
if (!result.success) {
throw new Error(
`AI response validation failed:\n${result.error.issues
.map((i) => ` - ${i.path.join(".")}: ${i.message}`)
.join("\n")}`
);
}
return result.data as ExtractionResult;
}
private truncateContent(text: string, maxChars: number): string {
if (text.length <= maxChars) return text;
return (
text.substring(0, maxChars) +
`\n\n[... truncated ${text.length - maxChars} characters ...]`
);
}
}

🚀 Need help implementing AI-powered automation for your business? Noqta builds intelligent solutions for teams who want results, not experiments.
A few things worth noting:
- Content truncation — Claude has a large context window, but we're still smart about token usage. Text gets 80K chars, HTML gets 20K for structural hints.
- Zod validation — the AI's response is validated against a schema. If Claude returns unexpected structure, you get a clear error instead of silent corruption.
- Response cleaning — sometimes Claude wraps JSON in markdown code fences despite being told not to. We handle that gracefully.
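The cleaning step is easy to verify in isolation. This standalone version mirrors the fence-stripping logic in `parseResponse`:

````typescript
// Strip optional markdown code fences before handing the text to JSON.parse.
function stripFences(text: string): string {
  let cleaned = text.trim();
  if (cleaned.startsWith("```")) {
    cleaned = cleaned.replace(/^```(?:json)?\n?/, "").replace(/\n?```$/, "");
  }
  return cleaned;
}

const fenced = '```json\n{"items": []}\n```';
console.log(stripFences(fenced)); // {"items": []}
console.log(stripFences('{"a":1}')); // {"a":1} (unfenced input passes through)
````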
Step 6: Add Retry Logic and Rate Limiting
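Transient failures (network hiccups, API rate limits) call for exponential backoff. Ignoring jitter for a moment, the schedule we want looks like this:

```typescript
// Exponential backoff schedule: base * 2^attempt, capped at a maximum.
// Parameters match the defaults used in retry.ts below: 1 s base, 10 s cap.
const baseDelayMs = 1000;
const maxDelayMs = 10000;

const schedule = [0, 1, 2, 3, 4].map((attempt) =>
  Math.min(baseDelayMs * Math.pow(2, attempt), maxDelayMs)
);

console.log(schedule); // [ 1000, 2000, 4000, 8000, 10000 ]
```

The real implementation also adds random jitter so that concurrent scrapers don't retry in lockstep.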
Create src/utils/retry.ts:
export interface RetryConfig {
maxRetries: number;
baseDelayMs: number;
maxDelayMs: number;
}
const DEFAULT_CONFIG: RetryConfig = {
maxRetries: 3,
baseDelayMs: 1000,
maxDelayMs: 10000,
};
export async function withRetry<T>(
fn: () => Promise<T>,
config: Partial<RetryConfig> = {}
): Promise<T> {
const { maxRetries, baseDelayMs, maxDelayMs } = {
...DEFAULT_CONFIG,
...config,
};
let lastError: Error | undefined;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (attempt === maxRetries) break;
const delay = Math.min(
baseDelayMs * Math.pow(2, attempt) + Math.random() * 1000,
maxDelayMs
);
console.warn(
`Attempt ${attempt + 1} failed: ${lastError.message}. ` +
`Retrying in ${Math.round(delay)}ms...`
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}

Create src/utils/rate-limiter.ts:
export class RateLimiter {
private lastCall = 0;
constructor(private minIntervalMs: number) {}
async wait(): Promise<void> {
const now = Date.now();
const elapsed = now - this.lastCall;
if (elapsed < this.minIntervalMs) {
const waitTime = this.minIntervalMs - elapsed;
await new Promise((resolve) => setTimeout(resolve, waitTime));
}
this.lastCall = Date.now();
}
}

Step 7: Build the Multi-Page Scraper
Create src/multi-page-scraper.ts — handles pagination:
import { BrowserScraper } from "./scraper";
import { AIExtractor } from "./ai-extractor";
import { RateLimiter } from "./utils/rate-limiter";
import { withRetry } from "./utils/retry";
import type { ScraperConfig, ExtractionResult, ScrapedItem } from "./types";
export class MultiPageScraper {
private browserScraper: BrowserScraper;
private aiExtractor: AIExtractor;
private rateLimiter: RateLimiter;
constructor(apiKey: string, rateLimitMs = 1000) {
this.browserScraper = new BrowserScraper();
this.aiExtractor = new AIExtractor(apiKey);
this.rateLimiter = new RateLimiter(rateLimitMs);
}
async scrapeMultiplePages(
configs: ScraperConfig[]
): Promise<ExtractionResult> {
await this.browserScraper.launch();
const allItems: ScrapedItem[] = [];
let totalFound = 0;
try {
for (const [index, config] of configs.entries()) {
console.log(
`\n📄 Scraping page ${index + 1}/${configs.length}: ${config.url}`
);
await this.rateLimiter.wait();
const result = await withRetry(async () => {
const content = await this.browserScraper.scrape(config);
console.log(
` ✓ Page loaded (${content.text.length} chars text)`
);
const extraction = await this.aiExtractor.extract(
content,
config.prompt,
config.schema
);
console.log(
` ✓ Extracted ${extraction.items.length} items`
);
return extraction;
});
allItems.push(...result.items);
totalFound += result.totalFound;
}
} finally {
await this.browserScraper.close();
}
return {
items: allItems,
totalFound,
pageInfo: {
title: `Multi-page scrape (${configs.length} pages)`,
url: configs[0]?.url || "",
scrapedAt: new Date().toISOString(),
},
};
}
}

Step 8: Create the CLI Interface
Create src/index.ts:
import "dotenv/config";
import { Command } from "commander";
import { BrowserScraper } from "./scraper";
import { AIExtractor } from "./ai-extractor";
import { MultiPageScraper } from "./multi-page-scraper";
import { withRetry } from "./utils/retry";
import { writeFileSync } from "fs";
const program = new Command();
program
.name("ai-scraper")
.description("AI-powered web scraper using Playwright and Claude")
.version("1.0.0");
program
.command("scrape")
.description("Scrape a single URL")
.requiredOption("-u, --url <url>", "URL to scrape")
.requiredOption("-p, --prompt <prompt>", "What to extract")
.option("-o, --output <file>", "Output JSON file")
.option("-s, --screenshot", "Take a full-page screenshot")
.option("--wait <selector>", "CSS selector to wait for")
.option("--timeout <ms>", "Navigation timeout in ms", "30000")
.action(async (options) => {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
console.error("❌ ANTHROPIC_API_KEY not set in .env");
process.exit(1);
}
const scraper = new BrowserScraper();
const extractor = new AIExtractor(apiKey);
try {
console.log(`🌐 Navigating to ${options.url}...`);
await scraper.launch();
const content = await withRetry(() =>
scraper.scrape({
url: options.url,
prompt: options.prompt,
waitForSelector: options.wait,
screenshot: options.screenshot,
timeout: parseInt(options.timeout),
})
);
console.log(`📝 Page loaded: "${content.title}"`);
console.log(` Text: ${content.text.length} chars`);
console.log(` HTML: ${content.html.length} chars`);
if (content.screenshot) {
const screenshotPath = "screenshot.png";
writeFileSync(screenshotPath, content.screenshot);
console.log(`📸 Screenshot saved: ${screenshotPath}`);
}
console.log(`\n🤖 Sending to Claude for extraction...`);
const result = await withRetry(() =>
extractor.extract(content, options.prompt)
);
console.log(`✅ Extracted ${result.items.length} items\n`);
const output = JSON.stringify(result, null, 2);
if (options.output) {
writeFileSync(options.output, output);
console.log(`💾 Saved to ${options.output}`);
} else {
console.log(output);
}
} catch (error) {
console.error(`❌ Error: ${(error as Error).message}`);
process.exit(1);
} finally {
await scraper.close();
}
});
program
.command("multi")
.description("Scrape multiple URLs")
.requiredOption("-u, --urls <urls...>", "URLs to scrape (space-separated)")
.requiredOption("-p, --prompt <prompt>", "What to extract")
.option("-o, --output <file>", "Output JSON file")
.action(async (options) => {
const apiKey = process.env.ANTHROPIC_API_KEY;
if (!apiKey) {
console.error("❌ ANTHROPIC_API_KEY not set in .env");
process.exit(1);
}
const multiScraper = new MultiPageScraper(
apiKey,
parseInt(process.env.RATE_LIMIT_MS || "1000")
);
try {
const configs = options.urls.map((url: string) => ({
url,
prompt: options.prompt,
}));
const result = await multiScraper.scrapeMultiplePages(configs);
const output = JSON.stringify(result, null, 2);
if (options.output) {
writeFileSync(options.output, output);
console.log(`\n💾 Saved ${result.items.length} items to ${options.output}`);
} else {
console.log(output);
}
} catch (error) {
console.error(`❌ Error: ${(error as Error).message}`);
process.exit(1);
}
});
program.parse();

Step 9: Add the Package Scripts
Update your package.json:
{
"type": "module",
"scripts": {
"scrape": "tsx src/index.ts scrape",
"multi": "tsx src/index.ts multi",
"build": "tsc",
"start": "node dist/index.js"
}
}

Step 10: Test It — Real-World Examples
Example 1: Scrape product listings
npx tsx src/index.ts scrape \
-u "https://books.toscrape.com" \
-p "Extract all book titles, prices, ratings, and availability" \
-o books.json

Expected output:
{
"items": [
{
"title": "A Light in the Attic",
"price": "£51.77",
"metadata": {
"rating": "Three",
"availability": "In stock"
}
},
{
"title": "Tipping the Velvet",
"price": "£53.74",
"metadata": {
"rating": "One",
"availability": "In stock"
}
}
],
"totalFound": 20,
"pageInfo": {
"title": "All products | Books to Scrape",
"url": "https://books.toscrape.com",
"scrapedAt": "2026-03-13T12:00:00.000Z"
}
}

Example 2: Scrape job listings
npx tsx src/index.ts scrape \
-u "https://news.ycombinator.com/jobs" \
-p "Extract all job postings: company name, job title, posting date, and link" \
-o hn-jobs.json

Example 3: Multi-page scrape
npx tsx src/index.ts multi \
-u "https://books.toscrape.com/catalogue/page-1.html" \
"https://books.toscrape.com/catalogue/page-2.html" \
"https://books.toscrape.com/catalogue/page-3.html" \
-p "Extract all book titles and prices" \
-o all-books.json

Step 11: Add Custom Extraction Schemas
The real power comes from custom schemas. Create src/extractors/product-extractor.ts:
import { z } from "zod";
export const ProductSchema = z.object({
items: z.array(
z.object({
name: z.string(),
price: z.object({
amount: z.number(),
currency: z.string(),
}),
rating: z.number().min(0).max(5).optional(),
reviewCount: z.number().optional(),
availability: z.enum(["in_stock", "out_of_stock", "preorder"]),
sku: z.string().optional(),
brand: z.string().optional(),
category: z.string().optional(),
url: z.string().url(),
imageUrl: z.string().url().optional(),
})
),
totalFound: z.number(),
pageInfo: z.object({
title: z.string(),
url: z.string(),
scrapedAt: z.string(),
}),
});
export type ProductResult = z.infer<typeof ProductSchema>;

Use it in your scrape call:
import { ProductSchema } from "./extractors/product-extractor";
const result = await extractor.extract(content, userPrompt, ProductSchema);
// result is now fully typed with ProductResult shape

💡 Ready to go from reading to building? Talk to our team about implementing custom automation and API integrations for your business.
Tips and Best Practices
✅ Do
- Use `networkidle` for JavaScript-heavy sites — it waits for all network requests to settle
- Validate with Zod — never trust raw AI output in production
- Implement rate limiting — be a good citizen; don't hammer target servers
- Cache results — save raw HTML alongside extracted data for debugging
- Use screenshots for debugging extraction issues
⚠️ Watch Out For
- Robots.txt — always check and respect a site's crawling rules
- Terms of Service — ensure scraping is permitted
- Rate limits — both for target sites AND your Anthropic API
- Token costs — large pages mean more tokens. Monitor your API usage
- Dynamic content — some sites require specific interactions (clicks, scrolls) before content appears
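On the robots.txt point: in production, reach for a dedicated parser, but even a deliberately naive check beats no check at all. This sketch intentionally ignores user-agent groups, wildcards, and Allow directives:

```typescript
// Naive robots.txt check: collects every Disallow rule regardless of
// user-agent group. Ignores wildcards and Allow directives on purpose.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  const disallowed = robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter((rule) => rule.length > 0);
  return !disallowed.some((rule) => path.startsWith(rule));
}

const robots = "User-agent: *\nDisallow: /admin/\nDisallow: /private/";
console.log(isPathAllowed(robots, "/catalogue/page-1.html")); // true
console.log(isPathAllowed(robots, "/admin/login")); // false
```

Because it treats every Disallow rule as applying to you, it can only be overly conservative, which is the safe direction to err in.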
❌ Don't
- Scrape personal data without consent
- Bypass authentication mechanisms
- Ignore rate limits or `robots.txt`
- Use this for spam or automated content generation without attribution
Architecture Overview
Here's how the components fit together:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ CLI Layer │────▶│ Playwright │────▶│ Target Site │
│ (Commander) │ │ (Browser) │ │ │
└──────┬───────┘ └──────┬───────┘ └──────────────┘
│ │
│ PageContent
│ (html + text)
│ │
│ ┌──────▼───────┐ ┌──────────────┐
│ │ AI Extractor │────▶│ Claude API │
│ │ (Anthropic) │ │ │
│ └──────┬───────┘ └──────────────┘
│ │
│ Structured JSON
│ (Zod-validated)
│ │
▼ ▼
┌──────────────────────────────┐
│ Output (stdout or .json) │
└──────────────────────────────┘
Going Further
Here are some ideas to extend this project:
- Add a caching layer — store raw HTML in SQLite so you can re-extract without re-scraping
- Build a web UI — wrap the CLI in a Next.js app with a form interface
- Schedule scrapes — use cron jobs or a task queue to scrape on a schedule
- Add proxy support — rotate proxies for large-scale scraping
- Stream results — use Claude's streaming API for real-time extraction feedback
- Vision extraction — send screenshots to Claude's vision capability for layout-dependent data
Summary
You've built a production-ready AI-powered web scraper that:
- Uses Playwright for reliable browser automation (handles JS-rendered content)
- Leverages Claude for intelligent, schema-free data extraction
- Validates output with Zod for type safety
- Supports multi-page scraping with rate limiting
- Includes retry logic with exponential backoff
The key insight: instead of writing fragile CSS selectors that break on every site update, you describe what you want in natural language and let AI figure out where it lives. This approach is more resilient, more flexible, and surprisingly more accurate than traditional scraping.
The full source code is available to adapt for your specific use cases. Happy scraping — and remember to be ethical about it.
Building automation tools is one thing. Building them right — resilient, type-safe, production-ready — is another. At Noqta, we help teams implement AI-powered solutions that actually work in production. Let's talk about your next project.