Tutorial · May 29, 2026 · 28 min read

Firecrawl + Next.js: AI-Powered Web Data Extraction Tutorial

Learn to build AI-powered web data extraction pipelines using Firecrawl and Next.js 15. Covers scraping, LLM-structured extraction with Zod schemas, website crawling, and a production-ready intelligence dashboard.

Traditional scrapers depend on CSS selectors and XPath expressions, so they break whenever a site changes its HTML structure. Firecrawl addresses this by combining web crawling with LLM-powered extraction: you define what you want, not where to find it in the DOM.

In this tutorial, you'll build a Competitor Intelligence Dashboard using Firecrawl and Next.js 15 that:

  • Scrapes any web page and returns clean Markdown
  • Extracts structured data (product names, pricing, features) using AI and Zod schemas
  • Crawls entire documentation sites or product catalogs
  • Displays real-time results in a production-ready dashboard

Prerequisites

Before starting, ensure you have:

  • Node.js 20+ installed
  • Basic knowledge of Next.js App Router and TypeScript
  • A Firecrawl account and API key (free tier: 500 credits/month at firecrawl.dev)
  • Familiarity with Zod for schema validation

What You'll Build

A Next.js 15 application with:

  1. API routes that interface with the Firecrawl SDK
  2. Zod-validated extraction for structured competitor product data
  3. Async crawl jobs with status polling for large sites
  4. A dashboard UI displaying intelligence cards with pricing and features

Step 1: Project Setup

Create a new Next.js 15 project with TypeScript:

npx create-next-app@latest competitor-intel --typescript --tailwind --app
cd competitor-intel

Install the Firecrawl JavaScript SDK and Zod:

npm install @mendable/firecrawl-js zod

Add your API key to .env.local:

FIRECRAWL_API_KEY=fc-your-api-key-here

Step 2: Configure the Firecrawl Client

Create a reusable client module at lib/firecrawl.ts:

import FirecrawlApp from '@mendable/firecrawl-js';
 
if (!process.env.FIRECRAWL_API_KEY) {
  throw new Error('FIRECRAWL_API_KEY is not defined');
}
 
export const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY,
});

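Because the module is evaluated once and then cached, every route handler that imports firecrawl shares this one client instead of constructing a new one per request. The guard at the top also surfaces a missing key as soon as the module is first imported, rather than on the first API call.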

Step 3: Scraping a Single Page

Firecrawl's scrape endpoint fetches a URL and returns clean data in multiple formats. Create app/api/scrape/route.ts:

import { NextRequest, NextResponse } from 'next/server';
import { firecrawl } from '@/lib/firecrawl';
 
export async function POST(request: NextRequest) {
  const { url } = await request.json();
 
  if (!url || typeof url !== 'string') {
    return NextResponse.json({ error: 'URL is required' }, { status: 400 });
  }
 
  try {
    const result = await firecrawl.scrapeUrl(url, {
      formats: ['markdown', 'html'],
    });
 
    if (!result.success) {
      return NextResponse.json({ error: 'Scrape failed' }, { status: 500 });
    }
 
    return NextResponse.json({
      markdown: result.markdown,
      title: result.metadata?.title,
      description: result.metadata?.description,
    });
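  } catch (error) {
    // Log the underlying error server-side; the client only sees a generic message
    console.error('Scrape request failed:', error);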
    return NextResponse.json(
      { error: 'Failed to scrape URL' },
      { status: 500 }
    );
  }
}

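The formats array controls what Firecrawl returns. markdown gives you clean, readable text that works well as LLM input, while html returns the page's processed HTML (Firecrawl offers rawHtml as a separate format for the unmodified markup). This route only forwards the Markdown and metadata, so you can drop 'html' from formats if you don't need it.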
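The route above accepts any non-empty string as a URL. A stricter check could look like the sketch below, which uses only the standard URL class. parseHttpUrl is a hypothetical helper name, not part of Firecrawl or Next.js.

```typescript
// Hypothetical helper: accept only absolute http(s) URLs before spending
// a Firecrawl credit on them. Returns null for anything else.
function parseHttpUrl(input: unknown): URL | null {
  if (typeof input !== 'string' || input.trim() === '') return null;

  let parsed: URL;
  try {
    parsed = new URL(input.trim());
  } catch {
    return null; // not an absolute URL (e.g. "example.com" without a scheme)
  }

  // Reject schemes such as ftp:, file:, or javascript:
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
  return parsed;
}
```

In the route handler, this could replace the typeof check: call parseHttpUrl(url), return a 400 response when it yields null, and pass the parsed URL's href to scrapeUrl otherwise.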

Step 4: LLM-Structured Extraction with Zod

The real power of Firecrawl comes from schema-driven extraction — you define a Zod schema, and Firecrawl uses an LLM to extract matching fields from any page, regardless of HTML structure.

Define your product schema at lib/schemas.ts:

import { z } from 'zod';
 
export const ProductSchema = z.object({
  name: z.string().describe('Product or service name'),
  tagline: z.string().optional().describe('Main marketing tagline'),
  pricing: z
    .array(
      z.object({
        plan: z.string(),
        price: z.string(),
        features: z.array(z.string()),
      })
    )
    .optional()
    .describe('Pricing tiers with features'),
  mainFeatures: z.array(z.string()).describe('Top 5 key features'),
  targetAudience: z.string().optional().describe('Who the product is for'),
  techStack: z.array(z.string()).optional().describe('Technologies mentioned'),
});
 
export type Product = z.infer<typeof ProductSchema>;

Now create the extraction API route at app/api/extract/route.ts:

import { NextRequest, NextResponse } from 'next/server';
import { firecrawl } from '@/lib/firecrawl';
import { ProductSchema } from '@/lib/schemas';
 
export async function POST(request: NextRequest) {
  const { url } = await request.json();
 
  if (!url || typeof url !== 'string') {
    return NextResponse.json({ error: 'URL is required' }, { status: 400 });
  }
 
  try {
    const result = await firecrawl.scrapeUrl(url, {
      formats: ['extract'],
      extract: {
        schema: ProductSchema,
        prompt:
          'Extract product information, pricing tiers, and key features from this page.',
      },
    });
 
    if (!result.success || !result.extract) {
      return NextResponse.json({ error: 'Extraction failed' }, { status: 500 });
    }
 
    return NextResponse.json({ data: result.extract });
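  } catch (error) {
    // Log the underlying error server-side; the client only sees a generic message
    console.error('Extraction request failed:', error);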
    return NextResponse.json(
      { error: 'Failed to extract data' },
      { status: 500 }
    );
  }
}

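The extract format sends the scraped content through an LLM and returns data shaped by your Zod schema. Because extraction works from the page's content rather than fixed selectors, it generally keeps working after a layout redesign. Note that the route forwards result.extract as returned, without re-checking it against ProductSchema.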
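Since LLM output can occasionally drift from the requested shape, it may be worth checking the returned object before sending it to the client. Inside the route, ProductSchema.safeParse(result.extract) would be the natural choice. The dependency-free sketch below shows the same idea with a hand-written type guard; ProductLike and the guard names are hypothetical and cover only the required fields plus pricing.

```typescript
// Mirrors the required fields of ProductSchema (name, mainFeatures) and the
// optional pricing array. These types and guards are illustrative only.
type PricingTier = { plan: string; price: string; features: string[] };
type ProductLike = { name: string; mainFeatures: string[]; pricing?: PricingTier[] };

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === 'string');
}

function isPricingTier(value: unknown): value is PricingTier {
  if (typeof value !== 'object' || value === null) return false;
  const tier = value as Record<string, unknown>;
  return (
    typeof tier.plan === 'string' &&
    typeof tier.price === 'string' &&
    isStringArray(tier.features)
  );
}

// True only when required fields are present and pricing, if given, is well formed.
function isProductLike(value: unknown): value is ProductLike {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  if (typeof candidate.name !== 'string') return false;
  if (!isStringArray(candidate.mainFeatures)) return false;
  if (candidate.pricing !== undefined) {
    if (!Array.isArray(candidate.pricing)) return false;
    if (!candidate.pricing.every(isPricingTier)) return false;
  }
  return true;
}
```

A route that fails this check could return a 502-style error rather than passing a malformed object to the dashboard, where product.mainFeatures.map would otherwise throw.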

Step 5: Crawling Entire Websites

For crawling multiple pages (a documentation site or product catalog), use asyncCrawlUrl. Create app/api/crawl/route.ts:

import { NextRequest, NextResponse } from 'next/server';
import { firecrawl } from '@/lib/firecrawl';
 
export async function POST(request: NextRequest) {
  const { url, limit = 10 } = await request.json();
 
  if (!url || typeof url !== 'string') {
    return NextResponse.json({ error: 'URL is required' }, { status: 400 });
  }
 
  try {
    const crawlResponse = await firecrawl.asyncCrawlUrl(url, {
      limit,
      scrapeOptions: {
        formats: ['markdown'],
      },
      // excludePaths entries are matched as regular expressions, not globs
      excludePaths: ['blog/.*', 'news/.*'],
    });
 
    if (!crawlResponse.success) {
      return NextResponse.json(
        { error: 'Crawl failed to start' },
        { status: 500 }
      );
    }
 
    return NextResponse.json({ jobId: crawlResponse.id });
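  } catch (error) {
    // Log the underlying error server-side; the client only sees a generic message
    console.error('Crawl start failed:', error);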
    return NextResponse.json(
      { error: 'Failed to start crawl' },
      { status: 500 }
    );
  }
}
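The route passes limit straight from the request body to Firecrawl, so a client could request an arbitrarily large crawl. One way to bound it is sketched below. clampCrawlLimit and both constants are hypothetical; pick bounds that match your own credit budget.

```typescript
// Hypothetical bounds for the number of pages a single crawl may request.
const DEFAULT_LIMIT = 10;
const MAX_LIMIT = 100;

// Coerce untrusted input into a whole number between 1 and MAX_LIMIT.
// Anything that is not a finite number falls back to DEFAULT_LIMIT.
function clampCrawlLimit(input: unknown): number {
  const parsed = typeof input === 'string' ? Number(input) : input;
  if (typeof parsed !== 'number' || !Number.isFinite(parsed)) {
    return DEFAULT_LIMIT;
  }
  const whole = Math.floor(parsed);
  return Math.min(Math.max(whole, 1), MAX_LIMIT);
}
```

In the handler this would wrap the destructured value, for example by passing clampCrawlLimit(limit) to asyncCrawlUrl instead of limit.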

asyncCrawlUrl returns a job ID immediately. Poll for completion:

// app/api/crawl/status/route.ts
import { NextRequest, NextResponse } from 'next/server';
import { firecrawl } from '@/lib/firecrawl';
 
export async function GET(request: NextRequest) {
  const jobId = request.nextUrl.searchParams.get('jobId');
 
  if (!jobId) {
    return NextResponse.json({ error: 'jobId is required' }, { status: 400 });
  }
 
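  try {
    const status = await firecrawl.checkCrawlStatus(jobId);

    // The SDK can resolve to an error response instead of throwing
    if (!status.success) {
      return NextResponse.json(
        { error: 'Failed to fetch crawl status' },
        { status: 500 }
      );
    }

    return NextResponse.json({
      status: status.status,
      completed: status.completed,
      total: status.total,
      // Page data is only returned once the job has finished
      pages: status.status === 'completed' ? status.data : [],
    });
  } catch (error) {
    console.error('Crawl status check failed:', error);
    return NextResponse.json(
      { error: 'Failed to fetch crawl status' },
      { status: 500 }
    );
  }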
}
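On the client, the status route can be polled until the job finishes. The sketch below keeps the status check injectable so the loop does not depend on any particular fetch wrapper. pollCrawl, crawlProgress, and the default interval are hypothetical, and only the 'completed' status appears in the route above; the other terminal values are assumptions to confirm against the SDK's types.

```typescript
type CrawlStatus = { status: string; completed: number; total: number };

// Only 'completed' is used by the status route; 'failed' and 'cancelled'
// are assumed terminal values and should be checked against the SDK.
const TERMINAL_STATUSES = ['completed', 'failed', 'cancelled'];

// Percentage for a progress bar; 0 while the total is still unknown.
function crawlProgress(current: CrawlStatus): number {
  if (current.total <= 0) return 0;
  return Math.min(100, Math.round((current.completed / current.total) * 100));
}

// Call `check` every `intervalMs` until the job reaches a terminal status
// or the attempt budget runs out.
async function pollCrawl(
  check: () => Promise<CrawlStatus>,
  intervalMs = 2000,
  maxAttempts = 60
): Promise<CrawlStatus> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const current = await check();
    if (TERMINAL_STATUSES.includes(current.status)) return current;
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Crawl still running after ${maxAttempts} status checks`);
}
```

Here check would typically fetch /api/crawl/status with the jobId query parameter and parse the JSON body. Capping the number of attempts keeps a stalled job from polling forever.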

Step 6: URL Discovery with Map

Before committing crawl credits to an entire site, use mapUrl to discover available pages:

const siteMap = await firecrawl.mapUrl('https://competitor.com', {
  search: 'pricing', // Filter URLs containing this keyword
  limit: 50,
});
 
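// Check success first: the SDK may resolve to an error response without links
if (siteMap.success) {
  console.log(siteMap.links);
  // ['https://competitor.com/pricing', 'https://competitor.com/pricing/enterprise', ...]
}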

This is ideal for targeted crawling — discover the relevant pages first, then crawl only those.
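Once the link list is available, it can be narrowed before any page is scraped. The following sketch filters a list of discovered links down to same-origin URLs whose path contains a keyword, with duplicates removed. selectUrls and its parameters are hypothetical names for illustration.

```typescript
// Hypothetical helper: keep same-origin links whose path contains one of the
// keywords, drop fragments and duplicates, and stop after `max` results.
function selectUrls(
  links: string[],
  origin: string,
  keywords: string[],
  max = 20
): string[] {
  const wanted = keywords.map((keyword) => keyword.toLowerCase());
  const seen = new Set<string>();
  const selected: string[] = [];

  for (const link of links) {
    let parsed: URL;
    try {
      parsed = new URL(link);
    } catch {
      continue; // skip anything that is not an absolute URL
    }
    if (parsed.origin !== origin) continue;

    const path = parsed.pathname.toLowerCase();
    if (!wanted.some((keyword) => path.includes(keyword))) continue;

    parsed.hash = ''; // "#faq" variants point at the same page
    const key = parsed.toString();
    if (seen.has(key)) continue;

    seen.add(key);
    selected.push(key);
    if (selected.length >= max) break;
  }
  return selected;
}
```

The resulting list can then be scraped or extracted one URL at a time, which keeps credit usage proportional to the pages you actually care about.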

Step 7: Building the Dashboard UI

Create the main page at app/page.tsx:

'use client';
 
import { useState } from 'react';
import type { Product } from '@/lib/schemas';
 
export default function IntelligenceDashboard() {
  const [url, setUrl] = useState('');
  const [loading, setLoading] = useState(false);
  const [product, setProduct] = useState<Product | null>(null);
  const [error, setError] = useState<string | null>(null);
 
  async function handleExtract() {
    if (!url) return;
    setLoading(true);
    setError(null);
 
    try {
      const res = await fetch('/api/extract', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url }),
      });
      const json = await res.json();
 
      if (!res.ok) {
        setError(json.error ?? 'Extraction failed');
        return;
      }
      setProduct(json.data);
    } catch {
      setError('Network error — please try again');
    } finally {
      setLoading(false);
    }
  }
 
  return (
    <div className="max-w-4xl mx-auto p-8">
      <h1 className="text-3xl font-bold mb-2">Competitor Intelligence</h1>
      <p className="text-gray-600 mb-8">
        Enter any competitor URL to extract structured product data with AI.
      </p>
 
      <div className="flex gap-2 mb-8">
        <input
          type="url"
          value={url}
          onChange={(e) => setUrl(e.target.value)}
          placeholder="https://competitor.com/pricing"
          className="flex-1 border rounded-lg px-4 py-2 text-sm"
        />
        <button
          onClick={handleExtract}
          disabled={loading || !url}
          className="bg-orange-500 text-white px-6 py-2 rounded-lg disabled:opacity-50"
        >
          {loading ? 'Extracting...' : 'Extract'}
        </button>
      </div>
 
      {error && (
        <div className="bg-red-50 border border-red-200 rounded-lg p-4 mb-6 text-red-700">
          {error}
        </div>
      )}
 
      {product && <ProductCard product={product} />}
    </div>
  );
}
 
function ProductCard({ product }: { product: Product }) {
  return (
    <div className="border rounded-xl p-6 space-y-4">
      <div>
        <h2 className="text-2xl font-bold">{product.name}</h2>
        {product.tagline && (
          <p className="text-gray-600 mt-1">{product.tagline}</p>
        )}
        {product.targetAudience && (
          <p className="text-sm text-orange-600 mt-1">
            For: {product.targetAudience}
          </p>
        )}
      </div>
 
      <div>
        <h3 className="font-semibold mb-2">Key Features</h3>
        <ul className="list-disc list-inside space-y-1 text-sm text-gray-700">
          {product.mainFeatures.map((f, i) => (
            <li key={i}>{f}</li>
          ))}
        </ul>
      </div>
 
      {product.pricing && product.pricing.length > 0 && (
        <div>
          <h3 className="font-semibold mb-2">Pricing Plans</h3>
          <div className="grid grid-cols-1 md:grid-cols-3 gap-3">
            {product.pricing.map((plan, i) => (
              <div key={i} className="border rounded-lg p-3 text-sm">
                <div className="font-medium">{plan.plan}</div>
                <div className="text-orange-600 font-bold">{plan.price}</div>
                <ul className="mt-2 space-y-1 text-gray-600">
                  {plan.features.slice(0, 3).map((f, j) => (
                    <li key={j} className="truncate">
                      • {f}
                    </li>
                  ))}
                </ul>
              </div>
            ))}
          </div>
        </div>
      )}
 
      {product.techStack && product.techStack.length > 0 && (
        <div>
          <h3 className="font-semibold mb-2">Tech Stack Mentioned</h3>
          <div className="flex flex-wrap gap-2">
            {product.techStack.map((tech, i) => (
              <span key={i} className="bg-gray-100 px-2 py-1 rounded text-xs">
                {tech}
              </span>
            ))}
          </div>
        </div>
      )}
    </div>
  );
}

Step 8: Rate Limiting and Retry Logic

Firecrawl enforces rate limits based on your plan. Add exponential backoff for resilience:

// lib/firecrawl-retry.ts
import { firecrawl } from './firecrawl';
 
export async function scrapeWithRetry(
  url: string,
  options = {},
  maxRetries = 3
) {
  let lastError: Error | null = null;
 
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await firecrawl.scrapeUrl(url, options);
    } catch (error) {
      lastError = error as Error;
 
      if (attempt < maxRetries) {
        // Exponential backoff: wait 1s after the first failure, 2s after the
        // second, doubling each time (no wait after the final attempt)
        await new Promise((resolve) =>
          setTimeout(resolve, Math.pow(2, attempt - 1) * 1000)
        );
      }
    }
  }
 
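  // lastError is only null if the loop never ran (maxRetries < 1)
  throw lastError ?? new Error('scrapeWithRetry: maxRetries must be at least 1');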
}
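The helper above retries every error with a fixed doubling schedule. Two common refinements are to retry only errors that are likely transient and to cap and randomize the delay. The sketch below shows both as pure functions. The names are hypothetical, and it assumes the caught error exposes an HTTP status code, which should be verified against the installed SDK.

```typescript
// Treat rate limiting (429) and server errors (5xx) as retryable.
// An undefined status is assumed to mean a network failure with no response.
function isRetryableStatus(status: number | undefined): boolean {
  if (status === undefined) return true;
  return status === 429 || (status >= 500 && status < 600);
}

// Delay before the next attempt: baseMs * 2^(attempt - 1), capped at capMs.
// `random` is injectable so the schedule stays deterministic in tests;
// passing Math.random spreads retries across [0, delay] ("full jitter").
function backoffDelay(
  attempt: number,
  baseMs = 1000,
  capMs = 30000,
  random: () => number = () => 1
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  return Math.round(exponential * random());
}
```

Inside a retry loop, a non-retryable status such as 400 or 404 would be rethrown immediately, while backoffDelay(attempt, 1000, 30000, Math.random) would supply the wait for the rest. Jitter helps when several requests hit the rate limit at the same moment and would otherwise all retry in lockstep.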

When processing multiple URLs in batch, add a small delay between requests:

async function batchScrape(urls: string[]) {
  const results = [];
 
  for (const url of urls) {
    const result = await scrapeWithRetry(url);
    results.push(result);
    // Pausing 500ms between requests reduces the chance of hitting rate limits
    await new Promise((resolve) => setTimeout(resolve, 500));
  }
 
  return results;
}

Step 9: Caching Results with Next.js

Firecrawl credits are limited, so cache extraction results to avoid redundant API calls:

import { unstable_cache } from 'next/cache';
import { firecrawl } from '@/lib/firecrawl';
import { ProductSchema } from '@/lib/schemas';
 
export const getCachedProductData = unstable_cache(
  async (url: string) => {
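    const result = await firecrawl.scrapeUrl(url, {
      formats: ['extract'],
      extract: { schema: ProductSchema },
    });
    // Throw on failure so an empty result is not cached for the full hour
    if (!result.success || !result.extract) {
      throw new Error(`Extraction failed for ${url}`);
    }
    return result.extract;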
  },
  ['product-data'],
  { revalidate: 3600 } // Cache for 1 hour
);
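Because the url argument is part of the cache key, two spellings of the same page produce two cache entries and two billed extractions. Normalizing the URL first avoids that. The sketch below is one possible approach; normalizeForCache and the TRACKING_PARAMS list are hypothetical, and stripping the trailing slash assumes the target site serves the same page with and without it.

```typescript
// Hypothetical list; extend it with the tracking parameters you see in practice.
const TRACKING_PARAMS = [
  'utm_source',
  'utm_medium',
  'utm_campaign',
  'utm_term',
  'utm_content',
];

// Produce one canonical string per page so equivalent URLs share a cache entry.
// Throws a TypeError if `input` is not an absolute URL.
function normalizeForCache(input: string): string {
  const parsed = new URL(input); // parsing also lowercases the hostname
  parsed.hash = '';
  for (const param of TRACKING_PARAMS) {
    parsed.searchParams.delete(param);
  }
  parsed.searchParams.sort(); // stable order for the remaining parameters

  // Assumption: "/pricing/" and "/pricing" are the same page on the target site
  if (parsed.pathname.length > 1 && parsed.pathname.endsWith('/')) {
    parsed.pathname = parsed.pathname.slice(0, -1);
  }
  return parsed.toString();
}
```

Callers would pass normalizeForCache(url) to getCachedProductData instead of the raw input, so that a link copied from a marketing email and the bare pricing URL resolve to the same cached result.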

Testing Your Implementation

  1. Start the dev server: npm run dev
  2. Navigate to http://localhost:3000
  3. Enter a competitor pricing URL (try https://vercel.com/pricing)
  4. Click Extract and wait for the AI extraction (typically 3-8 seconds)
  5. Verify the structured data matches what is displayed on the page

A successful extraction for a pricing page should return plan names, prices, and feature lists — all without any CSS selectors.

Troubleshooting

"FIRECRAWL_API_KEY is not defined": This is the error thrown by lib/firecrawl.ts when the key is missing. Ensure FIRECRAWL_API_KEY is set in .env.local and restart the dev server after adding it.

"Scrape failed" on certain sites: Some sites aggressively block scrapers. Firecrawl handles many anti-bot measures itself, but not every site can be scraped. For JavaScript-heavy SPAs, where content renders after the initial load, add a waitFor option:

const result = await firecrawl.scrapeUrl(url, {
  formats: ['extract'],
  waitFor: 2000, // Wait 2 seconds for JS rendering
  extract: { schema: ProductSchema },
});

Empty extraction results: The LLM extraction works best with content-rich pages. Ensure the target page has sufficient visible text about the product.

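Function timeout in production: Slow requests, such as LLM extraction or large crawl jobs, can outlast the default serverless function timeout. Increase the maximum function duration by exporting maxDuration from the affected route: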

// Top of app/api/crawl/route.ts
export const maxDuration = 60;

Deployment on Vercel

  1. Add FIRECRAWL_API_KEY to your Vercel project environment variables
  2. For crawl routes, increase maxDuration to 60 seconds as shown above
  3. Use asyncCrawlUrl rather than crawlUrl in production: crawlUrl waits for the entire crawl to finish, which can exceed the function's time limit

Next Steps

  • Add a database (Neon + Drizzle) to persist extracted competitor data over time
  • Schedule weekly re-scrapes with Trigger.dev for freshness monitoring
  • Combine with the Vercel AI SDK to auto-generate competitive analysis reports
  • Explore Firecrawl's Agent API for fully autonomous data gathering tasks

Conclusion

Firecrawl transforms web scraping from a fragile HTML-parsing exercise into a resilient AI-powered pipeline. The combination of scrapeUrl for single pages, asyncCrawlUrl for entire sites, and schema-based extract for structured data covers the vast majority of web data extraction needs in modern AI applications. With Zod schema validation and Next.js caching, you get a production-ready pipeline that delivers structured competitor intelligence without maintaining brittle CSS selectors.