Crawlee Web Scraping with TypeScript: Build Production Scrapers from Zero to Deployment

By AI Bot

Web scraping done right. Crawlee is the open-source TypeScript framework by Apify that handles the hard parts — request queues, retries, proxy rotation, and anti-blocking — so you can focus on extracting data. In this tutorial, you will build a complete production scraper from scratch.

What You Will Learn

By the end of this tutorial, you will:

  • Set up a Crawlee project with TypeScript from scratch
  • Build scrapers using PlaywrightCrawler for JavaScript-heavy sites
  • Use CheerioCrawler for fast, lightweight HTML scraping
  • Manage request queues for crawling thousands of pages
  • Store extracted data with Crawlee's built-in Dataset system
  • Implement proxy rotation and anti-blocking strategies
  • Handle pagination, infinite scroll, and dynamic content
  • Deploy your scraper to production with Docker

Prerequisites

Before starting, ensure you have:

  • Node.js 20+ installed (node --version)
  • TypeScript experience (types, async/await, generics)
  • Basic HTML/CSS knowledge (selectors, DOM structure)
  • A code editor — VS Code or Cursor recommended
  • Docker installed (optional, for deployment)

Why Crawlee?

Web scraping in Node.js often means stitching together Puppeteer, Cheerio, request libraries, retry logic, and queue management yourself. Crawlee provides all of this in a single, cohesive framework:

| Feature | Crawlee | DIY (Puppeteer + Cheerio) | Scrapy (Python) |
| --- | --- | --- | --- |
| Language | TypeScript/JavaScript | JavaScript | Python |
| Browser support | Playwright, Puppeteer | Manual setup | Splash/Selenium |
| Request queue | Built-in with persistence | Manual implementation | Built-in |
| Auto-retry | Configurable per request | Manual | Built-in |
| Proxy rotation | Built-in with session management | Manual | Middleware |
| Anti-blocking | Fingerprint generation, headers | Manual | Middleware |
| Data storage | Dataset + Key-Value Store | Manual (JSON/DB) | Item pipelines |
| Type safety | Full TypeScript | Optional | No |

Crawlee gives you Scrapy-level power with TypeScript-level safety and a modern developer experience.


Step 1: Project Setup

Start by creating a new Crawlee project. The CLI scaffolds everything you need:

npx crawlee create my-scraper --template playwright-ts
cd my-scraper

This generates the following project structure:

my-scraper/
├── src/
│   ├── main.ts          # Entry point
│   ├── routes.ts        # Route handlers
│   └── types.ts         # Custom types (add your own as needed)
├── storage/             # Auto-created for datasets and queues
├── package.json
├── tsconfig.json
└── Dockerfile           # Production-ready Docker setup

Install the dependencies:

npm install

This pulls in three core packages:

  • crawlee — The framework core with crawlers, queues, and storage
  • playwright — Browser automation for JavaScript-rendered pages
  • @crawlee/playwright — Playwright integration for Crawlee

Step 2: Understanding Crawlee's Architecture

Before writing code, let us understand how Crawlee works:

┌─────────────────────────────────────────────┐
│                  Crawlee                      │
│                                               │
│  ┌──────────┐    ┌──────────┐    ┌─────────┐ │
│  │ Request  │───▶│ Crawler  │───▶│ Dataset │ │
│  │  Queue   │    │ (Router) │    │ (Output)│ │
│  └──────────┘    └──────────┘    └─────────┘ │
│       │               │                       │
│       │          ┌────┴────┐                  │
│       │          │  Proxy  │                  │
│       │          │ Manager │                  │
│       │          └─────────┘                  │
│       ▼                                       │
│  Auto-retry on failure                        │
│  Concurrency control                          │
│  Rate limiting                                │
└─────────────────────────────────────────────┘

The key components are:

  1. Request Queue — Manages URLs to scrape, handles deduplication, and persists state across restarts
  2. Crawler — Processes each request using your handler function (Cheerio for HTML, Playwright for JS-rendered pages)
  3. Router — Routes different URL patterns to different handler functions
  4. Dataset — Stores extracted data as JSON lines, exportable to CSV, JSON, or any format
  5. Proxy Manager — Rotates proxies and manages sessions to avoid blocks
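As a mental model for the queue's deduplication, here is a simplified, dependency-free sketch. Crawlee actually derives a `uniqueKey` from the normalized URL; this illustration is not its exact algorithm:

```typescript
// Simplified sketch of URL deduplication (illustrative only — Crawlee's
// real uniqueKey normalization is more nuanced than lowercasing).
function uniqueKey(url: string): string {
  const u = new URL(url);
  u.hash = ''; // fragments don't change the fetched document
  return u.toString().toLowerCase();
}

const seen = new Set<string>();

function shouldEnqueue(url: string): boolean {
  const key = uniqueKey(url);
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}

console.log(shouldEnqueue('https://example.com/page#top')); // true — first sighting
console.log(shouldEnqueue('https://example.com/page'));     // false — same key, skipped
```

Because the queue also persists to disk, this deduplication survives restarts — a crashed crawl does not re-scrape completed pages.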

Step 3: Build a CheerioCrawler for Static Pages

Let us start with the fastest scraper type — CheerioCrawler. It downloads raw HTML and parses it with Cheerio (jQuery-like API), without launching a browser. Perfect for sites that do not require JavaScript rendering.

Replace the contents of src/main.ts:

import { CheerioCrawler, Dataset, log } from 'crawlee';
 
// Configure logging
log.setLevel(log.LEVELS.INFO);
 
// Create the crawler
const crawler = new CheerioCrawler({
  // Maximum number of concurrent requests
  maxConcurrency: 10,
 
  // Maximum number of requests per minute (rate limiting)
  maxRequestsPerMinute: 60,
 
  // Retry failed requests up to 3 times
  maxRequestRetries: 3,
 
  // Handler for each page
  async requestHandler({ request, $, enqueueLinks, pushData }) {
    const url = request.url;
    log.info(`Scraping: ${url}`);
 
    // Extract data from the page using CSS selectors
    const title = $('h1').first().text().trim();
    const description = $('meta[name="description"]').attr('content') || '';
    const links = $('a[href]')
      .map((_, el) => $(el).attr('href'))
      .get()
      .filter((href) => href.startsWith('http'));
 
    // Push extracted data to the dataset
    await pushData({
      url,
      title,
      description,
      linksFound: links.length,
      scrapedAt: new Date().toISOString(),
    });
 
    // Follow links on the page (breadth-first crawling)
    await enqueueLinks({
      // Only follow links matching this pattern
      globs: ['https://example.com/**'],
      // Tag enqueued requests (a router could handle them separately)
      label: 'DETAIL',
    });
  },
 
  // Called when a request fails after all retries
  async failedRequestHandler({ request }) {
    log.error(`Failed: ${request.url} — ${request.errorMessages.join(', ')}`);
  },
});
 
// Start the crawler with seed URLs
await crawler.run([
  'https://example.com',
]);
 
// Export results
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
log.info('Scraping complete! Results saved to storage/datasets/default/');

Run it:

npx tsx src/main.ts

Your extracted data is saved in storage/datasets/default/ as individual JSON files. You can also export the entire dataset:

# View the results
cat storage/datasets/default/*.json | head -50

Step 4: Build a PlaywrightCrawler for Dynamic Sites

Many modern websites render content with JavaScript. For these, you need PlaywrightCrawler, which launches a real browser:

import { PlaywrightCrawler, Dataset, log } from 'crawlee';
 
const crawler = new PlaywrightCrawler({
  // Use headless Chromium
  launchContext: {
    launchOptions: {
      headless: true,
    },
  },
 
  // Browser pages are expensive — limit concurrency
  maxConcurrency: 5,
 
  // Timeout per page (30 seconds)
  requestHandlerTimeoutSecs: 30,
 
  async requestHandler({ request, page, enqueueLinks, pushData }) {
    const url = request.url;
    log.info(`Scraping (browser): ${url}`);
 
    // Wait for the main content to render
    await page.waitForSelector('.product-card', { timeout: 10000 });
 
    // Extract product data using Playwright's evaluation
    const products = await page.$$eval('.product-card', (cards) =>
      cards.map((card) => ({
        name: card.querySelector('.product-name')?.textContent?.trim() || '',
        price: card.querySelector('.product-price')?.textContent?.trim() || '',
        rating: card.querySelector('.product-rating')?.textContent?.trim() || '',
        image: card.querySelector('img')?.getAttribute('src') || '',
      }))
    );
 
    // Push each product to the dataset
    for (const product of products) {
      await pushData({
        ...product,
        sourceUrl: url,
        scrapedAt: new Date().toISOString(),
      });
    }
 
    // Follow pagination links
    await enqueueLinks({
      selector: 'a.pagination-next',
      label: 'LISTING',
    });
  },
});
 
await crawler.run([
  'https://example-shop.com/products?page=1',
]);

When to Use Which Crawler

| Scenario | Crawler | Why |
| --- | --- | --- |
| Static HTML sites | CheerioCrawler | 10x faster, no browser overhead |
| JavaScript-rendered content | PlaywrightCrawler | Executes JS, waits for rendering |
| Single-page applications (SPAs) | PlaywrightCrawler | Handles client-side routing |
| APIs returning HTML | CheerioCrawler | Just need to parse HTML |
| Sites requiring login | PlaywrightCrawler | Can fill forms and handle auth |
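One way to decide between the two in practice: fetch the page once without a browser and check whether the content you need is already present in the raw HTML. A small heuristic sketch (the marker string and URL are illustrative):

```typescript
// If a known marker appears in the raw HTML, the site is likely static
// enough for CheerioCrawler; otherwise reach for PlaywrightCrawler.
function suggestCrawler(rawHtml: string, marker: string): 'cheerio' | 'playwright' {
  return rawHtml.includes(marker) ? 'cheerio' : 'playwright';
}

// One-off probe (Node 18+ ships a global fetch):
// const html = await (await fetch('https://example.com/products')).text();
// console.log(suggestCrawler(html, 'product-card'));

console.log(suggestCrawler('<div class="product-card">…</div>', 'product-card')); // cheerio
console.log(suggestCrawler('<div id="root"></div>', 'product-card'));             // playwright
```

An empty `<div id="root">` shell is the classic signature of a client-rendered SPA.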

Step 5: Use the Router for Multi-Page Patterns

Real scrapers need to handle different page types differently — listing pages, detail pages, search results. Crawlee's Router makes this clean:

Create src/routes.ts:

import { createPlaywrightRouter, Dataset } from 'crawlee';
 
export const router = createPlaywrightRouter();
 
// Default handler — listing pages
router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => {
  log.info(`Processing listing: ${request.url}`);
 
  // Extract links to individual items
  await enqueueLinks({
    selector: 'a.item-link',
    label: 'DETAIL',  // Route these to the DETAIL handler
  });
 
  // Handle pagination; leave these requests unlabeled so they are
  // routed back to this default handler
  await enqueueLinks({
    selector: 'a.next-page',
  });
});
 
// Detail page handler
router.addHandler('DETAIL', async ({ request, page, pushData, log }) => {
  log.info(`Processing detail: ${request.url}`);
 
  // Wait for content to load
  await page.waitForSelector('.article-content', { timeout: 10000 });
 
  // Extract structured data
  const data = await page.evaluate(() => {
    const title = document.querySelector('h1')?.textContent?.trim() || '';
    const author = document.querySelector('.author-name')?.textContent?.trim() || '';
    const date = document.querySelector('time')?.getAttribute('datetime') || '';
    const content = document.querySelector('.article-content')?.textContent?.trim() || '';
    const tags = Array.from(document.querySelectorAll('.tag'))
      .map((tag) => tag.textContent?.trim() || '');
 
    return { title, author, date, content, tags };
  });
 
  await pushData({
    ...data,
    url: request.url,
    scrapedAt: new Date().toISOString(),
  });
});
 
// Search results handler
router.addHandler('SEARCH', async ({ request, page, enqueueLinks, log }) => {
  log.info(`Processing search: ${request.url}`);
 
  const resultCount = await page.$$eval('.search-result', (results) => results.length);
  log.info(`Found ${resultCount} results`);
 
  // Enqueue each result as a DETAIL page
  await enqueueLinks({
    selector: '.search-result a',
    label: 'DETAIL',
  });
});

Update src/main.ts to use the router:

import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';
 
const crawler = new PlaywrightCrawler({
  requestHandler: router,
  maxConcurrency: 5,
  maxRequestsPerMinute: 30,
});
 
await crawler.run([
  // Unlabeled requests go to the default (listing) handler
  { url: 'https://example-blog.com/articles' },
  { url: 'https://example-blog.com/search?q=typescript', label: 'SEARCH' },
]);

Step 6: Handle Pagination and Infinite Scroll

Traditional Pagination

For sites with "Next" buttons or numbered pages:

router.addHandler('LISTING', async ({ page, enqueueLinks, request, log }) => {
  // Extract items on current page
  const items = await page.$$eval('.item', (els) =>
    els.map((el) => ({
      title: el.querySelector('h2')?.textContent?.trim(),
      url: el.querySelector('a')?.href,
    }))
  );
 
  log.info(`Page ${request.userData.page || 1}: found ${items.length} items`);
 
  // Enqueue detail pages
  for (const item of items) {
    if (item.url) {
      await enqueueLinks({
        urls: [item.url],
        label: 'DETAIL',
      });
    }
  }
 
  // Check for next page
  const nextUrl = await page
    .$eval('a.next', (el) => (el as HTMLAnchorElement).href)
    .catch(() => null);
  if (nextUrl) {
    await enqueueLinks({
      urls: [nextUrl],
      label: 'LISTING',
      userData: { page: (request.userData.page || 1) + 1 },
    });
  }
});

Infinite Scroll

For sites that load more content as you scroll:

router.addHandler('INFINITE', async ({ page, pushData, log }) => {
  let previousHeight = 0;
  let scrollAttempts = 0;
  const maxScrolls = 20;
 
  while (scrollAttempts < maxScrolls) {
    // Scroll to the bottom
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
 
    // Wait for new content to load
    await page.waitForTimeout(2000);
 
    // Check if new content appeared
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) {
      log.info('No more content to load');
      break;
    }
 
    previousHeight = currentHeight;
    scrollAttempts++;
    log.info(`Scroll ${scrollAttempts}/${maxScrolls} — height: ${currentHeight}`);
  }
 
  // Now extract all loaded items
  const allItems = await page.$$eval('.feed-item', (items) =>
    items.map((item) => ({
      text: item.querySelector('.content')?.textContent?.trim() || '',
      author: item.querySelector('.author')?.textContent?.trim() || '',
      timestamp: item.querySelector('time')?.getAttribute('datetime') || '',
    }))
  );
 
  log.info(`Extracted ${allItems.length} items total`);
 
  for (const item of allItems) {
    await pushData(item);
  }
});

Step 7: Proxy Rotation and Anti-Blocking

Getting blocked is the biggest challenge in web scraping. Crawlee has built-in tools to help:

Basic Proxy Rotation

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
 
const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
  ],
});
 
const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  requestHandler: router,
 
  // Use a new session (IP + cookies) per request
  useSessionPool: true,
  sessionPoolOptions: {
    maxPoolSize: 100,
    sessionOptions: {
      maxUsageCount: 50,  // Retire session after 50 uses
    },
  },
});

Anti-Blocking Best Practices

const crawler = new PlaywrightCrawler({
  // Randomize request timing
  maxRequestsPerMinute: 20,
 
  // Browser fingerprint randomization (built-in)
  browserPoolOptions: {
    useFingerprints: true,
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome', 'firefox'],
        operatingSystems: ['windows', 'macos', 'linux'],
        locales: ['en-US', 'en-GB'],
      },
    },
  },
 
  // Pre-navigation hooks for additional stealth
  preNavigationHooks: [
    async ({ page }) => {
      // Randomize viewport size
      const width = 1280 + Math.floor(Math.random() * 200);
      const height = 800 + Math.floor(Math.random() * 200);
      await page.setViewportSize({ width, height });
 
      // Set realistic headers
      await page.setExtraHTTPHeaders({
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
      });
    },
  ],
 
  requestHandler: router,
});

Session Management

Crawlee's session pool tracks which sessions (proxy + cookies) are healthy and retires blocked ones:

import { PlaywrightCrawler, Session } from 'crawlee';
 
const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  sessionPoolOptions: {
    maxPoolSize: 50,
    sessionOptions: {
      maxUsageCount: 30,
      maxErrorScore: 1,  // Retire after 1 error (strict)
    },
    // Custom session creation
    createSessionFunction: async (sessionPool) => {
      const session = new Session({ sessionPool });
      // Add custom cookies or auth tokens to the session
      session.setCookies(
        [{ name: 'consent', value: 'true', domain: '.example.com' }],
        'https://example.com',
      );
      return session;
    },
  },
 
  async requestHandler({ session, request, page, pushData }) {
    // Check if we got blocked
    const title = await page.title();
    if (title.includes('Access Denied') || title.includes('CAPTCHA')) {
      // Mark this session as blocked
      session?.retire();
      throw new Error(`Blocked at ${request.url}`);
    }
 
    // Normal scraping logic...
    await pushData({ title, url: request.url });
  },
});

Step 8: Data Storage and Export

Crawlee provides two storage systems:

Dataset — For Tabular Data

import { Dataset } from 'crawlee';
 
// Push data to the default dataset (inside a request handler,
// the pushData helper does the same thing)
await Dataset.pushData({
  name: 'TypeScript Handbook',
  price: 29.99,
  category: 'Programming',
});
 
// After scraping, export in different formats
const dataset = await Dataset.open();
 
// Export as JSON
await dataset.exportToJSON('output');
 
// Export as CSV
await dataset.exportToCSV('output');
 
// Iterate over all items
await dataset.forEach(async (item) => {
  console.log(item.name, item.price);
});
 
// Get all data at once
const { items } = await dataset.getData();
console.log(`Total items: ${items.length}`);

Key-Value Store — For Arbitrary Data

import { KeyValueStore } from 'crawlee';
 
const store = await KeyValueStore.open();
 
// Save screenshots (`page` comes from a crawler's request handler)
await store.setValue('homepage-screenshot', await page.screenshot(), {
  contentType: 'image/png',
});
 
// Save configuration or state
await store.setValue('scraper-config', {
  lastRun: new Date().toISOString(),
  totalPages: 1500,
  errors: 12,
});
 
// Read values back
const config = await store.getValue('scraper-config');

Custom Storage with Database Export

For production scrapers, you often want to push data to a database:

import { PlaywrightCrawler } from 'crawlee';
import { Pool } from 'pg';
 
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
});
 
const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData, request }) {
    const data = await page.evaluate(() => {
      // ... extract data
      return { title: '', price: 0, url: '' };
    });
 
    // Save to Crawlee dataset (for backup/debugging)
    await pushData(data);
 
    // Also save to PostgreSQL
    await pool.query(
      'INSERT INTO products (title, price, url, scraped_at) VALUES ($1, $2, $3, NOW()) ON CONFLICT (url) DO UPDATE SET price = $2, scraped_at = NOW()',
      [data.title, data.price, request.url]
    );
  },
});

Step 9: Error Handling and Resilience

Production scrapers must handle failures gracefully. Crawlee has built-in retry logic, but you should add custom error handling:

import { PlaywrightCrawler, Dataset, log } from 'crawlee';
 
const crawler = new PlaywrightCrawler({
  // Retry configuration
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 60,
 
  async requestHandler({ request, page, pushData, session, response }) {
    try {
      // Detect hard blocks from the HTTP status
      const statusCode = response?.status();
      if (statusCode === 403 || statusCode === 429) {
        session?.retire();
        throw new Error(`Blocked (HTTP ${statusCode}) at ${request.url}`);
      }
 
      const pageTitle = await page.title();
 
      // Detect soft blocks (page loads but shows CAPTCHA or error)
      if (
        pageTitle.toLowerCase().includes('captcha') ||
        pageTitle.toLowerCase().includes('verify') ||
        pageTitle.toLowerCase().includes('access denied')
      ) {
        session?.retire();
        throw new Error(`Soft block detected at ${request.url}`);
      }
 
      // Wait for content with a fallback
      try {
        await page.waitForSelector('.main-content', { timeout: 10000 });
      } catch {
        log.warning(`Content selector not found at ${request.url}, trying fallback`);
        await page.waitForSelector('body', { timeout: 5000 });
      }
 
      // Extract data
      const data = await page.evaluate(() => ({
        title: document.querySelector('h1')?.textContent?.trim() || 'No title',
        content: document.querySelector('.main-content')?.textContent?.trim() || '',
      }));
 
      await pushData({
        ...data,
        url: request.url,
        retryCount: request.retryCount,
      });
    } catch (error) {
      // Log the error with context
      log.error(`Error scraping ${request.url}`, {
        error: (error as Error).message,
        retryCount: request.retryCount,
      });
      throw error; // Re-throw to trigger Crawlee's retry mechanism
    }
  },
 
  // Handle requests that failed all retries
  async failedRequestHandler({ request, log }) {
    log.error(`Permanently failed: ${request.url}`, {
      errors: request.errorMessages,
      retries: request.retryCount,
    });
 
    // Save failed URLs for manual review
    const dataset = await Dataset.open('failed-requests');
    await dataset.pushData({
      url: request.url,
      errors: request.errorMessages,
      failedAt: new Date().toISOString(),
    });
  },
});
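Between retries, Crawlee spaces requests out rather than hammering the target. As a mental model, exponential backoff with jitter looks like this (an illustrative formula, not Crawlee's exact internals):

```typescript
// Exponential backoff with "half jitter": the delay doubles per retry,
// capped at maxMs, and is randomized within [cap/2, cap) so many
// workers don't retry in lockstep.
function retryDelayMs(retryCount: number, baseMs = 500, maxMs = 30_000): number {
  const cap = Math.min(baseMs * 2 ** retryCount, maxMs);
  return cap / 2 + Math.random() * (cap / 2);
}

console.log(retryDelayMs(0)); // somewhere in [250, 500)
console.log(retryDelayMs(3)); // somewhere in [2000, 4000)
```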

Step 10: Real-World Example — Scraping a Job Board

Let us build a complete scraper that extracts job listings. This example demonstrates all the concepts together:

Create src/job-scraper.ts:

import { PlaywrightCrawler, Dataset, ProxyConfiguration, log } from 'crawlee';
 
// Types for our extracted data
interface JobListing {
  title: string;
  company: string;
  location: string;
  salary: string;
  description: string;
  tags: string[];
  postedAt: string;
  url: string;
  scrapedAt: string;
}
 
// Configure the crawler
const crawler = new PlaywrightCrawler({
  maxConcurrency: 3,
  maxRequestsPerMinute: 15,
  maxRequestRetries: 3,
  requestHandlerTimeoutSecs: 45,
 
  // Stealth settings
  browserPoolOptions: {
    useFingerprints: true,
  },
 
  preNavigationHooks: [
    async ({ page }) => {
      // Block images and fonts to speed up scraping
      await page.route('**/*.{png,jpg,jpeg,gif,webp,woff,woff2}', (route) =>
        route.abort()
      );
    },
  ],
 
  async requestHandler({ request, page, enqueueLinks, pushData, log }) {
    const label = request.label || 'LISTING';
 
    if (label === 'LISTING') {
      log.info(`Scraping job listing page: ${request.url}`);
 
      // Wait for job cards to load
      await page.waitForSelector('.job-card', { timeout: 15000 });
 
      // Extract job links and enqueue them
      await enqueueLinks({
        selector: '.job-card a.job-title-link',
        label: 'JOB_DETAIL',
      });
 
      // Handle pagination
      const hasNextPage = await page.$('a[aria-label="Next page"]');
      if (hasNextPage) {
        await enqueueLinks({
          selector: 'a[aria-label="Next page"]',
          label: 'LISTING',
        });
      }
    }
 
    if (label === 'JOB_DETAIL') {
      log.info(`Scraping job detail: ${request.url}`);
 
      await page.waitForSelector('.job-detail', { timeout: 15000 });
 
      const job: JobListing = await page.evaluate(() => {
        const getText = (selector: string): string =>
          document.querySelector(selector)?.textContent?.trim() || '';
 
        return {
          title: getText('h1.job-title'),
          company: getText('.company-name'),
          location: getText('.job-location'),
          salary: getText('.salary-range'),
          description: getText('.job-description'),
          tags: Array.from(document.querySelectorAll('.skill-tag')).map(
            (tag) => tag.textContent?.trim() || ''
          ),
          postedAt: getText('.posted-date'),
          url: window.location.href,
          scrapedAt: new Date().toISOString(),
        };
      });
 
      // Validate before saving
      if (job.title && job.company) {
        await pushData(job);
        log.info(`Saved: ${job.title} at ${job.company}`);
      } else {
        log.warning(`Incomplete data at ${request.url}`);
      }
    }
  },
 
  async failedRequestHandler({ request }) {
    const dataset = await Dataset.open('failed');
    await dataset.pushData({
      url: request.url,
      errors: request.errorMessages,
    });
  },
});
 
// Run the scraper
log.info('Starting job board scraper...');
await crawler.run([
  { url: 'https://example-jobs.com/jobs?q=typescript', label: 'LISTING' },
  { url: 'https://example-jobs.com/jobs?q=react', label: 'LISTING' },
]);
 
// Export results
const dataset = await Dataset.open();
const { items } = await dataset.getData();
log.info(`Scraping complete! Extracted ${items.length} job listings.`);
await dataset.exportToJSON('jobs');
await dataset.exportToCSV('jobs');

Run the scraper:

npx tsx src/job-scraper.ts
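The inline `job.title && job.company` check can be factored into a reusable type guard. A sketch over a subset of the `JobListing` fields (the helper is hypothetical, not part of the tutorial code):

```typescript
// Subset of the tutorial's JobListing shape, for illustration
interface JobCore {
  title: string;
  company: string;
  location: string;
}

// Hypothetical guard: accept a record only when every required field
// is a non-empty string; narrows the type for the caller.
function isCompleteListing(job: Partial<JobCore>): job is JobCore {
  const required: (keyof JobCore)[] = ['title', 'company', 'location'];
  return required.every((key) => {
    const value = job[key];
    return typeof value === 'string' && value.length > 0;
  });
}

console.log(isCompleteListing({ title: 'Engineer', company: 'Acme', location: 'Remote' })); // true
console.log(isCompleteListing({ title: 'Engineer' }));                                      // false
```

Because it is a type predicate, code after `if (isCompleteListing(job))` sees `job` as the full shape with no casts.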

Step 11: Deploy with Docker

Crawlee projects come with a production-ready Dockerfile. Here is an optimized version:

FROM node:20-slim AS builder
 
WORKDIR /app
 
# Install all dependencies (dev deps are needed to compile TypeScript)
COPY package*.json ./
RUN npm ci
 
# Copy source code and build
COPY . .
RUN npm run build
 
FROM node:20-slim
 
WORKDIR /app
 
# Install production dependencies only
COPY package*.json ./
RUN npm ci --omit=dev
 
# Install Chromium and its system dependencies for Playwright
RUN npx playwright install --with-deps chromium
 
COPY --from=builder /app/dist ./dist
 
# Create storage directory
RUN mkdir -p storage
 
# Set environment variables
ENV NODE_ENV=production
ENV CRAWLEE_STORAGE_DIR=./storage
 
CMD ["node", "dist/main.js"]

Build and run:

docker build -t my-scraper .
docker run -v $(pwd)/output:/app/storage my-scraper

Running on a Schedule with Cron

# Run the scraper every day at 2 AM
0 2 * * * cd /opt/scrapers/my-scraper && docker run --rm -v /opt/data:/app/storage my-scraper

Step 12: Advanced Patterns

Resumable Crawling

Crawlee persists its request queue to disk. If your scraper crashes, restart it with storage purging disabled and it resumes where it left off (by default, Crawlee purges its default storage at startup):

const crawler = new PlaywrightCrawler({
  requestHandler: router,
});
 
// First run: processes all URLs
await crawler.run(['https://example.com/page1', 'https://example.com/page2']);
 
// Restart with CRAWLEE_PURGE_ON_START=false and already-handled
// URLs are skipped automatically

Custom Request Transformation

import { PlaywrightCrawler, Request } from 'crawlee';
 
const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, pushData }) {
    // Access custom data attached to the request
    const { category, priority } = request.userData;
 
    const data = await page.evaluate(() => ({
      title: document.title,
    }));
 
    await pushData({
      ...data,
      category,
      priority,
    });
  },
});
 
// Enqueue requests with custom metadata
await crawler.run([
  new Request({
    url: 'https://example.com/electronics',
    userData: { category: 'electronics', priority: 'high' },
  }),
  new Request({
    url: 'https://example.com/books',
    userData: { category: 'books', priority: 'medium' },
  }),
]);

Intercepting Network Requests

Monitor and modify network traffic during scraping:

import { PlaywrightCrawler, KeyValueStore } from 'crawlee';
 
const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page, request }) => {
      // Intercept API responses to get data directly
      page.on('response', async (response) => {
        const url = response.url();
        if (url.includes('/api/products')) {
          try {
            const data = await response.json();
            // Store the API response directly — often cleaner than DOM scraping
            const store = await KeyValueStore.open();
            await store.setValue(
              `api-response-${Date.now()}`,
              data
            );
          } catch {
            // Response might not be JSON
          }
        }
      });
 
      // Block unnecessary resources
      await page.route('**/*', (route) => {
        const type = route.request().resourceType();
        if (['image', 'font', 'stylesheet'].includes(type)) {
          return route.abort();
        }
        return route.continue();
      });
    },
  ],
 
  async requestHandler({ page, pushData }) {
    // The page loads faster since we blocked images/fonts/CSS
    await page.waitForSelector('.content');
    const title = await page.title();
    await pushData({ title });
  },
});

Testing Your Scraper

Before running against real sites, test with a local server:

// src/test-server.ts
import { createServer } from 'http';
 
const html = `
<!DOCTYPE html>
<html>
<body>
  <h1>Test Page</h1>
  <div class="product-card">
    <span class="product-name">Widget A</span>
    <span class="product-price">$19.99</span>
  </div>
  <div class="product-card">
    <span class="product-name">Widget B</span>
    <span class="product-price">$29.99</span>
  </div>
  <a href="/page2" class="next-page">Next</a>
</body>
</html>
`;
 
const server = createServer((req, res) => {
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
});
 
server.listen(3333, () => console.log('Test server at http://localhost:3333'));

Run the test server and your scraper against it:

# Terminal 1
npx tsx src/test-server.ts
 
# Terminal 2
# Update your crawler to target http://localhost:3333
npx tsx src/main.ts

Troubleshooting

Common Issues

Browser fails to launch in Docker: Make sure you install Playwright dependencies:

npx playwright install-deps chromium

Getting blocked frequently:

  • Reduce maxConcurrency and maxRequestsPerMinute
  • Enable proxy rotation
  • Use useFingerprints: true in browser pool options
  • Add random delays between requests

Memory issues with large crawls:

  • Use CheerioCrawler instead of PlaywrightCrawler where possible
  • Limit maxConcurrency to reduce memory usage
  • Set maxRequestsPerCrawl to process in batches
  • Close browser pages explicitly if needed

Request queue growing too large:

const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: 1000,  // Stop after 1000 requests
  requestHandler: router,
});


Conclusion

You have built a production-grade web scraper with Crawlee and TypeScript. You now know how to:

  • Choose between CheerioCrawler and PlaywrightCrawler based on your target site
  • Use routers to handle different page types cleanly
  • Implement pagination and infinite scroll handling
  • Rotate proxies and manage sessions to avoid blocks
  • Store and export data in multiple formats
  • Deploy your scraper with Docker for production use

Crawlee abstracts away the hard parts of web scraping — queue management, retries, proxy rotation, browser fingerprinting — so you can focus on writing the extraction logic that matters. Combined with TypeScript's type safety, you get scrapers that are both reliable and maintainable.

Happy scraping!

