Crawlee Web Scraping with TypeScript: Build Production Scrapers from Zero to Deployment

Web scraping done right. Crawlee is the open-source TypeScript framework by Apify that handles the hard parts — request queues, retries, proxy rotation, and anti-blocking — so you can focus on extracting data. In this tutorial, you will build a complete production scraper from scratch.
What You Will Learn
By the end of this tutorial, you will:
- Set up a Crawlee project with TypeScript from scratch
- Build scrapers using PlaywrightCrawler for JavaScript-heavy sites
- Use CheerioCrawler for fast, lightweight HTML scraping
- Manage request queues for crawling thousands of pages
- Store extracted data with Crawlee's built-in Dataset system
- Implement proxy rotation and anti-blocking strategies
- Handle pagination, infinite scroll, and dynamic content
- Deploy your scraper to production with Docker
Prerequisites
Before starting, ensure you have:
- Node.js 20+ installed (check with `node --version`)
- TypeScript experience (types, async/await, generics)
- Basic HTML/CSS knowledge (selectors, DOM structure)
- A code editor — VS Code or Cursor recommended
- Docker installed (optional, for deployment)
Why Crawlee?
Web scraping in Node.js often means stitching together Puppeteer, Cheerio, request libraries, retry logic, and queue management yourself. Crawlee provides all of this in a single, cohesive framework:
| Feature | Crawlee | DIY (Puppeteer + Cheerio) | Scrapy (Python) |
|---|---|---|---|
| Language | TypeScript/JavaScript | JavaScript | Python |
| Browser support | Playwright, Puppeteer | Manual setup | Splash/Selenium |
| Request queue | Built-in with persistence | Manual implementation | Built-in |
| Auto-retry | Configurable per request | Manual | Built-in |
| Proxy rotation | Built-in with session management | Manual | Middleware |
| Anti-blocking | Fingerprint generation, headers | Manual | Middleware |
| Data storage | Dataset + Key-Value Store | Manual (JSON/DB) | Item pipelines |
| Type safety | Full TypeScript | Optional | No |
Crawlee gives you Scrapy-level power with TypeScript-level safety and a modern developer experience.
Step 1: Project Setup
Start by creating a new Crawlee project. The CLI scaffolds everything you need:
npx crawlee create my-scraper --template playwright-ts
cd my-scraper

This generates the following project structure:
my-scraper/
├── src/
│ ├── main.ts # Entry point
│ ├── routes.ts # Route handlers
│ └── types.ts # Custom types (we will add this)
├── storage/ # Auto-created for datasets and queues
├── package.json
├── tsconfig.json
└── Dockerfile # Production-ready Docker setup
Install the dependencies:
npm install

Crawlee installs three core packages:
- crawlee — the framework core with crawlers, queues, and storage
- playwright — browser automation for JavaScript-rendered pages
- @crawlee/playwright — the Playwright integration for Crawlee
Step 2: Understanding Crawlee's Architecture
Before writing code, let us understand how Crawlee works:
┌─────────────────────────────────────────────┐
│ Crawlee │
│ │
│ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │ Request │───▶│ Crawler │───▶│ Dataset │ │
│ │ Queue │ │ (Router) │ │ (Output)│ │
│ └──────────┘ └──────────┘ └─────────┘ │
│ │ │ │
│ │ ┌────┴────┐ │
│ │ │ Proxy │ │
│ │ │ Manager │ │
│ │ └─────────┘ │
│ ▼ │
│ Auto-retry on failure │
│ Concurrency control │
│ Rate limiting │
└─────────────────────────────────────────────┘
The key components are:
- Request Queue — Manages URLs to scrape, handles deduplication, and persists state across restarts
- Crawler — Processes each request using your handler function (Cheerio for HTML, Playwright for JS-rendered pages)
- Router — Routes different URL patterns to different handler functions
- Dataset — Stores extracted data as JSON lines, exportable to CSV, JSON, or any format
- Proxy Manager — Rotates proxies and manages sessions to avoid blocks
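The Request Queue's deduplication is worth internalizing before you rely on it. Here is a minimal, dependency-free sketch of the idea — the real RequestQueue also persists state to disk and computes its uniqueKey with its own normalization rules; SimpleRequestQueue and toUniqueKey are illustrative names invented for this sketch:

```typescript
// Simplified model of request-queue deduplication (illustrative only --
// Crawlee's RequestQueue also persists to disk and tracks retries).
type QueuedRequest = { url: string; uniqueKey: string; label?: string };

class SimpleRequestQueue {
  private seen = new Set<string>();
  private pending: QueuedRequest[] = [];

  // Normalize the URL so trivial variants dedupe to the same key.
  private toUniqueKey(url: string): string {
    const u = new URL(url);
    u.hash = ''; // the fragment does not change the fetched page
    return u.toString().replace(/\/$/, '');
  }

  // Returns true if the request was new and got enqueued.
  enqueue(url: string, label?: string): boolean {
    const uniqueKey = this.toUniqueKey(url);
    if (this.seen.has(uniqueKey)) return false; // duplicate -- skipped
    this.seen.add(uniqueKey);
    this.pending.push({ url, uniqueKey, label });
    return true;
  }

  next(): QueuedRequest | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

const queue = new SimpleRequestQueue();
queue.enqueue('https://example.com/page');
queue.enqueue('https://example.com/page#section'); // deduped
queue.enqueue('https://example.com/page/');        // deduped
console.log(queue.size); // 1
```

Because deduplication keys off a normalized URL, calling enqueueLinks repeatedly over overlapping link sets is safe — duplicates are silently dropped.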
Step 3: Build a CheerioCrawler for Static Pages
Let us start with the fastest scraper type — CheerioCrawler. It downloads raw HTML and parses it with Cheerio (jQuery-like API), without launching a browser. Perfect for sites that do not require JavaScript rendering.
Replace the contents of src/main.ts:
import { CheerioCrawler, Dataset, log } from 'crawlee';
// Configure logging
log.setLevel(log.LEVELS.INFO);
// Create the crawler
const crawler = new CheerioCrawler({
// Maximum number of concurrent requests
maxConcurrency: 10,
// Maximum number of requests per minute (rate limiting)
maxRequestsPerMinute: 60,
// Retry failed requests up to 3 times
maxRequestRetries: 3,
// Handler for each page
async requestHandler({ request, $, enqueueLinks, pushData }) {
const url = request.url;
log.info(`Scraping: ${url}`);
// Extract data from the page using CSS selectors
const title = $('h1').first().text().trim();
const description = $('meta[name="description"]').attr('content') || '';
const links = $('a[href]')
.map((_, el) => $(el).attr('href'))
.get()
.filter((href) => href.startsWith('http'));
// Push extracted data to the dataset
await pushData({
url,
title,
description,
linksFound: links.length,
scrapedAt: new Date().toISOString(),
});
// Follow links on the page (breadth-first crawling)
await enqueueLinks({
// Only follow links matching this pattern
globs: ['https://example.com/**'],
// Label enqueued requests so a router can dispatch them later
label: 'DETAIL',
});
},
// Called when a request fails after all retries
async failedRequestHandler({ request }) {
log.error(`Failed: ${request.url} — ${request.errorMessages.join(', ')}`);
},
});
// Start the crawler with seed URLs
await crawler.run([
'https://example.com',
]);
// Export results
const dataset = await Dataset.open();
await dataset.exportToJSON('results');
log.info('Scraping complete! Results saved to storage/datasets/default/');

Run it:
npx tsx src/main.ts

Your extracted data is saved in storage/datasets/default/ as individual JSON files. You can also export the entire dataset:
# View the results
cat storage/datasets/default/*.json | head -50

Step 4: Build a PlaywrightCrawler for Dynamic Sites
Many modern websites render content with JavaScript. For these, you need PlaywrightCrawler, which launches a real browser:
import { PlaywrightCrawler, Dataset, log } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Use headless Chromium
launchContext: {
launchOptions: {
headless: true,
},
},
// Browser pages are expensive — limit concurrency
maxConcurrency: 5,
// Timeout per page (30 seconds)
requestHandlerTimeoutSecs: 30,
async requestHandler({ request, page, enqueueLinks, pushData }) {
const url = request.url;
log.info(`Scraping (browser): ${url}`);
// Wait for the main content to render
await page.waitForSelector('.product-card', { timeout: 10000 });
// Extract product data using Playwright's evaluation
const products = await page.$$eval('.product-card', (cards) =>
cards.map((card) => ({
name: card.querySelector('.product-name')?.textContent?.trim() || '',
price: card.querySelector('.product-price')?.textContent?.trim() || '',
rating: card.querySelector('.product-rating')?.textContent?.trim() || '',
image: card.querySelector('img')?.getAttribute('src') || '',
}))
);
// Push each product to the dataset
for (const product of products) {
await pushData({
...product,
sourceUrl: url,
scrapedAt: new Date().toISOString(),
});
}
// Follow pagination links
await enqueueLinks({
selector: 'a.pagination-next',
label: 'LISTING',
});
},
});
await crawler.run([
'https://example-shop.com/products?page=1',
]);

When to Use Which Crawler
| Scenario | Crawler | Why |
|---|---|---|
| Static HTML sites | CheerioCrawler | 10x faster, no browser overhead |
| JavaScript-rendered content | PlaywrightCrawler | Executes JS, waits for rendering |
| Single-page applications (SPAs) | PlaywrightCrawler | Handles client-side routing |
| APIs returning HTML | CheerioCrawler | Just need to parse HTML |
| Sites requiring login | PlaywrightCrawler | Can fill forms and handle auth |
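If it helps, the table above condenses into a tiny decision helper. This is purely illustrative — the SiteProfile fields are invented for this sketch; in practice you determine them by inspecting the target site with DevTools:

```typescript
// Illustrative helper encoding the crawler-choice table above.
interface SiteProfile {
  rendersWithJavaScript: boolean; // content appears only after JS runs
  isSinglePageApp: boolean;       // client-side routing
  requiresLogin: boolean;         // forms / auth flows
}

type CrawlerChoice = 'CheerioCrawler' | 'PlaywrightCrawler';

function chooseCrawler(site: SiteProfile): CrawlerChoice {
  // Anything that needs a real browser forces Playwright.
  if (site.rendersWithJavaScript || site.isSinglePageApp || site.requiresLogin) {
    return 'PlaywrightCrawler';
  }
  // Static HTML: Cheerio avoids browser overhead entirely.
  return 'CheerioCrawler';
}

console.log(
  chooseCrawler({ rendersWithJavaScript: false, isSinglePageApp: false, requiresLogin: false })
); // -> CheerioCrawler
```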
Step 5: Use the Router for Multi-Page Patterns
Real scrapers need to handle different page types differently — listing pages, detail pages, search results. Crawlee's Router makes this clean:
Create src/routes.ts:
import { createPlaywrightRouter, Dataset } from 'crawlee';
export const router = createPlaywrightRouter();
// Default handler — listing pages
router.addDefaultHandler(async ({ request, page, enqueueLinks, log }) => {
log.info(`Processing listing: ${request.url}`);
// Extract links to individual items
await enqueueLinks({
selector: 'a.item-link',
label: 'DETAIL', // Route these to the DETAIL handler
});
// Handle pagination
const nextButton = await page.$('a.next-page');
if (nextButton) {
await enqueueLinks({
selector: 'a.next-page',
// No label here, so the default (listing) handler processes the next page
});
}
});
// Detail page handler
router.addHandler('DETAIL', async ({ request, page, pushData, log }) => {
log.info(`Processing detail: ${request.url}`);
// Wait for content to load
await page.waitForSelector('.article-content', { timeout: 10000 });
// Extract structured data
const data = await page.evaluate(() => {
const title = document.querySelector('h1')?.textContent?.trim() || '';
const author = document.querySelector('.author-name')?.textContent?.trim() || '';
const date = document.querySelector('time')?.getAttribute('datetime') || '';
const content = document.querySelector('.article-content')?.textContent?.trim() || '';
const tags = Array.from(document.querySelectorAll('.tag'))
.map((tag) => tag.textContent?.trim() || '');
return { title, author, date, content, tags };
});
await pushData({
...data,
url: request.url,
scrapedAt: new Date().toISOString(),
});
});
// Search results handler
router.addHandler('SEARCH', async ({ request, page, enqueueLinks, log }) => {
log.info(`Processing search: ${request.url}`);
const resultCount = await page.$$eval('.search-result', (results) => results.length);
log.info(`Found ${resultCount} results`);
// Enqueue each result as a DETAIL page
await enqueueLinks({
selector: '.search-result a',
label: 'DETAIL',
});
});

Update src/main.ts to use the router:
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';
const crawler = new PlaywrightCrawler({
requestHandler: router,
maxConcurrency: 5,
maxRequestsPerMinute: 30,
});
await crawler.run([
{ url: 'https://example-blog.com/articles' }, // no label → default (listing) handler
{ url: 'https://example-blog.com/search?q=typescript', label: 'SEARCH' },
]);

Step 6: Handle Pagination and Infinite Scroll
Traditional Pagination
For sites with "Next" buttons or numbered pages:
router.addHandler('LISTING', async ({ page, enqueueLinks, request, log }) => {
// Extract items on current page
const items = await page.$$eval('.item', (els) =>
els.map((el) => ({
title: el.querySelector('h2')?.textContent?.trim(),
url: el.querySelector('a')?.href,
}))
);
log.info(`Page ${request.userData.page || 1}: found ${items.length} items`);
// Enqueue detail pages
for (const item of items) {
if (item.url) {
await enqueueLinks({
urls: [item.url],
label: 'DETAIL',
});
}
}
// Check for next page
const nextUrl = await page.$eval('a.next', (el) => (el as HTMLAnchorElement).href).catch(() => null);
if (nextUrl) {
await enqueueLinks({
urls: [nextUrl],
label: 'LISTING',
userData: { page: (request.userData.page || 1) + 1 },
});
}
});

Infinite Scroll
For sites that load more content as you scroll:
router.addHandler('INFINITE', async ({ page, pushData, log }) => {
let previousHeight = 0;
let scrollAttempts = 0;
const maxScrolls = 20;
while (scrollAttempts < maxScrolls) {
// Scroll to the bottom
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
// Wait for new content to load
await page.waitForTimeout(2000);
// Check if new content appeared
const currentHeight = await page.evaluate(() => document.body.scrollHeight);
if (currentHeight === previousHeight) {
log.info('No more content to load');
break;
}
previousHeight = currentHeight;
scrollAttempts++;
log.info(`Scroll ${scrollAttempts}/${maxScrolls} — height: ${currentHeight}`);
}
// Now extract all loaded items
const allItems = await page.$$eval('.feed-item', (items) =>
items.map((item) => ({
text: item.querySelector('.content')?.textContent?.trim() || '',
author: item.querySelector('.author')?.textContent?.trim() || '',
timestamp: item.querySelector('time')?.getAttribute('datetime') || '',
}))
);
log.info(`Extracted ${allItems.length} items total`);
for (const item of allItems) {
await pushData(item);
}
});

Step 7: Proxy Rotation and Anti-Blocking
Getting blocked is the biggest challenge in web scraping. Crawlee has built-in tools to help:
Basic Proxy Rotation
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
const proxyConfiguration = new ProxyConfiguration({
proxyUrls: [
'http://user:pass@proxy1.example.com:8080',
'http://user:pass@proxy2.example.com:8080',
'http://user:pass@proxy3.example.com:8080',
],
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
requestHandler: router,
// Rotate sessions (proxy + cookies) from a managed pool
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 100,
sessionOptions: {
maxUsageCount: 50, // Retire session after 50 uses
},
},
});

Anti-Blocking Best Practices
const crawler = new PlaywrightCrawler({
// Randomize request timing
maxRequestsPerMinute: 20,
// Browser fingerprint randomization (built-in)
browserPoolOptions: {
useFingerprints: true,
fingerprintOptions: {
fingerprintGeneratorOptions: {
browsers: ['chrome', 'firefox'],
operatingSystems: ['windows', 'macos', 'linux'],
locales: ['en-US', 'en-GB'],
},
},
},
// Pre-navigation hooks for additional stealth
preNavigationHooks: [
async ({ page }) => {
// Randomize viewport size
const width = 1280 + Math.floor(Math.random() * 200);
const height = 800 + Math.floor(Math.random() * 200);
await page.setViewportSize({ width, height });
// Set realistic headers
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
});
},
],
requestHandler: router,
});

Session Management
Crawlee's session pool tracks which sessions (proxy + cookies) are healthy and retires blocked ones:
import { PlaywrightCrawler, Session } from 'crawlee';
const crawler = new PlaywrightCrawler({
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 50,
sessionOptions: {
maxUsageCount: 30,
maxErrorScore: 1, // Retire after 1 error (strict)
},
// Custom session creation
createSessionFunction: async (sessionPool) => {
const session = new Session({ sessionPool });
// Add custom cookies or auth tokens to the session
// (setCookies takes the cookie list plus the URL they apply to)
session.setCookies(
[{ name: 'consent', value: 'true', domain: '.example.com' }],
'https://example.com'
);
return session;
},
},
async requestHandler({ session, request, page, pushData }) {
// Check if we got blocked
const title = await page.title();
if (title.includes('Access Denied') || title.includes('CAPTCHA')) {
// Mark this session as blocked
session?.retire();
throw new Error(`Blocked at ${request.url}`);
}
// Normal scraping logic...
await pushData({ title, url: request.url });
},
});

Step 8: Data Storage and Export
Crawlee provides two storage systems:
Dataset — For Tabular Data
import { Dataset } from 'crawlee';
// Push data during scraping (inside a crawler handler you get a
// pushData helper; in a standalone script use the static method)
await Dataset.pushData({
name: 'TypeScript Handbook',
price: 29.99,
category: 'Programming',
});
// After scraping, export in different formats
const dataset = await Dataset.open();
// Export as JSON
await dataset.exportToJSON('output');
// Export as CSV
await dataset.exportToCSV('output');
// Iterate over all items
await dataset.forEach(async (item) => {
console.log(item.name, item.price);
});
// Get all data at once
const { items } = await dataset.getData();
console.log(`Total items: ${items.length}`);

Key-Value Store — For Arbitrary Data
import { KeyValueStore } from 'crawlee';
const store = await KeyValueStore.open();
// Save screenshots
await store.setValue('homepage-screenshot', await page.screenshot(), {
contentType: 'image/png',
});
// Save configuration or state
await store.setValue('scraper-config', {
lastRun: new Date().toISOString(),
totalPages: 1500,
errors: 12,
});
// Read values back
const config = await store.getValue('scraper-config');

Custom Storage with Database Export
For production scrapers, you often want to push data to a database:
import { PlaywrightCrawler } from 'crawlee';
import { Pool } from 'pg';
const pool = new Pool({
connectionString: process.env.DATABASE_URL,
});
const crawler = new PlaywrightCrawler({
async requestHandler({ page, pushData, request }) {
const data = await page.evaluate(() => {
// ... extract data
return { title: '', price: 0, url: '' };
});
// Save to Crawlee dataset (for backup/debugging)
await pushData(data);
// Also save to PostgreSQL
await pool.query(
'INSERT INTO products (title, price, url, scraped_at) VALUES ($1, $2, $3, NOW()) ON CONFLICT (url) DO UPDATE SET price = $2, scraped_at = NOW()',
[data.title, data.price, request.url]
);
},
});

Step 9: Error Handling and Resilience
Production scrapers must handle failures gracefully. Crawlee has built-in retry logic, but you should add custom error handling:
import { PlaywrightCrawler, Dataset, log } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Retry configuration
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 60,
async requestHandler({ request, page, pushData, session }) {
try {
const pageTitle = await page.title();
// Detect soft blocks (page loads but shows CAPTCHA or error)
if (
pageTitle.toLowerCase().includes('captcha') ||
pageTitle.toLowerCase().includes('verify') ||
pageTitle.toLowerCase().includes('access denied')
) {
session?.retire();
throw new Error(`Soft block detected at ${request.url}`);
}
// Wait for content with a fallback
try {
await page.waitForSelector('.main-content', { timeout: 10000 });
} catch {
log.warning(`Content selector not found at ${request.url}, trying fallback`);
await page.waitForSelector('body', { timeout: 5000 });
}
// Extract data
const data = await page.evaluate(() => ({
title: document.querySelector('h1')?.textContent?.trim() || 'No title',
content: document.querySelector('.main-content')?.textContent?.trim() || '',
}));
await pushData({
...data,
url: request.url,
retryCount: request.retryCount,
});
} catch (error) {
// Log the error with context
log.error(`Error scraping ${request.url}`, {
error: (error as Error).message,
retryCount: request.retryCount,
});
throw error; // Re-throw to trigger Crawlee's retry mechanism
}
},
// Handle requests that failed all retries
async failedRequestHandler({ request, log }) {
log.error(`Permanently failed: ${request.url}`, {
errors: request.errorMessages,
retries: request.retryCount,
});
// Save failed URLs for manual review
const dataset = await Dataset.open('failed-requests');
await dataset.pushData({
url: request.url,
errors: request.errorMessages,
failedAt: new Date().toISOString(),
});
},
});

Step 10: Real-World Example — Scraping a Job Board
Let us build a complete scraper that extracts job listings. This example demonstrates all the concepts together:
Create src/job-scraper.ts:
import { PlaywrightCrawler, Dataset, log } from 'crawlee';
// Types for our extracted data
interface JobListing {
title: string;
company: string;
location: string;
salary: string;
description: string;
tags: string[];
postedAt: string;
url: string;
scrapedAt: string;
}
// Configure the crawler
const crawler = new PlaywrightCrawler({
maxConcurrency: 3,
maxRequestsPerMinute: 15,
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 45,
// Stealth settings
browserPoolOptions: {
useFingerprints: true,
},
preNavigationHooks: [
async ({ page }) => {
// Block images and fonts to speed up scraping
await page.route('**/*.{png,jpg,jpeg,gif,webp,woff,woff2}', (route) =>
route.abort()
);
},
],
async requestHandler({ request, page, enqueueLinks, pushData, log }) {
const label = request.label || 'LISTING';
if (label === 'LISTING') {
log.info(`Scraping job listing page: ${request.url}`);
// Wait for job cards to load
await page.waitForSelector('.job-card', { timeout: 15000 });
// Extract job links and enqueue them
await enqueueLinks({
selector: '.job-card a.job-title-link',
label: 'JOB_DETAIL',
});
// Handle pagination
const hasNextPage = await page.$('a[aria-label="Next page"]');
if (hasNextPage) {
await enqueueLinks({
selector: 'a[aria-label="Next page"]',
label: 'LISTING',
});
}
}
if (label === 'JOB_DETAIL') {
log.info(`Scraping job detail: ${request.url}`);
await page.waitForSelector('.job-detail', { timeout: 15000 });
const job: JobListing = await page.evaluate(() => {
const getText = (selector: string): string =>
document.querySelector(selector)?.textContent?.trim() || '';
return {
title: getText('h1.job-title'),
company: getText('.company-name'),
location: getText('.job-location'),
salary: getText('.salary-range'),
description: getText('.job-description'),
tags: Array.from(document.querySelectorAll('.skill-tag')).map(
(tag) => tag.textContent?.trim() || ''
),
postedAt: getText('.posted-date'),
url: window.location.href,
scrapedAt: new Date().toISOString(),
};
});
// Validate before saving
if (job.title && job.company) {
await pushData(job);
log.info(`Saved: ${job.title} at ${job.company}`);
} else {
log.warning(`Incomplete data at ${request.url}`);
}
}
},
async failedRequestHandler({ request }) {
const dataset = await Dataset.open('failed');
await dataset.pushData({
url: request.url,
errors: request.errorMessages,
});
},
});
// Run the scraper
log.info('Starting job board scraper...');
await crawler.run([
{ url: 'https://example-jobs.com/jobs?q=typescript', label: 'LISTING' },
{ url: 'https://example-jobs.com/jobs?q=react', label: 'LISTING' },
]);
// Export results
const dataset = await Dataset.open();
const { items } = await dataset.getData();
log.info(`Scraping complete! Extracted ${items.length} job listings.`);
await dataset.exportToJSON('jobs');
await dataset.exportToCSV('jobs');

Run the scraper:
npx tsx src/job-scraper.ts

Step 11: Deploy with Docker
Crawlee projects come with a production-ready Dockerfile. Here is an optimized version:
FROM node:20-slim AS builder
WORKDIR /app
# Install all dependencies (dev deps are needed for the TypeScript build)
COPY package*.json ./
RUN npm ci
# Copy source code and build
COPY . .
RUN npm run build

FROM node:20-slim
WORKDIR /app
# Install production dependencies only
COPY package*.json ./
RUN npm ci --omit=dev
# Install Chromium and its system dependencies for Playwright
RUN npx playwright install --with-deps chromium
COPY --from=builder /app/dist ./dist
# Create storage directory
RUN mkdir -p storage
# Set environment variables
ENV NODE_ENV=production
ENV CRAWLEE_STORAGE_DIR=./storage
CMD ["node", "dist/main.js"]

Build and run:
docker build -t my-scraper .
docker run -v $(pwd)/output:/app/storage my-scraper

Running on a Schedule with Cron
# Run the scraper every day at 2 AM
0 2 * * * cd /opt/scrapers/my-scraper && docker run --rm -v /opt/data:/app/storage my-scraper

Step 12: Advanced Patterns
Resumable Crawling
Crawlee persists its request queue to disk. If your scraper crashes, just restart it — it resumes where it left off:
const crawler = new PlaywrightCrawler({
// Enable persistent storage (default is local filesystem)
requestHandler: router,
});
// First run: processes all URLs
await crawler.run(['https://example.com/page1', 'https://example.com/page2']);
// If you restart, already-completed URLs are skipped automatically

Custom Request Transformation
import { PlaywrightCrawler, Request } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, pushData }) {
// Access custom data attached to the request
const { category, priority } = request.userData;
const data = await page.evaluate(() => ({
title: document.title,
}));
await pushData({
...data,
category,
priority,
});
},
});
// Enqueue requests with custom metadata
await crawler.run([
new Request({
url: 'https://example.com/electronics',
userData: { category: 'electronics', priority: 'high' },
}),
new Request({
url: 'https://example.com/books',
userData: { category: 'books', priority: 'medium' },
}),
]);

Intercepting Network Requests
Monitor and modify network traffic during scraping:
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page, request }) => {
// Intercept API responses to get data directly
page.on('response', async (response) => {
const url = response.url();
if (url.includes('/api/products')) {
try {
const data = await response.json();
// Store the API response directly — often cleaner than DOM scraping
const store = await KeyValueStore.open();
await store.setValue(
`api-response-${Date.now()}`,
data
);
} catch {
// Response might not be JSON
}
}
});
// Block unnecessary resources
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'font', 'stylesheet'].includes(type)) {
return route.abort();
}
return route.continue();
});
},
],
async requestHandler({ page, pushData }) {
// The page loads faster since we blocked images/fonts/CSS
await page.waitForSelector('.content');
const title = await page.title();
await pushData({ title });
},
});

Testing Your Scraper
Before running against real sites, test with a local server:
// src/test-server.ts
import { createServer } from 'http';
const html = `
<!DOCTYPE html>
<html>
<body>
<h1>Test Page</h1>
<div class="product-card">
<span class="product-name">Widget A</span>
<span class="product-price">$19.99</span>
</div>
<div class="product-card">
<span class="product-name">Widget B</span>
<span class="product-price">$29.99</span>
</div>
<a href="/page2" class="next-page">Next</a>
</body>
</html>
`;
const server = createServer((req, res) => {
res.writeHead(200, { 'Content-Type': 'text/html' });
res.end(html);
});
server.listen(3333, () => console.log('Test server at http://localhost:3333'));

Run the test server and your scraper against it:
# Terminal 1
npx tsx src/test-server.ts
# Terminal 2
# Update your crawler to target http://localhost:3333
npx tsx src/main.ts

Troubleshooting
Common Issues
Browser fails to launch in Docker: Make sure you install Playwright dependencies:
npx playwright install-deps chromium

Getting blocked frequently:
- Reduce `maxConcurrency` and `maxRequestsPerMinute`
- Enable proxy rotation
- Use `useFingerprints: true` in browser pool options
- Add random delays between requests
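The last tip — random delays — can be as simple as adding jitter to a base pause. A sketch (`jitteredDelayMs` is a hypothetical helper, not a Crawlee API):

```typescript
// Hypothetical helper: pick a delay in [baseMs, baseMs * (1 + jitterRatio)),
// so requests do not arrive at a perfectly regular rhythm.
function jitteredDelayMs(baseMs: number, jitterRatio = 0.5): number {
  return baseMs + Math.random() * baseMs * jitterRatio;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Inside a requestHandler you might pause before extracting:
// await sleep(jitteredDelayMs(1000)); // waits between 1.0 and 1.5 seconds
```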
Memory issues with large crawls:
- Use `CheerioCrawler` instead of `PlaywrightCrawler` where possible
- Limit `maxConcurrency` to reduce memory usage
- Set `maxRequestsPerCrawl` to process in batches
- Close browser pages explicitly if needed
Request queue growing too large:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 1000, // Stop after 1000 requests
requestHandler: router,
});

Next Steps
Now that you have a working Crawlee scraper, here are some ways to extend it:
- Add a database — Store results in PostgreSQL using Drizzle ORM (see our Drizzle tutorial)
- Build an API — Serve scraped data through a REST API with Hono (see our Hono tutorial)
- Schedule runs — Use GitHub Actions (see our CI/CD tutorial) or cron jobs
- Add AI extraction — Combine with Claude for intelligent data parsing (see our AI scraper tutorial)
- Monitor with dashboards — Track scraping metrics with OpenTelemetry
Conclusion
You have built a production-grade web scraper with Crawlee and TypeScript. You now know how to:
- Choose between `CheerioCrawler` and `PlaywrightCrawler` based on your target site
- Use routers to handle different page types cleanly
- Implement pagination and infinite scroll handling
- Rotate proxies and manage sessions to avoid blocks
- Store and export data in multiple formats
- Deploy your scraper with Docker for production use
Crawlee abstracts away the hard parts of web scraping — queue management, retries, proxy rotation, browser fingerprinting — so you can focus on writing the extraction logic that matters. Combined with TypeScript's type safety, you get scrapers that are both reliable and maintainable.
Happy scraping!