---
name: web-scraper-builder
description: |
  Build production-ready web scrapers using Crawlee and Playwright. This skill guides you through analyzing websites, identifying selectors, and generating crawler code that follows best practices.

  Use when:
  - User wants to "scrape a website", "extract data from a page", or "build a crawler"
  - User provides a URL and asks to extract product/deal/content information
  - User needs help finding CSS selectors for web elements
  - User wants to create an Apify actor for data extraction
  - User asks about "web scraping", "DOM analysis", or "Playwright selectors"

  Triggers: "scrape", "crawler", "extract from website", "web scraping", "find selectors", "Apify actor", "Playwright", "Crawlee"
---

# Web Scraper Builder

Build production-ready Crawlee + Playwright crawlers through guided analysis and code generation.

## Workflow

### Phase 1: Gather Requirements

Ask the user:

1. **Target URL**: "What's the URL of the page you want to scrape?"
2. **Data fields**: "What data do you need to extract? (e.g., product title, price, description, image, availability)"
3. **Page type**: "Is this a listing page (multiple items) or a detail page (single item)?"
4. **Analysis method**: "Would you like me to analyze the page live using browser tools, or would you prefer to provide HTML snippets?"

### Phase 2: Analyze Page Structure

**Option A: Live Analysis (Recommended)**

Use `mcp__claude-in-chrome__*` or `mcp__chrome-devtools__*` tools to:

1. Navigate to the target URL
2. Take a screenshot for visual reference
3. Analyze the DOM structure
4. Identify data-testid attributes, semantic elements, and class patterns

**Option B: User-Provided HTML**

If the user provides HTML snippets:

1. Analyze the provided markup
2. Identify stable selector patterns
3. Suggest multiple fallback strategies

### Phase 3: Build Selectors

Follow these selector priority rules (most stable → least stable):

```typescript
// Priority 1: data-testid (most stable)
page.locator('[data-testid="product-title"]')

// Priority 2: Semantic/ARIA
page.getByRole('button', { name: /add to cart/i })
page.getByLabel('Search')

// Priority 3: Partial class match (resilient to hash suffixes)
page.locator('[class*="product-card"]')
page.locator('div[class*="price"]')

// Priority 4: Attribute selectors
page.locator('img[src*="cdn"]')
page.locator('a[href*="/product/"]')

// Priority 5: Fallback chains using .or()
page.locator('[data-testid="price"]')
  .or(page.locator('[class*="price"]'))
  .or(page.locator('.price'))
```

**Always avoid:**

- Exact class names with hashes: `.product-card-abc123`
- Deep nth-child chains: `div:nth-child(3) > div:nth-child(2)`
- Overly generic selectors: `span`, `div`

See `references/selector-patterns.md` for comprehensive selector strategies.

### Phase 4: Generate Crawler Code

Use the template at `scripts/crawler-template.ts` as a starting point (the sketch after this list shows the overall shape). Customize:

1. **Input interface** - Define expected inputs (URLs, limits, flags)
2. **Router handlers** - Create handlers for listing vs detail pages
3. **Extraction logic** - Implement field extraction with fallbacks
4. **Error handling** - Wrap extractions in try-catch blocks
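A minimal sketch of that structure, for orientation only: it is not the actual template, and the input fields (`startUrls`, `maxItems`), the `DETAIL` label, and all selectors are illustrative assumptions.

```typescript
import { PlaywrightCrawler, createPlaywrightRouter, Dataset } from 'crawlee';

// Illustrative input shape; the real template's fields may differ.
interface Input {
  startUrls: string[];
  maxItems?: number;
}

const router = createPlaywrightRouter();

// Listing pages: wait for cards, then enqueue detail links under a label.
router.addDefaultHandler(async ({ page, enqueueLinks, log }) => {
  log.info(`Listing page: ${page.url()}`);
  await page.locator('[data-testid="product-card"]').first().waitFor({ timeout: 30_000 });
  await enqueueLinks({ selector: 'a[href*="/product/"]', label: 'DETAIL' });
});

// Detail pages: extract fields through fallback chains, push to the dataset.
router.addHandler('DETAIL', async ({ page, request, log }) => {
  try {
    const title = await page
      .locator('[data-testid="product-title"]')
      .or(page.locator('h1[class*="title"]'))
      .first()
      .textContent();
    await Dataset.pushData({ url: request.url, title: title?.trim() ?? null });
  } catch (err) {
    log.warning(`Extraction failed for ${request.url}: ${(err as Error).message}`);
  }
});

const input: Input = { startUrls: ['https://example.com/category'] };

const crawler = new PlaywrightCrawler({
  requestHandler: router,
  maxRequestsPerCrawl: input.maxItems ?? 100,
});

await crawler.run(input.startUrls);
```

The label routes listing-discovered URLs to the detail handler, keeping the two extraction concerns in separate handlers.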
### Phase 5: Test & Iterate

1. Run the crawler locally: `pnpm start:dev` or `apify run --purge`
2. Check extracted data in `storage/datasets/default/`
3. Refine selectors based on edge cases
4. Add missing fallbacks for failed extractions

## Key Patterns

### Always use `page.locator()`, never `document.querySelector()`

```typescript
// ✅ Correct - Auto-waits, better errors
const title = await page.locator('h1[class*="title"]').textContent();

// ❌ Wrong - No auto-waiting, fragile
const title = await page.evaluate(() =>
  document.querySelector('h1.title')?.textContent
);
```

### Batch extraction (for listing pages)

The one exception: inside `page.evaluate()`, `querySelectorAll` is fine for batch extraction. Wait for the container with a locator first, then read the already-rendered DOM in a single round trip:

```typescript
const products = await page.evaluate(() => {
  return Array.from(document.querySelectorAll('[data-testid="product-card"]'))
    .map(card => ({
      name: card.querySelector('[data-testid="title"]')?.textContent?.trim(),
      price: card.querySelector('[data-testid="price"]')?.textContent?.trim(),
      url: card.querySelector('a')?.getAttribute('href'),
    }))
    .filter(product => product.name); // drop cards whose title selector failed
});
```

### Waiting strategies

```typescript
// Wait for element
await page.locator('[data-testid="products"]').waitFor({ timeout: 30000 });

// Wait for network idle (after clicking "Load More")
await page.waitForLoadState('networkidle', { timeout: 15000 });

// Wait for count to increase
await page.waitForFunction(
  (min) => document.querySelectorAll('.product').length > min,
  currentCount,
  { timeout: 15000 }
);
```

### Handling popups/modals

```typescript
const cookieBtn = page.locator('button:has-text("Accept")').first();
// isVisible() does not wait; waitFor() gives the banner time to appear
const appeared = await cookieBtn.waitFor({ state: 'visible', timeout: 3000 })
  .then(() => true).catch(() => false);
if (appeared) await cookieBtn.click();
await page.keyboard.press('Escape'); // Fallback for other overlays
```

## Resources

- `scripts/crawler-template.ts` - Complete Crawlee + Playwright boilerplate
- `references/selector-patterns.md` - Comprehensive selector strategy guide
- `references/common-ecommerce-patterns.md` - Patterns for product pages, deals, pagination

## Output

Deliver:

1. **Selectors document** - List of identified selectors with confidence ratings
2. **Crawler code** - Complete TypeScript file using Crawlee + Playwright
3. **Testing instructions** - How to run and validate the crawler
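For the testing instructions, a minimal validation sketch, assuming Crawlee's default local storage layout (one JSON file per record under `storage/datasets/default/`); the `REQUIRED_FIELDS` list is a placeholder for whatever fields the crawler actually extracts.

```typescript
import { readdir, readFile } from 'node:fs/promises';
import path from 'node:path';

// Placeholder field list; replace with the fields your crawler extracts.
const REQUIRED_FIELDS = ['name', 'price'];
const DATASET_DIR = 'storage/datasets/default';

const files = (await readdir(DATASET_DIR)).filter((f) => f.endsWith('.json'));
let failures = 0;

for (const file of files) {
  const record = JSON.parse(await readFile(path.join(DATASET_DIR, file), 'utf8'));
  const missing = REQUIRED_FIELDS.filter((field) => record[field] == null);
  if (missing.length > 0) {
    failures += 1;
    console.warn(`${file}: missing ${missing.join(', ')}`);
  }
}

console.log(`Checked ${files.length} records; ${failures} with missing fields.`);
```

Run it with `npx tsx check-dataset.ts` (hypothetical filename) after a local crawl; a non-zero failure count points at selectors that need another fallback (Phase 5, step 4).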