# crawl4ai
> Web crawling and data extraction using the crwl CLI. Use this skill when users need to scrape websites, extract markdown content, handle JavaScript-heavy pages, or extract structured data from web pages.
- Author: Marcel Hild
- Repository: durandom/skills
- Version: 20260206095937
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/durandom/skills
- Web: https://mule.run/skillshub/@@durandom/skills~crawl4ai:20260206095937
---
---
name: crawl4ai
description: Web crawling and data extraction using the crwl CLI. Use this skill when users need to scrape websites, extract markdown content, handle JavaScript-heavy pages, or extract structured data from web pages.
version: 1.0.0
crawl4ai_cli: crwl
last_updated: 2025-01-19
---
Extract web content as clean markdown using the `crwl` CLI. This skill handles static sites, JavaScript-rendered SPAs, and structured data extraction without requiring Python code.
**Primary use cases:**
- Convert documentation sites to markdown
- Scrape JavaScript-heavy pages (React, Vue, Obsidian Publish)
- Extract structured data from e-commerce, news, or catalog sites
- Batch process multiple URLs
**Basic crawl:**
```bash
crwl crawl -o md https://example.com
```
**Save to file:**
```bash
crwl crawl -o md -O output.md https://example.com
```
**JavaScript-heavy site (use pre-made config):**
```bash
crwl crawl -C configs/spa.yaml -bc -o md "https://spa-site.com"
```
```bash
# Using uv (recommended) - include [all] for ML features
uv tool install 'crawl4ai[all]'
# Or with pip
pip install 'crawl4ai[all]'
# Run setup after installation
crawl4ai-setup
# Verify installation
crwl --help
crawl4ai-doctor
```
> **Note:** The `[all]` extra is required for ML features. Without it, `crawl4ai-download-models` will fail.
**Markdown extraction:**
```bash
crwl crawl -o md https://docs.example.com # Basic
crwl crawl -o md-fit https://docs.example.com # Filtered (removes boilerplate)
crwl crawl -o md -O docs.md https://docs.example.com # Save to file
```
**Structured data extraction:**
```bash
crwl crawl -j "extract product names and prices" https://shop.com # LLM-based
crwl crawl -s schema.json https://shop.com # Schema-based (faster)
```
**Deep crawling:**
```bash
crwl crawl --deep-crawl bfs --max-pages 10 https://docs.example.com # Breadth-first
crwl crawl --deep-crawl dfs --max-pages 5 https://blog.example.com # Depth-first
```
**Question answering:**
```bash
crwl crawl -q "What are the main features?" https://product.example.com
```
**Critical pattern for JavaScript-heavy sites (React, Vue, Obsidian Publish, etc.):**
CSS selectors alone don't workâthe element exists but content loads asynchronously. Use a **JS wait condition** that checks for actual rendered text.
**Use the pre-made SPA config:**
```bash
crwl crawl -C configs/spa.yaml -bc -o md "https://spa-site.com"
```
**Or create a site-specific config:**
```yaml
# my_site_config.yaml
page_timeout: 60000
wait_for: "js:() => document.body.innerText.includes('expected content')"
```
```bash
crwl crawl -C my_site_config.yaml -bc -o md "https://my-spa-site.com"
```
**Why `-bc` (bypass cache)?** SPA content can be cached before JS renders. Always use `-bc` for SPAs.
**Pre-made configs in `configs/` folder:**
| Config | Use Case |
|--------|----------|
| `spa.yaml` | Generic SPA (waits for content length > 1000) |
| `obsidian-publish.yaml` | Obsidian Publish sites |
| `slow-site.yaml` | Sites with slow loading (60s timeout) |
**Example usage:**
```bash
# Obsidian Publish site
crwl crawl -C configs/obsidian-publish.yaml -bc -o md "https://plugins.javalent.com/statblocks/readme/bestiary"
# Generic SPA
crwl crawl -C configs/spa.yaml -bc -o md "https://react-app.example.com"
```
| Format | Flag | Description |
|--------|------|-------------|
| Markdown | `-o md` | Clean markdown |
| Filtered markdown | `-o md-fit` | Removes navigation, ads, boilerplate |
| JSON | `-o json` | Full JSON with metadata, links, media |
| All | `-o all` | Everything (default) |
**Crawler parameters (`-c`):**
```bash
crwl crawl -c wait_for=css:.content URL # Wait for element
crwl crawl -c page_timeout=60000 URL # Extended timeout (ms)
crwl crawl -c remove_overlay_elements=true URL # Remove popups
```
**Browser parameters (`-b`):**
```bash
crwl crawl -b viewport_width=1920,viewport_height=1080 URL # Custom viewport
crwl crawl -b user_agent="Mozilla/5.0..." URL # Custom user agent
```
**Config files (`-C`, `-B`):**
```bash
crwl crawl -C crawler_config.yaml URL # Crawler config
crwl crawl -B browser_config.yaml URL # Browser config
```
**Example schema.json for structured extraction:**
```json
{
"name": "products",
"baseSelector": "div.product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
```
**Usage:**
```bash
crwl crawl -s schema.json https://shop.com
```
For multiple URLs, use the provided script:
```bash
./scripts/batch_crawl.sh urls.txt output_dir
```
**Or inline:**
```bash
while read url; do
crwl crawl -o md -O "output/$(echo $url | sed 's|https://||;s|/|_|g').md" "$url"
done < urls.txt
```
**Symptom:** Getting navigation/skeleton but no content
**Solution:** Use SPA config with JS wait condition:
```bash
crwl crawl -C configs/spa.yaml -bc -o md URL
```
If still failing, create site-specific config:
```yaml
page_timeout: 60000
wait_for: "js:() => document.body.innerText.includes('expected text')"
```
**Symptom:** Empty response or access denied
**Solution:** Use realistic browser settings:
```bash
crwl crawl \
-b viewport_width=1920,viewport_height=1080 \
-b user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)" \
URL
```
**Symptom:** Minimal or no content extracted
**Solution:** Check with verbose mode, then try waiting for content:
```bash
crwl crawl -v URL
crwl crawl -c wait_for=css:main URL
```
```
crwl crawl [OPTIONS] URL
Options:
-o, --output [md|md-fit|json|all] Output format
-O, --output-file PATH Save to file
-C, --crawler-config PATH Crawler config file (YAML/JSON)
-B, --browser-config PATH Browser config file (YAML/JSON)
-s, --schema PATH JSON schema for extraction
-j, --json-extract TEXT LLM extraction instruction
-q, --question TEXT Ask question about content
-c, --crawler TEXT Crawler params (key=value)
-b, --browser TEXT Browser params (key=value)
-bc, --bypass-cache Skip cache
--deep-crawl [bfs|dfs|best-first] Multi-page crawling
--max-pages INTEGER Max pages for deep crawl
-v, --verbose Verbose output
-h, --help Show help
```
**configs/** - Pre-made crawler configs for common patterns
**scripts/** - `batch_crawl.sh` for processing multiple URLs
Crawl is successful when:
- Output contains the actual page content (not just navigation/skeleton)
- For SPAs: body text includes expected content, not "Not found" or empty
- Markdown is clean and readable
- Structured extraction returns expected fields
**Quick validation:**
```bash
# Should return actual content, not just nav elements
crwl crawl -C configs/spa.yaml -bc -o md URL | head -50
```