# crawl4ai > Web crawling and data extraction using the crwl CLI. Use this skill when users need to scrape websites, extract markdown content, handle JavaScript-heavy pages, or extract structured data from web pages. - Author: Marcel Hild - Repository: durandom/skills - Version: 20260206095937 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/durandom/skills - Web: https://mule.run/skillshub/@@durandom/skills~crawl4ai:20260206095937 --- --- name: crawl4ai description: Web crawling and data extraction using the crwl CLI. Use this skill when users need to scrape websites, extract markdown content, handle JavaScript-heavy pages, or extract structured data from web pages. version: 1.0.0 crawl4ai_cli: crwl last_updated: 2025-01-19 --- Extract web content as clean markdown using the `crwl` CLI. This skill handles static sites, JavaScript-rendered SPAs, and structured data extraction without requiring Python code. **Primary use cases:** - Convert documentation sites to markdown - Scrape JavaScript-heavy pages (React, Vue, Obsidian Publish) - Extract structured data from e-commerce, news, or catalog sites - Batch process multiple URLs **Basic crawl:** ```bash crwl crawl -o md https://example.com ``` **Save to file:** ```bash crwl crawl -o md -O output.md https://example.com ``` **JavaScript-heavy site (use pre-made config):** ```bash crwl crawl -C configs/spa.yaml -bc -o md "https://spa-site.com" ``` ```bash # Using uv (recommended) - include [all] for ML features uv tool install 'crawl4ai[all]' # Or with pip pip install 'crawl4ai[all]' # Run setup after installation crawl4ai-setup # Verify installation crwl --help crawl4ai-doctor ``` > **Note:** The `[all]` extra is required for ML features. Without it, `crawl4ai-download-models` will fail. **Markdown extraction:** ```bash crwl crawl -o md https://docs.example.com # Basic crwl crawl -o md-fit https://docs.example.com # Filtered (removes boilerplate) crwl crawl -o md -O docs.md https://docs.example.com # Save to file ``` **Structured data extraction:** ```bash crwl crawl -j "extract product names and prices" https://shop.com # LLM-based crwl crawl -s schema.json https://shop.com # Schema-based (faster) ``` **Deep crawling:** ```bash crwl crawl --deep-crawl bfs --max-pages 10 https://docs.example.com # Breadth-first crwl crawl --deep-crawl dfs --max-pages 5 https://blog.example.com # Depth-first ``` **Question answering:** ```bash crwl crawl -q "What are the main features?" https://product.example.com ``` **Critical pattern for JavaScript-heavy sites (React, Vue, Obsidian Publish, etc.):** CSS selectors alone don't work—the element exists but content loads asynchronously. Use a **JS wait condition** that checks for actual rendered text. **Use the pre-made SPA config:** ```bash crwl crawl -C configs/spa.yaml -bc -o md "https://spa-site.com" ``` **Or create a site-specific config:** ```yaml # my_site_config.yaml page_timeout: 60000 wait_for: "js:() => document.body.innerText.includes('expected content')" ``` ```bash crwl crawl -C my_site_config.yaml -bc -o md "https://my-spa-site.com" ``` **Why `-bc` (bypass cache)?** SPA content can be cached before JS renders. Always use `-bc` for SPAs. **Pre-made configs in `configs/` folder:** | Config | Use Case | |--------|----------| | `spa.yaml` | Generic SPA (waits for content length > 1000) | | `obsidian-publish.yaml` | Obsidian Publish sites | | `slow-site.yaml` | Sites with slow loading (60s timeout) | **Example usage:** ```bash # Obsidian Publish site crwl crawl -C configs/obsidian-publish.yaml -bc -o md "https://plugins.javalent.com/statblocks/readme/bestiary" # Generic SPA crwl crawl -C configs/spa.yaml -bc -o md "https://react-app.example.com" ``` | Format | Flag | Description | |--------|------|-------------| | Markdown | `-o md` | Clean markdown | | Filtered markdown | `-o md-fit` | Removes navigation, ads, boilerplate | | JSON | `-o json` | Full JSON with metadata, links, media | | All | `-o all` | Everything (default) | **Crawler parameters (`-c`):** ```bash crwl crawl -c wait_for=css:.content URL # Wait for element crwl crawl -c page_timeout=60000 URL # Extended timeout (ms) crwl crawl -c remove_overlay_elements=true URL # Remove popups ``` **Browser parameters (`-b`):** ```bash crwl crawl -b viewport_width=1920,viewport_height=1080 URL # Custom viewport crwl crawl -b user_agent="Mozilla/5.0..." URL # Custom user agent ``` **Config files (`-C`, `-B`):** ```bash crwl crawl -C crawler_config.yaml URL # Crawler config crwl crawl -B browser_config.yaml URL # Browser config ``` **Example schema.json for structured extraction:** ```json { "name": "products", "baseSelector": "div.product", "fields": [ {"name": "title", "selector": "h2", "type": "text"}, {"name": "price", "selector": ".price", "type": "text"}, {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"} ] } ``` **Usage:** ```bash crwl crawl -s schema.json https://shop.com ``` For multiple URLs, use the provided script: ```bash ./scripts/batch_crawl.sh urls.txt output_dir ``` **Or inline:** ```bash while read url; do crwl crawl -o md -O "output/$(echo $url | sed 's|https://||;s|/|_|g').md" "$url" done < urls.txt ``` **Symptom:** Getting navigation/skeleton but no content **Solution:** Use SPA config with JS wait condition: ```bash crwl crawl -C configs/spa.yaml -bc -o md URL ``` If still failing, create site-specific config: ```yaml page_timeout: 60000 wait_for: "js:() => document.body.innerText.includes('expected text')" ``` **Symptom:** Empty response or access denied **Solution:** Use realistic browser settings: ```bash crwl crawl \ -b viewport_width=1920,viewport_height=1080 \ -b user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)" \ URL ``` **Symptom:** Minimal or no content extracted **Solution:** Check with verbose mode, then try waiting for content: ```bash crwl crawl -v URL crwl crawl -c wait_for=css:main URL ``` ``` crwl crawl [OPTIONS] URL Options: -o, --output [md|md-fit|json|all] Output format -O, --output-file PATH Save to file -C, --crawler-config PATH Crawler config file (YAML/JSON) -B, --browser-config PATH Browser config file (YAML/JSON) -s, --schema PATH JSON schema for extraction -j, --json-extract TEXT LLM extraction instruction -q, --question TEXT Ask question about content -c, --crawler TEXT Crawler params (key=value) -b, --browser TEXT Browser params (key=value) -bc, --bypass-cache Skip cache --deep-crawl [bfs|dfs|best-first] Multi-page crawling --max-pages INTEGER Max pages for deep crawl -v, --verbose Verbose output -h, --help Show help ``` **configs/** - Pre-made crawler configs for common patterns **scripts/** - `batch_crawl.sh` for processing multiple URLs Crawl is successful when: - Output contains the actual page content (not just navigation/skeleton) - For SPAs: body text includes expected content, not "Not found" or empty - Markdown is clean and readable - Structured extraction returns expected fields **Quick validation:** ```bash # Should return actual content, not just nav elements crwl crawl -C configs/spa.yaml -bc -o md URL | head -50 ```