# docs-crawler

> Crawl documentation websites and save as clean Markdown files. Use when users need to download docs for offline use, create local documentation mirrors, or prepare docs for AI/LLM consumption. Supports recursive crawling, language filtering, and LLM-powered index generation.

- Author: Ehfaz Rezwan
- Repository: ehfazrezwan/docs-crawler
- Version: 20251222033303
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/ehfazrezwan/docs-crawler
- Web: https://mule.run/skillshub/@@ehfazrezwan/docs-crawler~docs-crawler:20251222033303

---


# Docs Crawler

A recursive documentation site crawler that saves pages as Markdown files, preserving the original directory structure. Built with Crawl4AI.

## Quick Start

### Crawl a Documentation Site

```bash
# Navigate to your local clone of the docs-crawler project
cd /path/to/docs-crawler

# Activate the virtual environment
source env/bin/activate

# Crawl a documentation site
python crawl.py https://docs.example.com
```

### Generate AI-Powered Index

```bash
# Set your LLM API key
export LLM_GATEWAY_API_KEY="your-api-key"

# Generate index with summaries and keywords
# (the crawl above writes to a directory named after the domain)
python generate_index_llm.py --docs-dir docs-example-com
```

## Core Capabilities

### 1. Recursive Documentation Crawling

The crawler automatically discovers and follows internal links:

```bash
# Basic crawl (auto-detects output directory from domain)
python crawl.py https://docs.python.org

# Custom output directory
python crawl.py https://docs.python.org -o python-docs

# Limit pages for testing
python crawl.py https://docs.example.com --max-pages 50
```
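
The link-following behavior described above can be pictured as a breadth-first traversal restricted to the starting host. The following is a minimal sketch under that assumption, not the crawler's actual code: `crawl_order`, `is_internal`, and the `get_links` callback are hypothetical names, and a toy in-memory link graph stands in for real page fetches.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def is_internal(url: str, base_url: str) -> bool:
    """True when url is served from the same host as base_url."""
    return urlparse(url).netloc == urlparse(base_url).netloc


def crawl_order(base_url, get_links, max_pages=None):
    """Breadth-first traversal over internal links.

    get_links(url) is a hypothetical callback returning the hrefs found on a page.
    """
    seen = {base_url}
    queue = deque([base_url])
    visited = []
    while queue and (max_pages is None or len(visited) < max_pages):
        url = queue.popleft()
        visited.append(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments so a page is queued once.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen and is_internal(link, base_url):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real HTTP fetches.
SITE = {
    "https://docs.example.com/": ["/quickstart", "tutorial/", "https://github.com/example"],
    "https://docs.example.com/quickstart": ["/", "/tutorial/#intro"],
    "https://docs.example.com/tutorial/": ["basics"],
    "https://docs.example.com/tutorial/basics": [],
}

pages = crawl_order("https://docs.example.com/", lambda u: SITE.get(u, []))
print(pages)
```

Passing `max_pages=2` to the same function stops after two pages, which mirrors what `--max-pages` is described as doing.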

### 2. Markdown Conversion

Each page is converted to clean Markdown with:
- Original URL preserved as a comment
- Directory structure mirroring the site
- Clean formatting suitable for LLMs

Output structure:
```
python-docs/
├── index.md
├── quickstart.md
├── tutorial/
│   ├── index.md
│   └── basics.md
└── reference/
    └── api.md
```
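
One way to picture how a URL could map onto the tree above is sketched here. This is an illustration only: `url_to_markdown_path` and `with_source_comment` are hypothetical helpers, and the exact format of the URL comment written by the crawler is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_markdown_path(url: str, output_dir: str) -> PurePosixPath:
    """Map a page URL to a .md path that mirrors the site's directory layout."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        # Directory-style URLs become index.md inside that directory.
        path += "index"
    rel = PurePosixPath(path.lstrip("/"))
    if rel.suffix in (".html", ".htm"):
        rel = rel.with_suffix("")
    return PurePosixPath(output_dir) / (str(rel) + ".md")


def with_source_comment(url: str, markdown: str) -> str:
    """Prepend the original URL as an HTML comment (comment format is assumed)."""
    return f"<!-- Source: {url} -->\n\n{markdown}"


print(url_to_markdown_path("https://docs.python.org/tutorial/basics.html", "python-docs"))
```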

### 3. Language Filtering

By default, non-English language versions are excluded. To change that:

```bash
# Include all languages
python crawl.py https://docs.example.com --include-all-langs

# Custom language exclusions
python crawl.py https://docs.example.com --exclude-langs zh,ko,ja,de
```
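
A language filter of this kind can be approximated by checking each URL path segment against a set of language codes. The sketch below assumes that approach; `parse_langs`, `is_excluded_language`, and the locale pattern are hypothetical and may differ from what `crawl.py` really does.

```python
import re
from urllib.parse import urlparse

# Matches segments that look like a locale: "zh", "zh-cn", "pt_br".
LOCALE_SEGMENT = re.compile(r"^([a-z]{2})(?:[-_][a-z0-9]{2,4})?$")


def parse_langs(value: str) -> set[str]:
    """Turn a comma-separated value such as 'zh,ko,ja,de' into a set of codes."""
    return {code.strip().lower() for code in value.split(",") if code.strip()}


def is_excluded_language(url: str, excluded: set[str]) -> bool:
    """True when a path segment is a locale whose language code is excluded."""
    for segment in urlparse(url).path.lower().split("/"):
        match = LOCALE_SEGMENT.match(segment)
        if match and match.group(1) in excluded:
            return True
    return False


excluded = parse_langs("zh,ko,ja,de")
print(is_excluded_language("https://docs.example.com/zh-CN/guide/", excluded))
```

Passing an empty set corresponds to the `--include-all-langs` case, since no segment can match.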

### 4. LLM-Powered Index Generation

Generate an index with LLM-written summaries and keywords for each page:

```bash
# Basic index generation
python generate_index_llm.py --docs-dir python-docs

# Custom LLM provider
export LLM_GATEWAY_BASE_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o-mini"
python generate_index_llm.py --docs-dir python-docs
```

Features:
- Smart content truncation (~1000 tokens per doc)
- JSON caching for resumable runs
- Batch writing for progress visibility
- Graceful error handling
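
To illustrate the truncation feature, here is a rough sketch that caps a document at about 1000 tokens before it is sent to the model. It assumes a heuristic of roughly 4 characters per token; `truncate_for_llm` is a hypothetical helper, and the real script may count tokens differently.

```python
def truncate_for_llm(text: str, max_tokens: int = 1000, chars_per_token: int = 4) -> str:
    """Trim text to roughly max_tokens, cutting at a paragraph or line break when possible."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    clipped = text[:budget]
    # Prefer to end on a paragraph break, then a line break, to avoid a dangling sentence.
    for separator in ("\n\n", "\n"):
        cut = clipped.rfind(separator)
        if cut > budget // 2:
            return clipped[:cut].rstrip()
    return clipped.rstrip()


long_doc = ("word " * 50 + "\n\n") * 100
short = truncate_for_llm(long_doc)
print(len(long_doc), len(short))
```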

## Common Workflows

### Full Pipeline: Crawl + Index

```bash
# Step 1: Crawl the documentation
python crawl.py https://docs.ultralytics.com

# Step 2: Generate AI-powered index
export LLM_GATEWAY_API_KEY="your-key"
python generate_index_llm.py --docs-dir docs-ultralytics-com
```
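
The directory name used in step 2 suggests that the automatic output directory is the host name with dots replaced by hyphens. A sketch of that inferred rule follows; `default_output_dir` is a hypothetical name, and the rule is deduced from this one example rather than confirmed by the source.

```python
from urllib.parse import urlparse


def default_output_dir(url: str) -> str:
    """Derive a directory name from the URL's host by replacing dots with hyphens."""
    host = urlparse(url).hostname or ""
    return host.replace(".", "-")


print(default_output_dir("https://docs.ultralytics.com"))
```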

### Resume Interrupted Index Generation

The index generator caches results, so an interrupted run can pick up where it left off:

```bash
# If interrupted, just run again - it will skip cached files
python generate_index_llm.py --docs-dir python-docs

# To start fresh, use --no-cache
python generate_index_llm.py --docs-dir python-docs --no-cache
```
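
The resume behavior can be sketched as a JSON cache that is consulted before each file is summarized and rewritten after each result. This is an assumed design, not the script's actual cache format: `load_cache`, `index_docs`, the `summarize` callback, and the `sha256`/`summary` keys are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def load_cache(cache_path: Path) -> dict:
    """Return cached entries, or an empty dict when no usable cache exists."""
    try:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def index_docs(docs_dir: Path, cache_path: Path, summarize, use_cache: bool = True) -> dict:
    """Summarize every .md file, skipping files whose content hash is already cached."""
    cache = load_cache(cache_path) if use_cache else {}
    for md_file in sorted(docs_dir.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = md_file.relative_to(docs_dir).as_posix()
        if cache.get(key, {}).get("sha256") == digest:
            continue  # unchanged since the last run
        cache[key] = {"sha256": digest, "summary": summarize(text)}
        # Write after every file so an interrupted run loses at most one result.
        cache_path.write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```

Calling `index_docs(..., use_cache=False)` corresponds to the `--no-cache` case: the existing cache is ignored and every file is processed again.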

### Prepare Docs for RAG/LLM

```bash
# Crawl into a named directory, capped at 100 pages
python crawl.py https://langchain.com/docs -o langchain-docs --max-pages 100

# Generate searchable index
python generate_index_llm.py --docs-dir langchain-docs --output langchain-docs/INDEX.md
```

## Configuration

### Crawler Options

| Option | Default | Description |
|--------|---------|-------------|
| `url` | (required) | Base URL of the documentation site |
| `-o, --output` | auto | Output directory (derived from domain) |
| `--max-pages` | unlimited | Maximum pages to crawl |
| `--max-concurrent` | 5 | Concurrent requests |
| `--delay` | 1.0 | Delay between batches (seconds) |
| `--exclude-langs` | zh,ko,ja,... | Languages to exclude |
| `--include-all-langs` | false | Include all languages |
| `-q, --quiet` | false | Suppress progress output |
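
For readers who want to see the table as code, the options could be declared with `argparse` roughly as follows. This is a sketch that mirrors the table, not the real `crawl.py` parser; in particular, the `--exclude-langs` default is a placeholder because the full default list is elided above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the crawler options listed in the table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="crawl.py")
    parser.add_argument("url", help="Base URL of the documentation site")
    parser.add_argument("-o", "--output", default=None,
                        help="Output directory (derived from domain when omitted)")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="Maximum pages to crawl (unlimited when omitted)")
    parser.add_argument("--max-concurrent", type=int, default=5,
                        help="Concurrent requests")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Delay between batches in seconds")
    # Placeholder default: the documented default list is longer but not fully shown.
    parser.add_argument("--exclude-langs", default="zh,ko,ja",
                        help="Comma-separated languages to exclude")
    parser.add_argument("--include-all-langs", action="store_true",
                        help="Include all languages")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Suppress progress output")
    return parser


args = build_parser().parse_args(["https://docs.example.com", "--max-pages", "50", "-q"])
print(args)
```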

### Index Generator Options

| Option | Default | Description |
|--------|---------|-------------|
| `--docs-dir` | docs | Directory with markdown files |
| `--output` | docs/INDEX_LLM.md | Output index file |
| `--cache` | docs/.index_cache.json | Cache file path |
| `--no-cache` | false | Ignore cache, start fresh |
| `--api-key` | env var | LLM API key |
| `--base-url` | env var | LLM API base URL |
| `--model` | env var | LLM model to use |

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `LLM_GATEWAY_API_KEY` | Yes (for index) | API key for LLM provider |
| `LLM_GATEWAY_BASE_URL` | No | Base URL for OpenAI-compatible API |
| `LLM_MODEL` | No | Model to use (default: gpt-4o-mini) |
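
The tables above imply a precedence of command-line flag first, then environment variable, then built-in default. A minimal sketch of that resolution order, assuming it holds, is shown below; `resolve_llm_config` is a hypothetical helper and not part of the project.

```python
import os
from typing import Mapping, Optional


def resolve_llm_config(
    api_key: Optional[str] = None,
    base_url: Optional[str] = None,
    model: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> dict:
    """Prefer explicit arguments, then environment variables, then defaults."""
    key = api_key or env.get("LLM_GATEWAY_API_KEY")
    if not key:
        raise ValueError("Set LLM_GATEWAY_API_KEY or pass --api-key")
    return {
        "api_key": key,
        # None here means "use the provider's default endpoint" (assumed behavior).
        "base_url": base_url or env.get("LLM_GATEWAY_BASE_URL"),
        "model": model or env.get("LLM_MODEL") or "gpt-4o-mini",
    }


config = resolve_llm_config(env={"LLM_GATEWAY_API_KEY": "your-api-key"})
print(config["model"])
```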

## Best Practices

1. **Start with a page limit** when testing:
   ```bash
   python crawl.py https://docs.example.com --max-pages 10
   ```

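2. **Use caching during development** - The index generator caches results automatically, so repeat runs skip files that are already cached

3. **Respect rate limits** - The defaults (5 concurrent requests, 1.0 s between batches) are conservative; raise `--delay` or lower `--max-concurrent` if a site throttles you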

4. **Check the crawl summary** - Look at `_crawl_summary.txt` for statistics

5. **Use quiet mode for scripts**:
   ```bash
   python crawl.py https://docs.example.com -q
   ```
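
To make the rate-limit settings concrete, the interaction between `--max-concurrent` and `--delay` can be sketched as batches of concurrent fetches with a pause between batches. This is one plausible reading of "delay between batches", not the crawler's confirmed scheduling; `crawl_in_batches` and `fake_fetch` are hypothetical.

```python
import asyncio


async def crawl_in_batches(urls, fetch, max_concurrent=5, delay=1.0):
    """Fetch urls in batches of max_concurrent, sleeping delay seconds between batches."""
    results = []
    for start in range(0, len(urls), max_concurrent):
        batch = urls[start:start + max_concurrent]
        # gather preserves the order of the batch in its results.
        results.extend(await asyncio.gather(*(fetch(u) for u in batch)))
        if start + max_concurrent < len(urls):
            await asyncio.sleep(delay)  # pause only when another batch follows
    return results


async def fake_fetch(url):
    # Hypothetical stand-in for a real page fetch.
    await asyncio.sleep(0)
    return f"fetched {url}"


urls = [f"https://docs.example.com/page{i}" for i in range(7)]
results = asyncio.run(crawl_in_batches(urls, fake_fetch, max_concurrent=3, delay=0.01))
print(len(results))
```

Under this model, a larger `delay` or a smaller `max_concurrent` both reduce the request rate seen by the site.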

## Troubleshooting

### Pages not being crawled

Check whether the missing URLs match one of the crawler's filters:
- Language paths (e.g., `/zh/`, `/ko/`)
- Asset paths (`/assets/`, `/static/`)
- Non-HTML extensions (`.png`, `.pdf`)
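
When debugging a missing page, a small helper that reports why a URL would be skipped can be handy. The sketch below covers the asset-path and extension filters from the list above; `skip_reason` is hypothetical, and the two sets contain only the examples named here, since the crawler's full lists are not shown.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Only the examples named in this section; the real lists may be longer.
ASSET_SEGMENTS = ("/assets/", "/static/")
NON_HTML_EXTENSIONS = {".png", ".pdf"}


def skip_reason(url: str):
    """Return why a URL would be skipped, or None when it would be crawled."""
    path = urlparse(url).path.lower()
    if any(segment in path for segment in ASSET_SEGMENTS):
        return "asset path"
    if PurePosixPath(path).suffix in NON_HTML_EXTENSIONS:
        return "non-HTML extension"
    return None


for candidate in (
    "https://docs.example.com/static/logo.png",
    "https://docs.example.com/guide/manual.pdf",
    "https://docs.example.com/guide/",
):
    print(candidate, "->", skip_reason(candidate))
```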

### JavaScript content not loading

The crawler renders pages with Playwright, so JavaScript-generated content is normally captured. If content is missing:
- Try increasing `--delay` for slow-loading sites
- Check if the site requires authentication

### Index generation fails

1. Verify API key is set: `echo $LLM_GATEWAY_API_KEY`
2. Check the cache file (default: `docs/.index_cache.json`) for partial results
3. Use `--no-cache` to start fresh

## Resources

- [CLI Reference](references/cli-reference.md) - Detailed command-line documentation
- [Workflow Examples](examples/workflows.md) - Common usage patterns
- [README.md](../../README.md) - Project documentation

## Dependencies

- **crawl4ai** - Web crawling engine
- **langchain-openai** - LLM integration (for index generation)
- **playwright** - Browser automation

Install with:
```bash
pip install -r requirements.txt
playwright install chromium
```