# docs-crawler

> Crawl documentation websites and save as clean Markdown files. Use when users need to download docs for offline use, create local documentation mirrors, or prepare docs for AI/LLM consumption. Supports recursive crawling, language filtering, and LLM-powered index generation.

- Author: Ehfaz Rezwan
- Repository: ehfazrezwan/docs-crawler
- Version: 20251222033303
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/ehfazrezwan/docs-crawler
- Web: https://mule.run/skillshub/@@ehfazrezwan/docs-crawler~docs-crawler:20251222033303

---

---
name: docs-crawler
description: Crawl documentation websites and save as clean Markdown files. Use when users need to download docs for offline use, create local documentation mirrors, or prepare docs for AI/LLM consumption. Supports recursive crawling, language filtering, and LLM-powered index generation.
---

# Docs Crawler

A recursive documentation site crawler that saves pages as Markdown files, preserving the original directory structure. Built with Crawl4AI.

## Quick Start

### Crawl a Documentation Site

```bash
# Navigate to the docs-crawler project (adjust the path to your own checkout)
cd /Users/ehfaz.rezwan/Projects/docs-crawler

# Activate the virtual environment
source env/bin/activate

# Crawl a documentation site
python crawl.py https://docs.example.com
```

### Generate AI-Powered Index

```bash
# Set your LLM API key
export LLM_GATEWAY_API_KEY="your-api-key"

# Generate index with summaries and keywords
python generate_index_llm.py --docs-dir example-docs
```

## Core Capabilities

### 1. Recursive Documentation Crawling

The crawler automatically discovers and follows internal links:

```bash
# Basic crawl (auto-detects output directory from domain)
python crawl.py https://docs.python.org

# Custom output directory
python crawl.py https://docs.python.org -o python-docs

# Limit pages for testing
python crawl.py https://docs.example.com --max-pages 50
```

### 2. Markdown Conversion

Each page is converted to clean Markdown with:

- Original URL preserved as a comment
- Directory structure mirroring the site
- Clean formatting suitable for LLMs

Output structure:

```
python-docs/
├── index.md
├── quickstart.md
├── tutorial/
│   ├── index.md
│   └── basics.md
└── reference/
    └── api.md
```

### 3. Language Filtering

By default, non-English language versions are excluded:

```bash
# Include all languages
python crawl.py https://docs.example.com --include-all-langs

# Custom language exclusions
python crawl.py https://docs.example.com --exclude-langs zh,ko,ja,de
```

### 4. LLM-Powered Index Generation

Generate an index with AI-written summaries and keywords for each page:

```bash
# Basic index generation
python generate_index_llm.py --docs-dir python-docs

# Custom LLM provider
export LLM_GATEWAY_BASE_URL="https://api.openai.com/v1"
export LLM_MODEL="gpt-4o-mini"
python generate_index_llm.py --docs-dir python-docs
```

Features:

- Smart content truncation (~1000 tokens per doc)
- JSON caching for resumable runs
- Batch writing for progress visibility
- Graceful error handling

## Common Workflows

### Full Pipeline: Crawl + Index

```bash
# Step 1: Crawl the documentation
python crawl.py https://docs.ultralytics.com

# Step 2: Generate AI-powered index
export LLM_GATEWAY_API_KEY="your-key"
python generate_index_llm.py --docs-dir docs-ultralytics-com
```

### Resume an Interrupted Index Run

The index generator caches results, so an interrupted run can be resumed:

```bash
# If interrupted, just run again - it will skip cached files
python generate_index_llm.py --docs-dir python-docs

# To start fresh, use --no-cache
python generate_index_llm.py --docs-dir python-docs --no-cache
```

### Prepare Docs for RAG/LLM

```bash
# Crawl with focused output
python crawl.py https://langchain.com/docs -o langchain-docs --max-pages 100

# Generate searchable index
python generate_index_llm.py --docs-dir langchain-docs --output langchain-docs/INDEX.md
```

## Configuration

### Crawler Options

| Option | Default | Description |
|--------|---------|-------------|
| `url` | (required) | Base URL of the documentation site |
| `-o, --output` | auto | Output directory (derived from domain) |
| `--max-pages` | unlimited | Maximum pages to crawl |
| `--max-concurrent` | 5 | Concurrent requests |
| `--delay` | 1.0 | Delay between batches (seconds) |
| `--exclude-langs` | zh,ko,ja,... | Languages to exclude |
| `--include-all-langs` | false | Include all languages |
| `-q, --quiet` | false | Suppress progress output |

### Index Generator Options

| Option | Default | Description |
|--------|---------|-------------|
| `--docs-dir` | docs | Directory with markdown files |
| `--output` | docs/INDEX_LLM.md | Output index file |
| `--cache` | docs/.index_cache.json | Cache file path |
| `--no-cache` | false | Ignore cache, start fresh |
| `--api-key` | env var | LLM API key |
| `--base-url` | env var | LLM API base URL |
| `--model` | env var | LLM model to use |

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `LLM_GATEWAY_API_KEY` | Yes (for index) | API key for LLM provider |
| `LLM_GATEWAY_BASE_URL` | No | Base URL for OpenAI-compatible API |
| `LLM_MODEL` | No | Model to use (default: gpt-4o-mini) |

## Best Practices

1. **Start with a page limit** when testing:
   ```bash
   python crawl.py https://docs.example.com --max-pages 10
   ```

2. **Use caching during development** - The index generator caches results automatically

3. **Respect rate limits** - Default settings are conservative; adjust `--delay` if needed

4. **Check the crawl summary** - Look at `_crawl_summary.txt` for statistics

5. **Use quiet mode for scripts**:
   ```bash
   python crawl.py https://docs.example.com -q
   ```

## Troubleshooting

### Pages not being crawled

Check whether they are being filtered out by one of these rules:

- Language paths (e.g., `/zh/`, `/ko/`)
- Asset paths (`/assets/`, `/static/`)
- Non-HTML extensions (`.png`, `.pdf`)

### JavaScript content not loading

The crawler uses Playwright and should handle JS. If content is missing:

- Try increasing `--delay` for slow-loading sites
- Check if the site requires authentication

### Index generation fails

1. Verify API key is set: `echo $LLM_GATEWAY_API_KEY`
2. Check the cache file for partial results
3. Use `--no-cache` to start fresh

## Resources

- [CLI Reference](references/cli-reference.md) - Detailed command-line documentation
- [Workflow Examples](examples/workflows.md) - Common usage patterns
- [README.md](../../README.md) - Project documentation

## Dependencies

- **crawl4ai** - Web crawling engine
- **langchain-openai** - LLM integration (for index generation)
- **playwright** - Browser automation

Install with:

```bash
pip install -r requirements.txt
playwright install chromium
```
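
After installing, a quick preflight check can confirm that the environment is ready before starting a long crawl. The sketch below is not part of the project: `missing_modules` and `preflight` are hypothetical helpers, and the import names (`crawl4ai`, `langchain_openai`, `playwright`) are assumed to match the packages listed above. It only checks that the Python packages are importable and that `LLM_GATEWAY_API_KEY` is set; it does not verify that the Chromium browser binary was downloaded.

```python
import importlib.util
import os

# Assumed import names for the packages listed under Dependencies.
REQUIRED_MODULES = ["crawl4ai", "langchain_openai", "playwright"]


def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(env=None):
    """Collect human-readable problems; an empty list means nothing was flagged."""
    env = os.environ if env is None else env
    problems = [f"missing package: {name}" for name in missing_modules(REQUIRED_MODULES)]
    # The API key is only needed for generate_index_llm.py, not for crawl.py.
    if not env.get("LLM_GATEWAY_API_KEY"):
        problems.append("LLM_GATEWAY_API_KEY is not set (needed for index generation)")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

Run it inside the activated virtual environment; any line it prints points to either a missing `pip install` step or an unset environment variable.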