# llms-txt-crawler > Fetch and crawl llms.txt files from websites. Parses the llms.txt format to extract page URLs and downloads all listed content. Use when you need to gather documentation or content from a website that provides an llms.txt file. - Author: quanganh - Repository: agykit/agykit - Version: 20260116100953 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/agykit/agykit - Web: https://mule.run/skillshub/@@agykit/agykit~llms-txt-crawler:20260116100953 --- --- name: llms-txt-crawler description: Fetch and crawl llms.txt files from websites. Parses the llms.txt format to extract page URLs and downloads all listed content. Use when you need to gather documentation or content from a website that provides an llms.txt file. compatibility: Requires Node.js 18+ and network access metadata: author: agy-kit version: "1.0" --- # llms.txt Crawler Skill This skill enables you to fetch `llms.txt` files from websites and crawl all pages listed within them. The `llms.txt` format is a standard way for websites to provide LLM-friendly content listings. ## Overview The `llms.txt` file typically follows this format: ``` # Site Name ## Section Name - [Page Title](https://example.com/page.md): Description of the page - [Another Page](https://example.com/another.md): Another description ``` This skill parses these files and downloads all linked content. ## Usage ### Basic Usage Run the crawl script with a target URL: ```bash cd /path/to/skills/llms-txt-crawler/scripts npm install # First time only node crawl.js --url https://example.com ``` ### Command Line Options | Option | Short | Description | Default | |--------|-------|-------------|---------| | `--url` | `-u` | Base URL of the site with llms.txt | Required | | `--output` | `-o` | Output directory for crawled files | `./output` | | `--format` | `-f` | Output format: `md`, `json`, or `txt` | `md` | | `--delay` | `-d` | Delay between requests in milliseconds | `500` | | `--concurrent` | `-c` | Maximum concurrent requests | `3` | ### Examples **Crawl agentskills.io documentation:** ```bash node crawl.js --url https://agentskills.io --output ./agentskills-docs ``` **Crawl with custom rate limiting:** ```bash node crawl.js --url https://example.com --delay 1000 --concurrent 2 ``` **Output as JSON:** ```bash node crawl.js --url https://example.com --format json ``` ## Output Structure The script creates the following output structure: ``` output/ ├── llms.txt # Original llms.txt file ├── index.json # Metadata about all crawled pages └── pages/ ├── page-1.md ├── page-2.md └── ... ``` ## Error Handling - **Network errors**: Retries up to 3 times with exponential backoff - **Rate limiting**: Respects delay settings between requests - **Missing pages**: Logs warnings but continues crawling other pages - **Invalid URLs**: Skips and logs invalid URLs ## Integration Tips When using this skill in an agent workflow: 1. First run the crawler to download content 2. The `index.json` file contains metadata about all pages 3. Use the downloaded markdown files for context or analysis ## See Also - [llms.txt Specification](https://llmstxt.org/) - [scripts/crawl.js](scripts/crawl.js) - The main crawler script