# website-crawler > Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base. - Author: Uzay-G - Repository: Uzay-G/whorl - Version: 20251229202202 - Stars: 1 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/Uzay-G/whorl - Web: https://mule.run/skillshub/@@Uzay-G/whorl~website-crawler:20251229202202 --- --- name: website-crawler description: Crawl and ingest websites into whorl. Use when scraping a personal site, blog, or extracting web content for the knowledge base. allowed-tools: Bash, Read, Write --- # Website Crawler for Whorl Crawl websites and ingest content into your whorl knowledge base. ## Prerequisites Install trafilatura if not already available: ```bash pip install trafilatura ``` ## Single Page Extract a single page and save to whorl docs: ```bash # Extract content as markdown trafilatura -u "https://example.com/page" --markdown > ~/.whorl/docs/page-name.md ``` Or with metadata in frontmatter: ```bash URL="https://example.com/page" SLUG=$(echo "$URL" | sed 's|https\?://||; s|/|_|g; s|_$||') OUTPUT=~/.whorl/docs/"$SLUG".md # Fetch and extract CONTENT=$(trafilatura -u "$URL" --markdown) TITLE=$(trafilatura -u "$URL" --json | python3 -c "import sys,json; print(json.load(sys.stdin).get('title','Untitled'))" 2>/dev/null || echo "Untitled") # Write with frontmatter cat > "$OUTPUT" << EOF --- title: "$TITLE" source_url: $URL fetched_at: $(date -u +%Y-%m-%dT%H:%M:%SZ) --- $CONTENT EOF echo "Saved to $OUTPUT" ``` ## Crawl Entire Site Crawl up to 30 pages from a site: ```bash trafilatura --crawl "https://example.com" --markdown -o ~/.whorl/docs/site-name/ ``` Or with sitemap: ```bash trafilatura --sitemap "https://example.com/sitemap.xml" --markdown -o ~/.whorl/docs/site-name/ ``` ## Crawl with Custom Limit For more control, use Python: ```python import os from pathlib import Path from datetime import datetime, timezone import trafilatura from trafilatura.spider import focused_crawler WHORL_DOCS = Path.home() / ".whorl" / "docs" site_dir = WHORL_DOCS / "my-site" site_dir.mkdir(parents=True, exist_ok=True) for url in focused_crawler("https://example.com", max_seen_urls=50): downloaded = trafilatura.fetch_url(url) if not downloaded: continue content = trafilatura.extract(downloaded, output_format='markdown') metadata = trafilatura.extract_metadata(downloaded) if not content: continue # Generate filename from URL slug = url.split("//")[-1].replace("/", "_").rstrip("_")[:80] filepath = site_dir / f"{slug}.md" # Write with frontmatter title = metadata.title if metadata else "Untitled" frontmatter = f"""--- title: "{title}" source_url: {url} fetched_at: {datetime.now(timezone.utc).isoformat()} --- """ filepath.write_text(frontmatter + content) print(f"+ {filepath.name}") ``` ## After Crawling Run whorl sync to process new documents with ingestion agents: ```bash whorl sync ``` Or if running locally without auth: ```bash curl -X POST http://localhost:8000/api/sync ``` ## Tips - **Rate limiting**: trafilatura respects robots.txt and has built-in politeness - **Deduplication**: whorl's hash index will detect duplicate content - **Binary files**: PDFs and images should be downloaded separately with `curl -O` - **Large sites**: Use `max_seen_urls` to limit scope, or target specific sitemaps