# web-to-rag

> Import content into local RAG. Triggers: "add to RAG", "scrape docs",
"import YouTube", "embed PDF", "build knowledge base", "embedda sito",
"importa video", "aggiungi PDF", "crea knowledge base".

- Author: Tapiocapioca
- Repository: Tapiocapioca/my-skills
- Version: 20260104215933
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Tapiocapioca/my-skills
- Web: https://mule.run/skillshub/@@Tapiocapioca/my-skills~web-to-rag:20260104215933

---

---
name: web-to-rag
description: |
  Import content into local RAG. Triggers: "add to RAG", "scrape docs",
  "import YouTube", "embed PDF", "build knowledge base", "embedda sito",
  "importa video", "aggiungi PDF", "crea knowledge base".
---

# Web to RAG

Import websites, YouTube videos, and PDFs into AnythingLLM.

## Quick Reference

| Source | Detection | Tool |
|--------|-----------|------|
| YouTube | `youtube.com`, `youtu.be` | yt-dlp-server:8501 |
| PDF | `.pdf` extension | pdftotext |
| Docs site | `/sitemap.xml` exists | Crawl4AI sitemap |
| Generic | Everything else | Crawl4AI BFS |

| MCP Tool | Purpose |
|----------|---------|
| `mcp__crawl4ai__md` | Scrape page to markdown |
| `mcp__anythingllm__list_workspaces` | List workspaces |
| `mcp__anythingllm__embed_text` | Add content to workspace |
| `mcp__anythingllm__chat_with_workspace` | Query RAG (mode: "query") |

## Prerequisites

```bash
# Verify containers
docker ps | grep -E "crawl4ai|anythingllm"

# Start if needed
docker start crawl4ai anythingllm

# YouTube/audio support
docker start yt-dlp-server whisper-server
```

For "Client not initialized" error:
```
mcp__anythingllm__initialize_anythingllm
  apiKey: "YOUR_API_KEY"
  baseUrl: "http://localhost:3001"
```

## Core Workflow

```
1. Detect content type from URL
2. Find or create workspace (name from domain)
3. Crawl content (max 3 parallel)
4. Embed with metadata
5. Report results with test query
```

### Rate Limits

- 3 URLs maximum per batch
- Wait before next batch
- Confirm if > 50 pages

### Workspace Naming

| URL | Workspace |
|-----|-----------|
| docs.anthropic.com | anthropic-docs |
| react.dev/learn | react-dev |
| example.com/blog | example-blog |

## Embedding Format

```markdown
---
source: {url}
title: {page_title}
scraped: {timestamp}
---
{cleaned_content}
```

## Common Mistakes

| Mistake | Fix |
|---------|-----|
| Rate limit exceeded | Max 3 parallel, wait between batches |
| Empty content embedded | Skip empty pages |
| Wrong chat mode | Use `mode: "query"` |
| Uninitialized client | Call `initialize_anythingllm` first |
| External links crawled | Filter to same domain |

## Error Handling

| Error | Action |
|-------|--------|
| 403 Forbidden | Use Playwright fallback |
| 429 Rate Limited | Pause 60s, reduce parallelism |
| Timeout | Retry once, skip |
| Empty content | Skip |

## Extended Workflows

See references/:
- [youtube-workflow.md](references/youtube-workflow.md) - YouTube transcripts
- [pdf-workflow.md](references/pdf-workflow.md) - PDF extraction
- [crawl-strategies.md](references/crawl-strategies.md) - Advanced crawling
- [interactive-mode.md](references/interactive-mode.md) - Page selection
- [scheduling.md](references/scheduling.md) - Automatic updates
- [troubleshooting.md](references/troubleshooting.md) - Problem solving

## Report Template

```markdown
## Scraping Report

**URL:** {url}
**Type:** {docs|generic|youtube|pdf}
**Workspace:** {name}

### Stats
- Found: X
- Processed: Y
- Failed: Z

### Test Query
mcp__anythingllm__chat_with_workspace
  slug: "{workspace}"
  message: "What does this documentation cover?"
  mode: "query"
```