# webpage-to-md

> Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format.

- Author: fenix-wangminle
- Repository: wangminle/skills-webpage-to-md
- Version: 20260126114316
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/wangminle/skills-webpage-to-md
- Web: https://mule.run/skillshub/@@wangminle/skills-webpage-to-md~webpage-to-md:20260126114316

---

---
name: webpage-to-md
description: "Web scraping and Markdown conversion toolkit for extracting web content with images. Use when Claude needs to: (1) Save web articles/blogs as Markdown files, (2) Export WeChat articles (mp.weixin.qq.com), (3) Batch crawl Wiki sites and merge into single document, (4) Download webpage images locally, (5) Convert HTML tables/code blocks to Markdown format."
---

# Web to Markdown Grabber

Extract web content and convert to clean Markdown with local images.

## Script Location

This skill includes a Python script at `scripts/grab_web_to_md.py`.

When using this skill, replace `SKILL_DIR` with the actual skill installation path:
- Claude Code: `~/.claude/skills/webpage-to-md/`
- Cursor: `~/.cursor/skills/webpage-to-md/` (if installed there)
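
If you are unsure which location applies, a small helper can probe the two paths listed above. This is a minimal sketch: `find_script` and `CANDIDATE_DIRS` are hypothetical names used for illustration, and other tools may install the skill somewhere else.

```python
from pathlib import Path

# Install locations listed above; other tools may use different paths.
CANDIDATE_DIRS = [
    Path.home() / ".claude" / "skills" / "webpage-to-md",
    Path.home() / ".cursor" / "skills" / "webpage-to-md",
]


def find_script(candidates=CANDIDATE_DIRS):
    """Return the first existing grab_web_to_md.py path, or None."""
    for skill_dir in candidates:
        script = Path(skill_dir) / "scripts" / "grab_web_to_md.py"
        if script.is_file():
            return script
    return None
```

The directory containing `scripts/` is the value to substitute for `SKILL_DIR`.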

## Quick Start

```bash
# Single page export
python SKILL_DIR/scripts/grab_web_to_md.py "https://example.com/article" --out output.md --validate

# WeChat article (auto-detected)
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" --out article.md

# Wiki batch crawl + merge
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl --crawl-pattern 'page=' \
  --merge --toc --merge-output wiki.md
```

## Core Parameters

| Parameter | Purpose | Example |
|-----------|---------|---------|
| `--out` | Output file path | `--out docs/article.md` |
| `--validate` | Verify image integrity | `--validate` |
| `--keep-html` | Keep complex tables as raw HTML instead of converting them | `--keep-html` |
| `--tags` | Add YAML frontmatter tags | `--tags "ai,tutorial"` |

## Three Main Use Cases

### 1. Single Page Export (Blog/News)

```bash
python SKILL_DIR/scripts/grab_web_to_md.py "URL" \
  --out output.md \
  --keep-html \
  --tags "topic1,topic2" \
  --validate
```

**Auto behavior**: Downloads images to `output.assets/`, generates YAML frontmatter.
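
When calling the script from Python rather than a shell, the same flags can be assembled as an argument list. The sketch below only uses flags shown above; `build_export_command` and `run_export` are hypothetical helper names, not part of the skill.

```python
import subprocess
import sys


def build_export_command(script, url, out, tags=None, keep_html=False, validate=False):
    """Build the argument list for a single-page export."""
    cmd = [sys.executable, str(script), url, "--out", str(out)]
    if keep_html:
        cmd.append("--keep-html")
    if tags:
        # --tags takes a comma-separated string, e.g. "ai,tutorial"
        cmd += ["--tags", ",".join(tags)]
    if validate:
        cmd.append("--validate")
    return cmd


def run_export(script, url, out, **options):
    """Run the export and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_export_command(script, url, out, **options), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with URLs that contain `&` or `?`.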

### 2. WeChat Article Export

```bash
python SKILL_DIR/scripts/grab_web_to_md.py "https://mp.weixin.qq.com/s/xxx" \
  --out article.md
```

**Auto behavior**: Detects WeChat URL → extracts `rich_media_content` → cleans interaction buttons.
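
The detection step can be pictured as a hostname check. This is an illustrative sketch only: the script's real detection logic is not documented here, and `looks_like_wechat` is a hypothetical name.

```python
from urllib.parse import urlparse

WECHAT_HOST = "mp.weixin.qq.com"


def looks_like_wechat(url):
    """Return True when the URL's hostname is the WeChat article host."""
    host = urlparse(url).hostname or ""
    return host == WECHAT_HOST
```

Comparing the parsed hostname, rather than searching the whole URL string, avoids false positives when the WeChat domain only appears in a query parameter. For pages that are not detected automatically, `--wechat` forces the mode.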

### 3. Wiki Batch Crawl + Merge

```bash
python SKILL_DIR/scripts/grab_web_to_md.py "https://wiki.example.com/index" \
  --crawl \
  --crawl-pattern 'page=wiki' \
  --merge \
  --toc \
  --merge-output wiki_guide.md \
  --target-id body \
  --clean-wiki-noise \
  --rewrite-links \
  --download-images
```

**Parameters explained**:
- `--crawl`: Extract links from the index page
- `--crawl-pattern`: Regex used to select which links count as content pages
- `--merge --toc`: Combine all pages into a single file with a table of contents
- `--target-id body`: Extract only main content area
- `--clean-wiki-noise`: Remove edit buttons, navigation links
- `--rewrite-links`: Rewrite links between crawled pages as internal anchors in the merged file
- `--download-images`: Save images locally
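
To preview what a `--crawl-pattern` value would select, you can test the regex against a few links by hand. The sketch below assumes the pattern may match anywhere in the URL (`re.search` semantics); whether the script uses `search` or `match` is an assumption, and `filter_links` is a hypothetical name.

```python
import re


def filter_links(links, pattern):
    """Keep only the links whose URL matches the regex somewhere."""
    rx = re.compile(pattern)
    return [link for link in links if rx.search(link)]


links = [
    "https://wiki.example.com/?page=wiki%2FIntro",
    "https://wiki.example.com/?page=help",
    "https://wiki.example.com/?cmd=list",
]
content_pages = filter_links(links, r"page=wiki")
```

Under that assumption, a narrower pattern such as `page=wiki` keeps only the first link, while `page=` keeps the first two.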

## Content Extraction Parameters

| Parameter | Purpose |
|-----------|---------|
| `--target-id ID` | Extract element by id (e.g., `body`, `content`) |
| `--target-class CLASS` | Extract element by class (e.g., `article-body`) |
| `--clean-wiki-noise` | Remove Wiki system noise (PukiWiki/MediaWiki) |
| `--wechat` | Force WeChat article mode |

## Batch Processing Parameters

| Parameter | Default | Purpose |
|-----------|---------|---------|
| `--urls-file` | - | Read URLs from file |
| `--max-workers` | 3 | Concurrent threads |
| `--delay` | 1.0 | Request interval (seconds) |
| `--skip-errors` | False | Continue on failures |
| `--download-images` | False | Download images locally |
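
A URL list for `--urls-file` can be generated programmatically. The sketch below assumes the file holds one URL per line, which is not confirmed here, and both `write_urls_file` and `batch_args` are hypothetical helpers. The defaults mirror the table above.

```python
from pathlib import Path


def write_urls_file(urls, path):
    """Write unique, non-empty URLs to `path`, one per line (assumed format)."""
    unique = []
    for url in urls:
        url = url.strip()
        if url and url not in unique:
            unique.append(url)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def batch_args(urls_file, max_workers=3, delay=1.0, skip_errors=False):
    """Return the batch-related flags as an argument list."""
    args = [
        "--urls-file", str(urls_file),
        "--max-workers", str(max_workers),
        "--delay", str(delay),
    ]
    if skip_errors:
        args.append("--skip-errors")
    return args
```

Append the returned list to the script invocation. Check the full guide for whether a positional URL is still required alongside `--urls-file`.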

## Anti-Scraping Support

```bash
# With cookies
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --cookie "session=xxx"

# With custom headers
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --header "Authorization: Bearer xxx"

# Change User-Agent
python SKILL_DIR/scripts/grab_web_to_md.py "URL" --ua-preset firefox-win
```
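
At the HTTP level, `--cookie` and `--header` amount to extra request headers. The sketch below shows one plausible way to turn those values into a header dictionary. It is not the script's actual parsing code, and `build_headers` is a hypothetical name.

```python
def build_headers(cookie=None, headers=()):
    """Turn 'Name: value' strings and a cookie string into a header dict."""
    result = {}
    for raw in headers:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = raw.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'Name: value', got {raw!r}")
        result[name.strip()] = value.strip()
    if cookie:
        result["Cookie"] = cookie
    return result
```

For example, `build_headers("session=xxx", ["Authorization: Bearer xxx"])` yields an `Authorization` header and a `Cookie` header.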

## Output Structure

```
output.md                 # Markdown file
output.assets/            # Images directory
  ├── 01-hero.png
  └── 02-diagram.jpg
output.md.assets.json     # URL→local mapping
```
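
The companion paths can be derived from the `--out` value. This sketch infers the naming rule from the tree above (`output.md` gives `output.assets/` and `output.md.assets.json`); behavior for output names without an `.md` extension is an assumption, and `expected_outputs` is a hypothetical name.

```python
from pathlib import Path


def expected_outputs(out):
    """Return the paths expected next to a given --out file."""
    out = Path(out)
    return {
        "markdown": out,
        # output.md -> output.assets (extension replaced)
        "assets_dir": out.with_suffix(".assets"),
        # output.md -> output.md.assets.json (suffix appended to the full name)
        "mapping": out.with_name(out.name + ".assets.json"),
    }
```

This is handy for cleanup scripts or for checking that an export produced everything it should.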

## Common Site Configurations

| Site Type | Recommended Parameters |
|-----------|----------------------|
| PukiWiki | `--target-id body --clean-wiki-noise` |
| MediaWiki | `--target-id content --clean-wiki-noise` |
| WordPress | `--target-class entry-content` |
| WeChat | Auto-detected, or `--wechat` |
| Tech Blog | `--keep-html --tags "topic1,topic2"` |

## Dependencies

- **Required**: `requests` (HTTP requests)
- **Optional**: `markdown` (for PDF export with `--with-pdf`)

Install: `pip install requests`

## References

For complete documentation, see [references/full-guide.md](references/full-guide.md), which covers:
- All parameter explanations with defaults
- 9 usage scenarios with examples
- 3 detailed real-world cases
- Output structure diagrams
- Technical implementation details
- Changelog history