# webpage-to-markdown

> Extract webpage content and convert to Markdown format using Playwright browser automation. Use when you need to fetch and convert web content to Markdown, handle JavaScript-rendered pages, or preserve webpage formatting as Markdown. Supports extracting text, images, links, and formatting from any URL.

- Author: super-people
- Repository: stevelin001/gl-ai-tools-plugin
- Version: 20260103165004
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/stevelin001/gl-ai-tools-plugin
- Web: https://mule.run/skillshub/@@stevelin001/gl-ai-tools-plugin~webpage-to-markdown:20260103165004

---

---
name: webpage-to-markdown
description: Extract webpage content and convert to Markdown format using Playwright browser automation. Use when you need to fetch and convert web content to Markdown, handle JavaScript-rendered pages, or preserve webpage formatting as Markdown. Supports extracting text, images, links, and formatting from any URL.
---

# Webpage to Markdown Converter

## Overview

Convert webpage content to Markdown format using Playwright browser automation. This skill uses a real browser to handle JavaScript-rendered content, ensuring accurate extraction of dynamic webpages.

**Key capabilities:**

- Fetch content from JavaScript-rendered pages
- Preserve text formatting, links, images, and structure
- Customize conversion options (heading styles, image inclusion, etc.)
- Handle timeouts and wait conditions for slow-loading pages

## When to Use This Skill

**Use this skill when:**

- Converting webpage content to Markdown for documentation
- Extracting article/blog post content in readable format
- Archiving web content in a portable format
- Processing JavaScript-heavy single-page applications (SPAs)

**Do NOT use this skill for:**

- Testing web applications (use `webapp-testing` skill instead)
- Taking screenshots of webpages (use screenshot tools)
- Scraping structured data (consider dedicated scraping tools)
- Accessing APIs (use direct API requests)

## Quick Start

### Basic Usage

```bash
# Convert webpage to Markdown (output to stdout)
python3 scripts/fetch_as_markdown.py https://example.com

# Save to file
python3 scripts/fetch_as_markdown.py https://example.com -o output.md

# See all options
python3 scripts/fetch_as_markdown.py --help
```

### Common Patterns

**1. Convert blog post:**

```bash
python3 scripts/fetch_as_markdown.py https://blog.example.com/article -o article.md
```

**2. Handle slow-loading page:**

```bash
python3 scripts/fetch_as_markdown.py https://slow-site.com \
  --timeout 60000 --wait-for networkidle --verbose
```

**3. Batch conversion:**

```bash
for url in $(cat urls.txt); do
  filename=$(echo "$url" | sed 's|https://||' | sed 's|/|_|g').md
  python3 scripts/fetch_as_markdown.py "$url" -o "$filename"
done
```

## Options Reference

### Required Arguments

- `url` - URL to fetch (must start with http:// or https://)

### Optional Arguments

- `-o, --output FILE` - Save to file instead of stdout
- `--timeout MS` - Page load timeout in milliseconds (default: 30000)
- `--wait-for STATE` - Wait condition: load, domcontentloaded, networkidle (default: networkidle)
- `--no-strip-scripts` - Keep script/style tags (default: remove them)
- `--no-images` - Exclude image references (default: include them)
- `--heading-style STYLE` - Heading style: atx (#) or underlined (default: atx)
- `--verbose` - Print verbose output to stderr

### Wait States Explained

- **load**: Wait for the `load` event, fired once the page and its resources (images, stylesheets) have loaded (medium, may miss content rendered later by JavaScript)
- **domcontentloaded**: Wait for the `DOMContentLoaded` event, fired as soon as the HTML is parsed (fastest, misses late-loading resources and async content)
- **networkidle**: Wait until there has been no network activity for at least 500 ms (slowest, most complete)

**Recommendation**: Use `networkidle` (default) for JavaScript-heavy sites.
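
When the script is driven from Python rather than the shell, the options above can be assembled into an argument list before the call. The following is a minimal sketch, not part of the skill: `build_command` and its parameter names are hypothetical, and it assumes the flags and defaults behave exactly as listed in the Options Reference.

```python
from typing import List, Optional

# Values taken from the Options Reference above.
WAIT_STATES = ("load", "domcontentloaded", "networkidle")
HEADING_STYLES = ("atx", "underlined")
DEFAULT_TIMEOUT_MS = 30000


def build_command(
    url: str,
    output: Optional[str] = None,
    timeout_ms: int = DEFAULT_TIMEOUT_MS,
    wait_for: str = "networkidle",
    include_images: bool = True,
    heading_style: str = "atx",
    verbose: bool = False,
) -> List[str]:
    """Build an argv list for scripts/fetch_as_markdown.py (hypothetical helper)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if wait_for not in WAIT_STATES:
        raise ValueError(f"wait_for must be one of {WAIT_STATES}")
    if heading_style not in HEADING_STYLES:
        raise ValueError(f"heading_style must be one of {HEADING_STYLES}")

    cmd = ["python3", "scripts/fetch_as_markdown.py", url]
    # Only emit flags that differ from the documented defaults.
    if output:
        cmd += ["-o", output]
    if timeout_ms != DEFAULT_TIMEOUT_MS:
        cmd += ["--timeout", str(timeout_ms)]
    if wait_for != "networkidle":
        cmd += ["--wait-for", wait_for]
    if not include_images:
        cmd.append("--no-images")
    if heading_style != "atx":
        cmd += ["--heading-style", heading_style]
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Validating the URL scheme and wait state up front surfaces typos before a browser is launched.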
## Troubleshooting

**Issue: "Error fetching webpage: Timeout"**

- Increase timeout: `--timeout 60000` (60 seconds)
- Check URL is accessible
- Try different wait state: `--wait-for load`

**Issue: "Extracted Markdown is incomplete"**

- Use `--wait-for networkidle` to ensure full page load
- Add `--verbose` to see execution details
- Check if content requires authentication

**Issue: "Too much unwanted content"**

- Script automatically removes common non-content elements (nav, footer, ads)
- Post-process Markdown to remove specific sections
- See [references/examples.md](references/examples.md) for custom filtering

**Issue: "command not found: python"**

- Use `python3` instead of `python`
- Or create alias: `alias python=python3`

### Debugging

Use `--verbose` flag for detailed execution steps:

```bash
python3 scripts/fetch_as_markdown.py https://example.com --verbose
```

## Advanced Usage

For complex scenarios and edge case handling, see [references/examples.md](references/examples.md):

- Authentication and cookies
- Custom element filtering
- Batch processing strategies
- Performance optimization

## Dependencies

Required packages:

```bash
pip install playwright markdownify beautifulsoup4
playwright install chromium
```

**Requirements:**

- Python 3.8+
- playwright>=1.40.0
- markdownify>=0.11.0
- beautifulsoup4>=4.12.0

**Installation:**

```bash
# Install Python packages
pip3 install playwright markdownify beautifulsoup4

# Install Playwright browsers (required)
python3 -m playwright install chromium
```

## Resources

This skill includes:

### scripts/

- `fetch_as_markdown.py` - Main conversion script with Playwright automation

### references/

- `examples.md` - Advanced usage patterns and edge cases
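
Before running the script, it can help to confirm that the packages listed under Dependencies are importable. The sketch below is an illustration rather than part of the skill: `missing_modules` and `python_is_supported` are hypothetical helpers, and the check covers only the Python version and importability, not package versions or whether the Chromium browser has been installed.

```python
import importlib.util
import sys
from typing import Iterable, List, Sequence

# Import names for the packages under Dependencies.
# Note that beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ("playwright", "markdownify", "bs4")
MIN_PYTHON = (3, 8)


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(version: Sequence[int] = sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the Python 3.8+ requirement."""
    return tuple(version[:2]) >= MIN_PYTHON
```

Calling `missing_modules()` with no arguments returns the subset of `playwright`, `markdownify`, and `bs4` that still needs to be installed with `pip3 install`; an empty list means all three were found.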