# web-scraper > Modern web scraping with structured data extraction. Fetch web pages, extract content using CSS selectors, parse structured data (JSON-LD, Open Graph, meta tags), and handle pagination. - Author: Sahiix@1 - Repository: sahiixx/moltworker - Version: 20260201214731 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/sahiixx/moltworker - Web: https://mule.run/skillshub/@@sahiixx/moltworker~web-scraper:20260201214731 --- --- name: web-scraper description: Modern web scraping with structured data extraction. Fetch web pages, extract content using CSS selectors, parse structured data (JSON-LD, Open Graph, meta tags), and handle pagination. --- # Web Scraper Modern web scraping with intelligent content extraction. ## Quick Start ### Fetch and Extract ```bash node /path/to/skills/web-scraper/scripts/fetch.js https://example.com ``` ### Extract with Selectors ```bash node /path/to/skills/web-scraper/scripts/extract.js https://example.com --selector "h1,h2,p" ``` ### Get Structured Data ```bash node /path/to/skills/web-scraper/scripts/metadata.js https://example.com ``` ## Scripts ### fetch.js Fetch web page content with smart extraction. **Usage:** ```bash node fetch.js [OPTIONS] ``` **Options:** - `--output ` - Output format: text, html, markdown (default: text) - `--timeout ` - Request timeout (default: 30000) - `--user-agent ` - Custom User-Agent string - `--headers ` - Custom headers as JSON - `--follow` - Follow redirects (default: true) ### extract.js Extract specific elements using CSS selectors. **Usage:** ```bash node extract.js --selector [OPTIONS] ``` **Options:** - `--selector ` - CSS selector (required) - `--attr ` - Extract attribute instead of text - `--multiple` - Return all matches (default: first only) - `--json` - Output as JSON array ### metadata.js Extract structured metadata from pages. **Usage:** ```bash node metadata.js [OPTIONS] ``` **Extracts:** - Title, description, canonical URL - Open Graph tags (og:title, og:image, etc.) - Twitter Card data - JSON-LD structured data - Meta tags ### links.js Extract and analyze links from a page. **Usage:** ```bash node links.js [OPTIONS] ``` **Options:** - `--internal` - Only internal links - `--external` - Only external links - `--filter ` - Filter by URL pattern - `--format ` - Output: json, csv, list ### sitemap.js Parse and process XML sitemaps. **Usage:** ```bash node sitemap.js [OPTIONS] ``` **Options:** - `--discover` - Auto-discover sitemap from robots.txt - `--filter ` - Filter URLs by pattern - `--limit ` - Limit number of URLs ## Examples ### Extract Article Content ```bash node fetch.js https://blog.example.com/article --output markdown ``` ### Get All Product Links ```bash node extract.js https://shop.example.com --selector "a.product-link" --attr href --multiple ``` ### Extract Open Graph Data ```bash node metadata.js https://example.com ``` Output: ```json { "title": "Example Page", "description": "Page description", "openGraph": { "title": "Example OG Title", "image": "https://example.com/image.jpg", "type": "website" } } ``` ### Get External Links ```bash node links.js https://example.com --external --format csv ``` ### Process Sitemap ```bash node sitemap.js https://example.com/sitemap.xml --filter "/blog/" ``` ## Output Formats ### fetch.js (markdown) ```markdown # Page Title Main content extracted and converted to markdown... ## Section Heading Paragraph text with [links](https://example.com). ``` ### extract.js (JSON) ```json { "url": "https://example.com", "selector": "h2", "matches": [ { "text": "First Heading", "html": "

First Heading

" }, { "text": "Second Heading", "html": "

Second Heading

" } ], "count": 2 } ``` ### metadata.js ```json { "url": "https://example.com", "title": "Page Title", "description": "Meta description", "canonical": "https://example.com/page", "openGraph": { "title": "OG Title", "description": "OG Description", "image": "https://example.com/og-image.jpg", "type": "article" }, "twitterCard": { "card": "summary_large_image", "site": "@example" }, "jsonLd": [ { "@type": "Article", "headline": "Article Title" } ] } ``` ## Best Practices - **Respect robots.txt**: Check before scraping - **Rate limiting**: Add delays between requests - **User-Agent**: Use descriptive UA strings - **Caching**: Cache responses when appropriate - **Error handling**: Handle 404s, timeouts gracefully ## Notes - For JavaScript-rendered pages, use the `cloudflare-browser` skill - This skill uses HTTP fetch (no JavaScript execution) - Some sites may block automated requests