# xiaohongshu-scraper > Scrapes Xiaohongshu (Little Red Book) posts to extract content, author info, and engagement metrics (likes, comments, shares, favorites). Use when user wants to scrape Xiaohongshu posts, extract post data, or analyze Xiaohongshu content. - Author: dingbohan89-coder - Repository: dingbohan89-coder/dingbohan123 - Version: 20260105185550 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-07 - Source: https://github.com/dingbohan89-coder/dingbohan123 - Web: https://mule.run/skillshub/@@dingbohan89-coder/dingbohan123~xiaohongshu-scraper:20260105185550 --- --- name: xiaohongshu-scraper description: Scrapes Xiaohongshu (Little Red Book) posts to extract content, author info, and engagement metrics (likes, comments, shares, favorites). Use when user wants to scrape Xiaohongshu posts, extract post data, or analyze Xiaohongshu content. --- # Xiaohongshu Post Scraper ## Quick Start Scrape a Xiaohongshu post by URL: ```bash node scripts/scrape.js "https://www.xiaohongshu.com/discovery/item/POST_ID" ``` Output saved to: `results/xhs_post_TIMESTAMP.json` ## First Time Setup ### Install Dependencies ```bash cd xiaohongshu-skill npm install ``` ### Login (Required) ```bash node scripts/login.js ``` This opens a browser window. Complete the login process manually. Cookies are saved automatically for future use. ## How It Works The scraper uses Playwright to automate a real browser: 1. Loads saved cookies (or prompts for login) 2. Opens the post URL in a browser 3. Waits for page to fully load (20 seconds) 4. Extracts data from the page DOM 5. Saves results to JSON file 6. Updates cookies for next use ## Available Scripts ### scrape.js - Main Scraper Scrapes a single post: ```bash node scripts/scrape.js "POST_URL" ``` **Extracts:** - Title - Author - Content/body text - Images (URLs) - Likes (点赞数) - Favorites (收藏数) - Comments (评论数) - Shares (分享数) **Output format:** ```json { "url": "post_url", "timestamp": "2025-12-31T...", "data": { "title": "post_title", "author": "author_name", "content": "post_content", "likes": "1.4万", "collects": "number", "comments": "number", "shares": "number", "images": ["url1", "url2", ...] } } ``` ### get-stats.js - Statistics Only Fetches engagement metrics only: ```bash node scripts/get-stats.js "POST_URL" ``` **Output:** ```json { "url": "post_url", "timestamp": "2025-12-31T...", "stats": { "likes": "1.4万", "collects": "number", "comments": "number", "shares": "number" } } ``` ## Data Limitations **Platform Restrictions:** - ❌ **View count** (预览量/浏览量) - Not publicly displayed, cannot be extracted - ⚠️ **Comments/Shares** - Occasionally unavailable (~80% success rate) - ✅ **Likes/Favorites** - Usually available **Access Issues:** Some posts may show "当前笔记暂时无法浏览" (Post temporarily unavailable) due to: - Post deleted by author - Post set to private/followers-only - Account restrictions - Regional limitations ## Directory Structure ``` xiaohongshu-skill/ ├── SKILL.md # This file ├── scripts/ │ ├── scrape.js # Main scraper │ ├── get-stats.js # Stats extraction │ └── login.js # Login helper ├── results/ # Scraped data (auto-created) └── xhs_cookies.json # Saved login state (auto-created) ``` ## Troubleshooting ### Post shows "无法浏览" (Unavailable) **Possible causes:** 1. Login state expired 2. Post is private or deleted 3. Requires specific account to view **Solutions:** ```bash # Delete old cookies and re-login rm xhs_cookies.json node scripts/login.js ``` ### Browser doesn't open Make sure Playwright Chromium is installed: ```bash npx playwright install chromium ``` ### Extraction fails 1. Check if the post URL is accessible in a browser 2. Verify you're logged in to the correct account 3. Try waiting longer - increase timeout in scripts/scrape.js ## Examples ### Example 1: Scrape a Single Post ```bash node scripts/scrape.js "https://www.xiaohongshu.com/discovery/item/6476b9c20000000011011309" ``` **Result:** `results/xhs_post_1767173214027.json` ### Example 2: Get Statistics Only ```bash node scripts/get-stats.js "https://www.xiaohongshu.com/discovery/item/6476b9c20000000011011309" ``` **Result:** Just the engagement metrics ### Example 3: Batch Scraping ```bash # Create a list of URLs cat urls.txt | while read url; do node scripts/scrape.js "$url" done ``` ## Technical Notes **Technology Stack:** - Playwright - Browser automation - Node.js - Runtime environment - Chromium - Headless browser **Why Playwright:** - Simulates real user behavior - Handles dynamic content - Works with logged-in sessions - Reliable for JavaScript-heavy sites **Wait Times:** - 20 seconds default for page load - Ensures all content is loaded - Can be adjusted in scripts/scrape.js ## Best Practices 1. **Always log in from a real account** - Some content requires login 2. **Respect rate limits** - Add delays between requests 3. **Check cookies periodically** - Re-login if scraping fails 4. **Verify URLs** - Make sure posts are public when sharing scripts 5. **Handle errors gracefully** - Check for "无法浏览" in results ## Related Skills For TikTok scraping, see: `../tiktok-skill/SKILL.md`