# brightspace-scraper

> Scrape course materials from Brightspace LMS. Use when (1) need to download course content (slides, labs, assignments), (2) organize course materials locally, (3) filter specific module types, (4) batch download from multiple courses.

- Author: Peng Wang
- Repository: zhizhunbao/aisd
- Version: 20260122183808
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/zhizhunbao/aisd
- Web: https://mule.run/skillshub/@@zhizhunbao/aisd~brightspace-scraper:20260122183808

---

---
name: brightspace-scraper
description: Scrape course materials from Brightspace LMS. Use when (1) need to download course content (slides, labs, assignments), (2) organize course materials locally, (3) filter specific module types, (4) batch download from multiple courses.
---

# Brightspace Course Scraper

## Objectives

- Automate downloading of course materials from Brightspace LMS
- Organize content by course and module hierarchy
- Filter specific content types (slides, labs, assignments, etc.)
- Handle authentication and session management
- Avoid re-downloading unchanged content

## Script Location

`.skills/learning-brightspace_scraper/scripts/brightspace/scraper.py`

## Quick Start

### 1. First Time Setup - Login

```bash
cd .skills/learning-brightspace_scraper/scripts
uv run python run.py --login-only
```

This opens a browser so you can log in manually. The session is saved to `.session.json` and reused on later runs.

### 2. List Available Courses

```bash
uv run python run.py --list-courses
```

Shows all enrolled courses with their IDs.

### 3. Scrape Entire Course

```bash
uv run python run.py --course 846088
```

### 4. Scrape Specific Module Type

```bash
# Only slides
uv run python run.py --course 846088 --module slides

# Only labs
uv run python run.py --course 846088 --module labs

# Only assignments
uv run python run.py --course 846088 --module assignment

# Specific week
uv run python run.py --course 846088 --module "Week 1"
```

## Configuration

Edit `.skills/learning-brightspace_scraper/scripts/brightspace/config.py`:

```python
from pathlib import Path

COURSES = {
    "846088": "ml",  # Course ID -> local directory name
    "846083": "nlp",
    "846092": "mv",
    "846085": "rl",
}

OUTPUT_DIR = Path(__file__).parent / "data"  # Temporary storage
```

## Command Line Options

| Option           | Short | Description                            |
| ---------------- | ----- | -------------------------------------- |
| `--course`       | `-c`  | Course ID to scrape                    |
| `--module`       | `-m`  | Filter modules by name (partial match) |
| `--headless`     |       | Run browser in headless mode           |
| `--login-only`   |       | Only perform login and save session    |
| `--list-courses` | `-l`  | List all available courses             |
| `--keep-open`    | `-k`  | Keep browser open after completion     |
| `--dump-html`    |       | Save page HTML for debugging           |

## How It Works

### 1. Authentication

- Uses Playwright to automate the browser
- Saves session cookies to `.session.json`
- Reuses the session for subsequent runs
- Requires manual login only once, until the session expires

### 2. Content Discovery

- Navigates the course content tree
- Parses the module hierarchy (parent/child relationships)
- Identifies content types (PDF, PPTX, links, HTML pages)
- Tracks content with unique IDs

### 3. Smart Downloading

- Computes a content hash to detect changes
- Skips unchanged files (hashes are stored in `.content_hashes.json`)
- Downloads files by clicking the "Download" button
- Extracts external links to `links.md`
- Saves HTML snapshots for reference

### 4. Module Filtering

When `--module` is specified:

- Matching is case-insensitive and partial
- The filter is compared against the module title and its full path
- Parent modules are entered automatically if any of their children match
- Example: `--module slides` matches "Slides", "Week 1 Slides", "Course Slides"

### 5. Content Organization

```
data/                           # Root data directory
└── ml/                         # Course directory
    ├── .content_hashes.json    # Change detection
    ├── index.html              # Course home page
    ├── Week 1/                 # Module
    │   ├── index.html
    │   ├── Slides/             # Sub-module
    │   │   ├── index.html
    │   │   ├── 12345_Lecture1.pdf
    │   │   └── links.md
    │   └── Labs/
    │       ├── index.html
    │       └── 67890_Lab1.pdf
    └── Week 2/
        └── ...
```

## Common Workflows

### Workflow 1: Download All Course Materials

```bash
cd .skills/learning-brightspace_scraper/scripts

# Login once
uv run python run.py --login-only

# Scrape all configured courses
uv run python run.py
```

### Workflow 2: Update Specific Content Type

```bash
# Only download new slides
uv run python run.py --course 846088 --module slides

# Only download new labs
uv run python run.py --course 846088 --module labs
```

### Workflow 3: Download Single Week

```bash
uv run python run.py --course 846088 --module "Week 3"
```

### Workflow 4: Debug Scraping Issues

```bash
# Keep browser open to inspect
uv run python run.py --course 846088 --keep-open

# Dump HTML for analysis
uv run python run.py --course 846088 --dump-html
```

## Validation

After scraping, verify:

- [ ] Files downloaded to `data/{course_name}/`
- [ ] Module hierarchy preserved in directory structure
- [ ] `.content_hashes.json` created for change tracking
- [ ] `links.md` files contain external resources
- [ ] No duplicate downloads on re-run

## Troubleshooting

### Session Expired

**Symptom:** Redirected to login page

**Solution:**

```bash
cd .skills/learning-brightspace_scraper/scripts/brightspace
rm .session.json

# Back to the scripts directory, where run.py lives (see Quick Start)
cd ..
uv run python run.py --login-only
```

### Missing Content

**Symptom:** Expected files not downloaded

**Solution:**

- Check whether the content is in a sub-module (use `--keep-open` to inspect)
- Verify the module filter isn't too restrictive
- Check the HTML snapshots for the content structure

### Download Button Not Found

**Symptom:** "No download button found" message

**Solution:**

- The content might be embedded (check the HTML snapshot)
- Try without `--headless` to watch the browser behavior
- The file might be in an iframe or require special handling

### Rate Limiting

**Symptom:** Slow downloads or timeouts

**Solution:**

- The script includes random delays (1-3 seconds)
- Adjust `MIN_DELAY` and `MAX_DELAY` in the script if needed

## Integration with Course Organization

After scraping, move content to course directories:

```bash
# Manual organization
cp data/ml/Week\ 1/Slides/*.pdf courses/ml/slides/
```

## Best Practices

1. **Login once per session** - Session persists across runs
2. **Use module filters** - Faster and more targeted
3. **Run incrementally** - Only new/changed content is downloaded
4. **Check HTML snapshots** - Useful for debugging structure
5. **Keep browser open for debugging** - Use `--keep-open` when troubleshooting
6. **Organize after scraping** - Move from `data/` to `courses/` structure

## Dependencies

```bash
# Install with uv
uv add playwright

# Install browser
uv run playwright install chromium
```

## Advanced Usage

### Custom Course Mapping

Add new courses to `config.py`:

```python
COURSES = {
    # Keep the existing entries and append new ones
    "123456": "new-course",
}
```

### Modify Content Detection

Edit the `_process_item()` method to handle new file types:

```python
file_types = ["pdf", "ppt", "pptx", "doc", "docx", "ipynb", "py"]
```

### Change Output Directory

Edit `config.py`:

```python
from pathlib import Path

OUTPUT_DIR = Path("/custom/path/to/output")
```

**For implementation details:** See `scripts/brightspace/scraper.py` source code
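
### Illustrative Sketch: Module Filtering and Change Detection

The module filtering and hash-based skipping described under "How It Works" can be sketched roughly as follows. This is a minimal illustration, not the code in `scraper.py`: the function names (`module_matches`, `content_hash`, `has_changed`), the choice of SHA-256, and the layout of `.content_hashes.json` as a flat mapping from content ID to digest are all assumptions made for the example. Entering parent modules when a child matches is not covered here.

```python
import hashlib
import json
from pathlib import Path


def module_matches(filter_text: str, title: str, full_path: str) -> bool:
    """Case-insensitive partial match against a module title or its full path.

    Hypothetical helper; the real scraper may match differently.
    """
    needle = filter_text.casefold()
    return needle in title.casefold() or needle in full_path.casefold()


def content_hash(data: bytes) -> str:
    """Hex digest of the content (SHA-256 is an assumption, not confirmed)."""
    return hashlib.sha256(data).hexdigest()


def has_changed(hash_file: Path, content_id: str, data: bytes) -> bool:
    """Return True if the content is new or changed, and record its hash.

    Assumes the hash file is a JSON object mapping content IDs to digests.
    """
    hashes = json.loads(hash_file.read_text()) if hash_file.exists() else {}
    digest = content_hash(data)
    if hashes.get(content_id) == digest:
        return False  # unchanged: the caller can skip the download
    hashes[content_id] = digest
    hash_file.write_text(json.dumps(hashes, indent=2))
    return True
```

A caller would typically check `module_matches(...)` before descending into a module and `has_changed(...)` before saving a file, so a second run over unchanged content writes nothing new.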