# scholarag

> Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.

- Author: HosungYou
- Repository: HosungYou/ScholaRAG
- Version: 20260130183059
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/HosungYou/ScholaRAG
- Web: https://mule.run/skillshub/@@HosungYou/ScholaRAG~scholarag:20260130183059

---

---
name: scholarag
description: Build PRISMA 2020-compliant systematic literature review systems with RAG-powered analysis in VS Code. Use when researcher needs automated paper retrieval (Semantic Scholar, OpenAlex, arXiv), AI-assisted PRISMA screening (50% or 90% threshold), vector database creation (ChromaDB), or research conversation interface. Supports knowledge_repository (comprehensive, 15K+ papers, teaching/exploration) and systematic_review (publication-quality, 50-300 papers, meta-analysis) modes. Conversation-first workflow with 7 stages.
---

# ScholaRAG: Systematic Review Automation Skill

**For**: Claude Code (AI assistant in VS Code)

**Purpose**: Guide researchers through PRISMA 2020 systematic literature review + RAG-powered analysis

---

## Quick Start (5 minutes)

### For Researchers

1. **Initialize project**: `python scholarag_cli.py init`
2. **Paste Stage 1 prompt**: Copy from [https://www.scholarag.com/guide/01-introduction](https://www.scholarag.com/guide/01-introduction)
3. **Answer Claude's questions** → Config created automatically
4. **Proceed through 7 stages** conversationally

### For AI Assistants (Claude Code)

When researcher provides a ScholaRAG prompt:

1. **Check for HTML metadata block** (`<!-- ... -->` at top of prompt)
2. **Identify current stage** (1-7) from metadata `stage` field
3. **Follow conversation pattern** (from metadata `conversation_flow`)
4. **Validate completion** (using metadata `validation_rules`)
5. **Auto-execute commands** (when `auto_execute: true`)
6. **Update `.claude/context.json`** (track progress)
7. **Show next stage prompt** (from metadata `next_stage.prompt_file`)

**Researcher should NEVER touch terminal**. You execute all scripts automatically.

---

## 7-Stage Workflow Overview

| Stage | Name | Read This File | Duration | Auto-Execute |
|-------|------|----------------|----------|--------------|
| 1 | Research Setup | [skills/claude_only/stage1_research_setup.md](skills/claude_only/stage1_research_setup.md) | 15-20 min | ✅ `scholarag init` |
| 2 | Query Strategy | [skills/claude_only/stage2_query_strategy.md](skills/claude_only/stage2_query_strategy.md) | 15-25 min | ❌ Manual review |
| 3 | PRISMA Config | [skills/claude_only/stage3_prisma_config.md](skills/claude_only/stage3_prisma_config.md) | 20-30 min | ❌ Manual review |
| 4 | RAG Design | [skills/claude_only/stage4_rag_design.md](skills/claude_only/stage4_rag_design.md) | 10-15 min | ❌ Manual review |
| 5 | Execution | [skills/claude_only/stage5_execution.md](skills/claude_only/stage5_execution.md) | 2-4 hours | ✅ Run all 5 scripts |
| 6 | Research Conversation | [skills/claude_only/stage6_research_conversation.md](skills/claude_only/stage6_research_conversation.md) | Ongoing | ❌ Interactive |
| 7 | Documentation | [skills/claude_only/stage7_documentation.md](skills/claude_only/stage7_documentation.md) | 30-60 min | ✅ Generate PRISMA |

**Progressive Disclosure**: Load stage file **only when researcher enters that stage**. Don't preload all 7 stages (token waste).
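
The progressive-disclosure rule can be sketched in Python as a lookup from stage number to the one file that should be read. This is a minimal illustration, not part of ScholaRAG itself: the file paths are copied from the table above, while `STAGE_FILES`, `stage_file`, and `load_stage` are hypothetical names.

```python
from pathlib import Path

# Stage number -> stage file, copied from the workflow table.
# The dict and both helpers are hypothetical, for illustration only.
STAGE_FILES = {
    1: "skills/claude_only/stage1_research_setup.md",
    2: "skills/claude_only/stage2_query_strategy.md",
    3: "skills/claude_only/stage3_prisma_config.md",
    4: "skills/claude_only/stage4_rag_design.md",
    5: "skills/claude_only/stage5_execution.md",
    6: "skills/claude_only/stage6_research_conversation.md",
    7: "skills/claude_only/stage7_documentation.md",
}


def stage_file(stage: int) -> str:
    """Return the path of the single stage file to load for `stage`."""
    try:
        return STAGE_FILES[stage]
    except KeyError:
        raise ValueError(f"stage must be 1-7, got {stage!r}") from None


def load_stage(stage: int, root: Path = Path(".")) -> str:
    """Read only the current stage's file; no other stage file is opened."""
    return (root / stage_file(stage)).read_text(encoding="utf-8")
```

Keeping the mapping in one place means a stage transition only changes the number passed in, and an out-of-range stage fails loudly instead of loading an unrelated file.
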

---

## Critical Branching Points

### 🔴 project_type (Stage 1 Decision)

**Two modes available**:

| Mode | Threshold | Output | Best For |
|------|-----------|--------|----------|
| `knowledge_repository` | 50% (lenient) | 15K-20K papers | Teaching, AI assistant, exploration |
| `systematic_review` | 90% (strict) | 50-300 papers | Meta-analysis, publication |

**Quick decision**:

- Publishing systematic review? → `systematic_review` ✅
- Comprehensive domain coverage? → `knowledge_repository` ✅

**Detailed decision tree**: [skills/reference/project_type_decision_tree.md](skills/reference/project_type_decision_tree.md)

**When to read decision tree**:

- Researcher asks: "Which project_type should I choose?"
- Researcher says: "I'm unsure about my research goals"
- Stage 1 initialization (proactively offer decision helper)

---

### 🔴 Stage 6 Scenarios (7 Research Modes)

**Stage 6 branches into 7 specialized conversation scenarios**:

1. **overview** (Context Scanning): High-level themes, methods, findings
2. **hypothesis** (Hypothesis Validation): Evidence for/against with effect sizes
3. **statistics** (Statistical Extraction): RCT data table (tools, Cohen's d, samples)
4. **methods** (Methodology Comparison): RCT vs quasi vs mixed methods
5. **contradictions** (Contradiction Detection): Conflicting results + analysis
6. **policy** (Policy Translation): Actionable recommendations for stakeholders
7. **grant** (Future Research Design): Follow-up study design + hypotheses

**Details**: [skills/claude_only/stage6_research_conversation.md](skills/claude_only/stage6_research_conversation.md)

**When to read**: Stage 6 entry (researcher asks "What can I query?")

---

## Error Recovery

**When errors occur**: [skills/reference/error_recovery.md](skills/reference/error_recovery.md)

**Quick fixes** (common issues):

| Error | Quick Fix | Detailed Guide |
|-------|-----------|----------------|
| Too many papers (>30K) | Refine query in Stage 2, re-run fetch | error_recovery.md §2.1 |
| API key missing | Add `ANTHROPIC_API_KEY` to `.env` | error_recovery.md §3.1 |
| Low PDF success (<30%) | Filter for `open_access` in Stage 1 | error_recovery.md §4.1 |
| All papers excluded (0 papers) | Lower threshold or broaden query | error_recovery.md §3.2 |

---

## Reference Materials (Load Only When Needed)

**Progressive disclosure**: Don't preload these. Read **only when** researcher asks specific questions.

| Topic | File | When to Read |
|-------|------|--------------|
| API endpoints | [skills/reference/api_reference.md](skills/reference/api_reference.md) | Researcher asks about Semantic Scholar, OpenAlex, arXiv |
| Config schema | [skills/reference/config_schema.md](skills/reference/config_schema.md) | Researcher asks "What fields are in config.yaml?" |
| PRISMA checklist | [skills/reference/prisma_guidelines.md](skills/reference/prisma_guidelines.md) | Researcher asks about PRISMA 2020 compliance |
| Troubleshooting | [skills/reference/troubleshooting.md](skills/reference/troubleshooting.md) | Researcher reports errors not in Quick Fixes |

---

## Architecture Overview

**File dependencies**: [https://www.scholarag.com/codebook/architecture](https://www.scholarag.com/codebook/architecture)

**Key principle**: Scripts read from `config.yaml` (single source of truth), **never hardcode values**.

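
As a rough sketch of this principle, a script can derive its mode-specific settings from the parsed config instead of scattering literals through the code. The thresholds and diagram titles below are the ones stated in this document; `MODES`, `mode_settings`, and the assumption that `project_type` is a top-level key in `config.yaml` are illustrative only, and the actual scripts may be organized differently.

```python
from __future__ import annotations

# project_type -> (screening threshold in percent, PRISMA diagram title).
# Values come from this document; the table's name and shape are hypothetical.
MODES: dict[str, tuple[int, str]] = {
    "knowledge_repository": (50, "Knowledge Repository"),
    "systematic_review": (90, "Systematic Review"),
}


def mode_settings(config: dict) -> tuple[int, str]:
    """Return (threshold, diagram title) for the configured project_type."""
    project_type = config.get("project_type")  # assumed to be a top-level key
    if project_type not in MODES:
        raise ValueError(f"unknown project_type: {project_type!r}")
    return MODES[project_type]


# In a real script the dict would be loaded from config.yaml (for example
# with PyYAML's yaml.safe_load); a literal stands in for it here.
threshold, title = mode_settings({"project_type": "systematic_review"})
print(threshold, title)
```

Because every consumer calls the same function, changing `project_type` in the config is enough to switch both the screening threshold and the diagram title.
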
**Critical scripts** (read `project_type` from config):

- `03_screen_papers.py`: Sets threshold (50% or 90%)
- `07_generate_prisma.py`: Changes diagram title ("Knowledge Repository" vs "Systematic Review")

---

## For Codex Users

**If researcher is using OpenAI Codex instead of Claude Code**: See [AGENTS.md](AGENTS.md) for bash-based task workflows.

Codex workflow differs:

- **Task-oriented** (not conversation-oriented)
- **Bash commands** (not validation rules)
- **Exit codes** (not metadata parsing)

**Universal reference files** (Claude + Codex both use):

- `skills/reference/project_type_decision_tree.md`
- `skills/reference/api_reference.md`
- `skills/reference/config_schema.md`

---

## Token Optimization Notes

- **This file**: ~400 lines (loaded once per conversation)
- **Stage-specific files**: ~300-500 lines each (loaded on-demand)
- **Total per conversation**: ~700 lines (this file + current stage file)
- **Previous approach**: ~2,000 lines (all context upfront)
- **Token reduction**: **65%** ✅

**How it works**:

1. Researcher starts Stage 1 → You load this file + `stage1_research_setup.md`
2. Researcher moves to Stage 2 → You load `stage2_query_strategy.md` (Stage 1 file unloaded)
3. Reference files loaded **only when** researcher asks (e.g., "How does Semantic Scholar API work?")

---

## Metadata Block Format

**All prompts in `prompts/*.md` contain HTML comment metadata at top**:

```html
<!--
...
-->
```

**How to use**:

1. **Parse YAML inside HTML comment** (lines between `<!--` and `-->`)
2. **Extract fields**: `stage`, `expected_turns`, `validation_rules`, `cli_commands`, `next_stage`
3. **Follow conversation pattern**: Ask questions matching `expected_turns` count
4. **Validate**: Check user inputs against `validation_rules`
5. **Execute**: Run `cli_commands` when conversation complete
6. **Transition**: Show prompt from `next_stage.prompt_file`

---

## Divergence Handling

**Common researcher confusions** (from metadata `divergence_handling`):

### Divergence 1: "Can you help me download PDFs?" (in Stage 1)

**Response**: "PDF downloading happens in Stage 5 (Execution), after papers have been fetched and screened. Right now in Stage 1, let's first define your research scope and choose project_type. We'll design queries in Stage 2, configure PRISMA in Stage 3, design the RAG system in Stage 4, then download PDFs in Stage 5."

### Divergence 2: "I want to skip systematic review" (in Stage 1)

**Response**: "If you don't need publication-quality systematic review, choose `project_type: knowledge_repository` in the next question. This mode uses lenient filtering (50% threshold) for comprehensive domain coverage (15K-20K papers). It's perfect for teaching materials, AI assistants, or exploratory research."

### Divergence 3: "What's the difference between the two modes?" (in Stage 1)

**Response**: "Let me explain:

**knowledge_repository**:

- 50% threshold (lenient, removes only spam)
- 15,000-20,000 papers output
- For: Teaching, exploration, AI assistant

**systematic_review**:

- 90% threshold (strict, PRISMA 2020)
- 50-300 papers output
- For: Meta-analysis, publication

See full decision tree: [skills/reference/project_type_decision_tree.md](skills/reference/project_type_decision_tree.md)"

---

## Conversation Flow Example (Stage 1)

**Typical pattern** (6-10 turns):

1. **Turn 1**: Researcher provides research topic
   - **You ask**: "Is this for exploratory domain mapping or publication-quality systematic review?"
2. **Turn 2-3**: Researcher answers scope questions
   - **You suggest**: `project_type` based on answers, explain threshold implications
   - **Example**: "Based on your goal of meta-analysis, I recommend `systematic_review` mode with 90% screening threshold."
3. **Turn 4-5**: Researcher confirms project_type choice
   - **You suggest**: Year range, publication types, expected databases
   - **Example**: "For language learning studies, I recommend 2015-2025 (10 years) focusing on Semantic Scholar and ERIC."
4. **Turn 6-8**: Researcher provides final details (domain, year range)
   - **You summarize**: All decisions, ask for confirmation
   - **Example**: "Here's what I'll create: [summary]. Ready to initialize?"
5. **Turn 9-10**: Researcher confirms initialization
   - **You execute**: `scholarag_cli.py init`, create `config.yaml`, show next steps
   - **Example**: "✅ Project initialized! Next, let's design your search query in Stage 2."

---

## Completion Checklist (Stage-Specific)

**Stage 1 example** (from metadata `completion_checklist`):

- [ ] `project_name` is descriptive and unique (≥10 chars)
- [ ] `research_question` is specific and answerable (≥20 chars)
- [ ] `project_type` chosen with understanding of implications (50% vs 90%)
- [ ] `year_range` is realistic for scope (≤25 years, not before 2000)
- [ ] `config.yaml` created successfully (file exists, valid YAML)

**When all checked** → Auto-execute `scholarag_cli.py init` → Show Stage 2 prompt

---

## Example Commands You Will Execute

### Stage 1: Initialize

```bash
python scholarag_cli.py init \
  --name "AI-Chatbots-Language-Learning" \
  --question "How do AI chatbots improve speaking proficiency in EFL learners?" \
  --project-type systematic_review
```

### Stage 5: Run Pipeline (All 5 Scripts)

```bash
# Fetch papers
python scripts/01_fetch_papers.py --project projects/YYYY-MM-DD_ProjectName

# Deduplicate
python scripts/02_deduplicate.py --project projects/YYYY-MM-DD_ProjectName

# Screen with AI
python scripts/03_screen_papers.py --project projects/YYYY-MM-DD_ProjectName

# Download PDFs
python scripts/04_download_pdfs.py --project projects/YYYY-MM-DD_ProjectName

# Build RAG
python scripts/05_build_rag.py --project projects/YYYY-MM-DD_ProjectName
```

### Stage 7: Generate PRISMA

```bash
python scripts/07_generate_prisma.py --project projects/YYYY-MM-DD_ProjectName
```

---

## Integration with .claude/context.json

**You should update this file after each stage**:

```json
{
  "current_stage": {
    "stage": 2,
    "name": "Query Strategy",
    "status": "in_progress",
    "started_at": "2025-10-24T10:30:00Z"
  },
  "completed_stages": [
    {
      "stage": 1,
      "name": "Research Setup",
      "completed_at": "2025-10-24T10:25:00Z",
      "outputs": {
        "project_name": "AI-Chatbots-Language-Learning",
        "research_question": "How do AI chatbots improve speaking proficiency?",
        "project_type": "systematic_review"
      }
    }
  ],
  "project": {
    "name": "AI-Chatbots-Language-Learning",
    "created": "2025-10-24",
    "research_question": "How do AI chatbots improve speaking proficiency in EFL learners?",
    "project_type": "systematic_review"
  }
}
```

**Purpose**: Track progress, enable `scholarag status` command to show current stage.

---

## FAQ for AI Assistants

### Q: Should I always read stage files in order (1→2→3...)?

**A**: No! Read **only the file for the current stage** the researcher is in. Use progressive disclosure.

### Q: What if researcher jumps to Stage 5 without completing Stages 1-4?

**A**: Check `.claude/context.json` for completed stages. If prerequisites are missing, politely redirect: "Stage 5 requires config.yaml from Stage 1, search_query from Stage 2, and PRISMA criteria from Stage 3. Let's complete those first."

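
The prerequisite check in the last answer can be sketched as follows. It assumes the `.claude/context.json` layout shown above (a `completed_stages` list whose entries carry a numeric `stage`) and, as a simplification, treats every lower-numbered stage as a prerequisite; `load_context` and `missing_prerequisites` are hypothetical helpers, not part of the ScholaRAG CLI.

```python
import json
from pathlib import Path


def load_context(path: str = ".claude/context.json") -> dict:
    """Read the progress file; an absent file means no stage is complete yet."""
    p = Path(path)
    if not p.exists():
        return {"completed_stages": []}
    return json.loads(p.read_text(encoding="utf-8"))


def missing_prerequisites(context: dict, target_stage: int) -> list:
    """List the stages below target_stage that are not recorded as completed."""
    done = {entry["stage"] for entry in context.get("completed_stages", [])}
    return [s for s in range(1, target_stage) if s not in done]


# Example: only Stage 1 is finished and the researcher asks for Stage 5.
context = {"completed_stages": [{"stage": 1, "name": "Research Setup"}]}
print(missing_prerequisites(context, 5))  # prints [2, 3, 4]
```

An empty result means the requested stage can proceed; otherwise the returned stage numbers can be named in the redirect message.
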
### Q: When should I read `skills/reference/` files?

**A**: **Only when researcher explicitly asks**. Examples:

- "How does Semantic Scholar API work?" → Read `api_reference.md`
- "What are all the config.yaml fields?" → Read `config_schema.md`
- "Why should I choose systematic_review?" → Read `project_type_decision_tree.md`

### Q: What if I don't understand metadata in prompts/*.md?

**A**: All metadata fields are documented in `skills/claude_only/metadata_spec.md`. Read that file if you encounter unknown fields.

---

## Additional Resources

**Detailed implementation guide**: See [CLAUDE.md](CLAUDE.md) for:

- 🎓 User profile (researchers with limited coding experience)
- How Claude Code should behave (DO/DON'T guidelines)
- Auto-execution patterns (echo pipes, CLI arguments)
- Full CLI reference and troubleshooting

**For Codex/Cursor users**: See [AGENTS.md](AGENTS.md) for task-based bash workflows

---

**Last Updated**: 2025-10-24 (v2.0 - Agent Skills Integration)

**Companion files**: CLAUDE.md (detailed guide), AGENTS.md (Codex workflows)

**Compatible with**: Claude Code v1.0+, Anthropic API

**Token Budget**: ~380 lines (this file) + ~300-500 lines (stage file) = ~700-900 lines per conversation