# data-pipeline

> Run the charity evaluation pipeline - extraction, 4-stage V2 workflow, Supabase patterns, versioning. Use when working on collectors, scrapers, pipeline code, database queries, or debugging data flow.

- Author: uabbasi
- Repository: uabbasi/good-measure-giving
- Version: 20260208202148
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-09
- Source: https://github.com/uabbasi/good-measure-giving
- Web: https://mule.run/skillshub/@@uabbasi/good-measure-giving~data-pipeline:20260208202148

---

---
name: data-pipeline
description: Run the charity evaluation pipeline - extraction, 4-stage V2 workflow, Supabase patterns, versioning. Use when working on collectors, scrapers, pipeline code, database queries, or debugging data flow.
---

# Data Pipeline

4-stage V2 charity evaluation pipeline with 100-point scoring.

**Philosophy**: Capture broadly, filter later. Correctness > cost, but we can have both.

---

## Quick Reference (V2 Pipeline)

| Stage | Entry Point | What It Does |
|-------|-------------|--------------|
| 1. Crawl | `crawl.py` | Collect data from 5 sources |
| 2. Process Data | `process_data.py` | Derive fields + reconcile sources |
| 3. Process Baseline | `process_baseline.py` | Generate baseline narratives + export + verify |
| 4. Process Rich | `process_rich.py` | Generate rich narratives + export + verify |

**Wrapper**: `./run_v2.sh` runs all 4 stages

---

## Decision Tree

**Working on data collection?**
→ See [extraction.md](extraction.md) for sources, patterns, red flags

**Working on pipeline phases or state machine?**
→ See [orchestration.md](orchestration.md) for workflow, CLI, transitions

**Working with database or debugging queries?**
→ See `data-pipeline/src/db/` for Supabase repositories

**Implementing freshness checks or versioning?**
→ See [versioning.md](versioning.md) for hashing, TTLs, skip logic

---

## State Machine

```
NOT_STARTED → COLLECTED → DERIVED → RECONCILED
    → BASELINE_QUEUED → BASELINE_REVIEW
    → RICH_QUEUED → RICH_REVIEW
    → APPROVED (terminal) or REJECTED (terminal)
```

Terminal states require `force=True` to transition.

---

## Key Files

```
data-pipeline/
├── run_v2.sh                    # Wrapper: all 4 stages
├── crawl.py                     # Stage 1: Collect data
├── process_data.py              # Stage 2: Derive + reconcile
├── process_baseline.py          # Stage 3: Baseline narratives
├── process_rich.py              # Stage 4: Rich narratives
├── src/
│   ├── collectors/              # 5 data sources
│   ├── evaluators/              # NarrativeEvaluator, Judge
│   ├── scorers/                 # V2 scoring (100-point scale)
│   ├── quality_judges/          # LLM-as-judge scorers
│   ├── database/                # Schema, WriteQueue, repository
│   └── cli/wizard.py            # Interactive menu (uv run z)
└── pilot_charities.txt          # Source of truth for EINs
```

---

## CLI Commands

```bash
# Full V2 pipeline
./run_v2.sh --charities pilot_charities.txt --workers 10

# Individual stages
uv run python crawl.py --charities pilot_charities.txt --workers 10
uv run python process_data.py --charities pilot_charities.txt
uv run python process_baseline.py --charities pilot_charities.txt --workers 5
uv run python process_rich.py --charities rich_charities.txt --workers 3

# Interactive wizard
uv run z

# Status
zakaat status --ein 95-4453134
```

---

## Critical Patterns

### Supabase Repositories

Data access via repository pattern in `src/db/`:

```python
from src.db import get_client
from src.db.charity_repository import CharityRepository

client = get_client()
repo = CharityRepository(client)
charity = repo.get_by_ein("95-4453134")
```

### Pilot Charities

All operations scope to `pilot_charities.txt`:

```python
from src.cli.wizard import get_pilot_eins
eins = get_pilot_eins()
```

---

## Anti-Patterns

**Don't:**
- Skip phases (must go in order)
- Hardcode EINs (use `pilot_charities.txt`)
- Fabricate missing data

**Do:**
- Use repository pattern for database access
- Track source for every datum
- Check freshness before expensive operations

---

## Related Skills

- **llm-prompting**: Prompt patterns, schema enforcement
- **form990-expert**: 990 parsing, financial analysis
- **zakat-fiqh**: Zakat classification, wallet tags