# data-collector

> Collect market data from external sources with ethical scraping practices, proper rate limiting, and full provenance tracking.

- Author: cgjen-box
- Repository: cjplanted/planted-market-model
- Version: 20251221143126
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/cjplanted/planted-market-model
- Web: https://mule.run/skillshub/@@cjplanted/planted-market-model~data-collector:20251221143126

---

---
name: data-collector
description: Ethical web scraping and API integration for market data collection. Triggers on tasks requiring: (1) fetching data from external sources (GFI, FAO, OECD, national statistics), (2) PDF parsing for report extraction, (3) API integrations with rate limiting, (4) data validation against schemas. Respects robots.txt, implements caching, and maintains provenance tracking.
---

# Data Collector Agent Skill

Collect market data from external sources with ethical scraping practices, proper rate limiting, and full provenance tracking.

## Core Principles

1. **Robots.txt First**: Always check robots.txt before any request
2. **Rate Limit**: At most 1 request per 2 seconds
3. **Cache Aggressively**: 24-hour TTL for static reports
4. **Provenance Always**: Track source, date, and page number for every data point

## Quick Reference

```python
from src.scrapers.ethical_scraper import EthicalScraper

scraper = EthicalScraper(
    user_agent="PlantedBot/1.0 (research@planted.ch)",
    rate_limit=0.5  # requests per second, i.e. one request every 2 seconds
)

# Check if allowed
if scraper.can_fetch(url):
    content = scraper.fetch(url, use_cache=True)
```

## Supported Sources

| Source | Type | Parser | Location |
|--------|------|--------|----------|
| GFI Reports | PDF | `gfi_parser.py` | `/src/scrapers/gfi_parser.py` |
| FAOSTAT | API | `faostat_client.py` | `/src/scrapers/faostat_client.py` |
| OECD | API | `oecd_client.py` | `/src/scrapers/oecd_client.py` |
| National Stats | Web | `national_scraper.py` | `/src/scrapers/national_scraper.py` |

## Data Output Schema

All collected data must conform to:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RawDataPoint:
    source: str                    # e.g., "gfi"
    source_url: str                # Original URL
    source_date: str               # Publication date
    extracted_at: datetime         # When we extracted
    geography: str                 # ISO 3166-1 alpha-2
    year: int
    metric: str                    # e.g., "alt_protein_retail_value"
    value: float
    unit: str                      # e.g., "USD_millions"
    confidence: float              # 0-1, how confident in extraction
    notes: Optional[str]
    page_reference: Optional[str]  # For PDFs
```

## PDF Extraction Workflow

```python
# 1. Download PDF
pdf_path = scraper.download_pdf(url, save_to="/data/raw/gfi/")

# 2. Extract text
from src.scrapers.pdf_extractor import PDFExtractor
extractor = PDFExtractor(pdf_path)

# 3. Find tables (most market data is in tables)
tables = extractor.extract_tables()

# 4. Find specific metrics
for table in tables:
    if "market size" in table.header.lower():
        data_points = parse_market_table(table)

# 5.
# Validate and save
for dp in data_points:
    if validate_data_point(dp):
        save_to_raw(dp)
```

## Error Handling

```python
# Retry with exponential backoff
@retry(tries=3, delay=2, backoff=2)
def fetch_with_retry(url):
    return scraper.fetch(url)

# Handle common errors (wrapped in a function so early returns are valid)
def collect(url):
    try:
        return fetch_with_retry(url)
    except RateLimitError:
        return wait_and_retry(url, wait_minutes=5)
    except RobotsTxtForbidden:
        log_warning(f"Blocked by robots.txt: {url}")
        return None
    except PDFParseError as e:
        log_error(f"PDF parse failed: {e}")
        add_to_manual_review(url)
        return None
```

## Caching Strategy

```
# Cache structure
/data/raw/.cache/
├── {source}/
│   ├── {url_hash}.json       # Cached response
│   └── {url_hash}.meta.json  # Cache metadata
```

Cache metadata:

```json
{
  "url": "https://...",
  "fetched_at": "2024-12-19T10:00:00Z",
  "expires_at": "2024-12-20T10:00:00Z",
  "etag": "...",
  "content_hash": "sha256:..."
}
```

## Source-Specific Notes

### GFI Reports

- Annual "State of the Industry" PDFs
- Data tables usually in a consistent format
- Investment data also available (useful for secondary analysis)
- Sometimes have regional breakdowns

### FAOSTAT

- REST API with good documentation
- Rate limit: ~100 requests/minute
- Data lag: ~2 years behind current year
- Use for meat consumption baseline

### OECD-FAO

- Agricultural outlook reports
- PDF + some structured data
- 10-year projections
- Useful for validation

## Validation Checks

```python
def validate_data_point(dp: RawDataPoint) -> bool:
    checks = [
        dp.value > 0,
        2010 <= dp.year <= 2030,
        dp.geography in VALID_COUNTRIES,
        dp.confidence >= 0.5,
        dp.unit in VALID_UNITS,
    ]
    return all(checks)
```

## Swarm Coordination

When the orchestrator spawns multiple data collectors:

```python
# Orchestrator divides work
sources = ["gfi", "faostat", "oecd", "national"]
for source in sources:
    orchestrator.spawn(
        agent="data-collector",
        task=f"collect-{source}",
        context={"source": source, "years": range(2015, 2025)},
    )

# Each agent reports back
def on_complete(results):
    save_to_interim(results)
    notify_orchestrator("collection_complete", source=self.source)
```

## Files in This Skill

- `scripts/init_scraper.py` - Initialize scraper with config
- `scripts/validate_raw_data.py` - Validate collected data
- `references/source_urls.md` - All data source URLs
- `references/parsing_rules.md` - Source-specific parsing rules
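## Appendix: Worked Sketches

The first two Core Principles (robots.txt first, minimum spacing between requests) can be sketched with only the standard library. `PoliteFetcher` below is a hypothetical stand-in for what `EthicalScraper` presumably does internally; it is not the project's actual implementation.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class PoliteFetcher:
    """Hypothetical sketch: robots.txt checks plus simple rate limiting."""

    def __init__(self, user_agent: str, rate_limit: float = 0.5):
        self.user_agent = user_agent
        self.min_interval = 1.0 / rate_limit  # 0.5 req/s -> 2 s between requests
        self._last_request = 0.0
        self._robots = {}  # cache one RobotFileParser per site root

    def can_fetch(self, url: str) -> bool:
        root = "{0.scheme}://{0.netloc}".format(urlsplit(url))
        if root not in self._robots:
            rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
            rp.read()  # network call; may raise if the host is unreachable
            self._robots[root] = rp
        return self._robots[root].can_fetch(self.user_agent, url)

    def throttle(self) -> None:
        # Sleep just long enough to keep at most `rate_limit` requests/second.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `throttle()` before every `fetch` is what guarantees the 2-second spacing regardless of how fast the caller loops.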
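The `.meta.json` layout in the Caching Strategy section makes expiry checks a one-liner. `is_cache_fresh` is a hypothetical helper, assuming only the ISO-8601 `expires_at` field shown in the metadata example:

```python
import json
from datetime import datetime, timezone
from typing import Optional


def is_cache_fresh(meta_json: str, now: Optional[datetime] = None) -> bool:
    """Return True if the cached response has not passed its expires_at."""
    meta = json.loads(meta_json)
    # The metadata stores UTC timestamps with a trailing "Z"; normalize
    # to an offset so datetime.fromisoformat accepts it on older Pythons.
    expires = datetime.fromisoformat(meta["expires_at"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now < expires
```

A stale entry should trigger a conditional refetch (the stored `etag` exists for exactly that purpose) rather than an unconditional re-download.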
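The `@retry` decorator used in the Error Handling section is typically supplied by a third-party package; for reference, a minimal hand-rolled equivalent with the same `tries`/`delay`/`backoff` semantics could look like this:

```python
import functools
import time


def retry(tries: int = 3, delay: float = 2, backoff: float = 2):
    """Re-run the wrapped function up to `tries` times with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:  # broad on purpose for this sketch
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff  # e.g. 2 s, 4 s, 8 s, ...
        return wrapper
    return decorator
```

With `tries=3, delay=2, backoff=2` a flaky fetch is attempted three times, waiting 2 s and then 4 s between attempts before giving up.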