# data-collector

> Collect market data from external sources with ethical scraping practices, proper rate limiting, and full provenance tracking.

- Author: cgjen-box
- Repository: cjplanted/planted-market-model
- Version: 20251221143126
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/cjplanted/planted-market-model
- Web: https://mule.run/skillshub/@@cjplanted/planted-market-model~data-collector:20251221143126

---

---
name: data-collector
description: Ethical web scraping and API integration for market data collection. Triggers on tasks requiring: (1) fetching data from external sources (GFI, FAO, OECD, national statistics), (2) PDF parsing for report extraction, (3) API integrations with rate limiting, (4) data validation against schemas. Respects robots.txt, implements caching, and maintains provenance tracking.
---

# Data Collector Agent Skill

Collect market data from external sources with ethical scraping practices, proper rate limiting, and full provenance tracking.

## Core Principles

1. **Robots.txt First**: Always check robots.txt before any request
2. **Rate Limit**: At most 1 request per 2 seconds
3. **Cache Aggressively**: 24-hour TTL for static reports
4. **Provenance Always**: Track source, date, and page number for every data point

## Quick Reference

```python
from src.scrapers.ethical_scraper import EthicalScraper

scraper = EthicalScraper(
    user_agent="PlantedBot/1.0 (research@planted.ch)",
    rate_limit=0.5  # requests per second, i.e. one request every 2 seconds
)

# Check if allowed
if scraper.can_fetch(url):
    content = scraper.fetch(url, use_cache=True)
```

## Supported Sources

| Source | Type | Parser | Location |
|--------|------|--------|----------|
| GFI Reports | PDF | `gfi_parser.py` | `/src/scrapers/gfi_parser.py` |
| FAOSTAT | API | `faostat_client.py` | `/src/scrapers/faostat_client.py` |
| OECD | API | `oecd_client.py` | `/src/scrapers/oecd_client.py` |
| National Stats | Web | `national_scraper.py` | `/src/scrapers/national_scraper.py` |

## Data Output Schema

All collected data must conform to:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RawDataPoint:
    source: str                    # e.g., "gfi"
    source_url: str                # Original URL
    source_date: str               # Publication date
    extracted_at: datetime         # When we extracted
    geography: str                 # ISO 3166-1 alpha-2
    year: int
    metric: str                    # e.g., "alt_protein_retail_value"
    value: float
    unit: str                      # e.g., "USD_millions"
    confidence: float              # 0-1, how confident in extraction
    notes: Optional[str]
    page_reference: Optional[str]  # For PDFs
```

## PDF Extraction Workflow

```python
# 1. Download PDF
pdf_path = scraper.download_pdf(url, save_to="/data/raw/gfi/")

# 2. Extract text
from src.scrapers.pdf_extractor import PDFExtractor
extractor = PDFExtractor(pdf_path)

# 3. Find tables (most market data is in tables)
tables = extractor.extract_tables()

# 4. Find specific metrics
for table in tables:
    if "market size" in table.header.lower():
        data_points = parse_market_table(table)

# 5.
# Validate and save
for dp in data_points:
    if validate_data_point(dp):
        save_to_raw(dp)
```

## Error Handling

```python
# Retry with exponential backoff
@retry(tries=3, delay=2, backoff=2)
def fetch_with_retry(url):
    return scraper.fetch(url)

# Handle common errors (wrapped in a function so early returns are valid)
def collect(url):
    try:
        return fetch_with_retry(url)
    except RateLimitError:
        return wait_and_retry(url, wait_minutes=5)
    except RobotsTxtForbidden:
        log_warning(f"Blocked by robots.txt: {url}")
        return None
    except PDFParseError as e:
        log_error(f"PDF parse failed: {e}")
        add_to_manual_review(url)
        return None
```

## Caching Strategy

```
# Cache structure
/data/raw/.cache/
├── {source}/
│   ├── {url_hash}.json       # Cached response
│   └── {url_hash}.meta.json  # Cache metadata
```

Cache metadata:

```json
{
  "url": "https://...",
  "fetched_at": "2024-12-19T10:00:00Z",
  "expires_at": "2024-12-20T10:00:00Z",
  "etag": "...",
  "content_hash": "sha256:..."
}
```

## Source-Specific Notes

### GFI Reports

- Annual "State of the Industry" PDFs
- Data tables usually in a consistent format
- Investment data also available (useful for secondary analysis)
- Sometimes have regional breakdowns

### FAOSTAT

- REST API with good documentation
- Rate limit: ~100 requests/minute
- Data lag: ~2 years behind current year
- Use for meat consumption baseline

### OECD-FAO

- Agricultural outlook reports
- PDF + some structured data
- 10-year projections
- Useful for validation

## Validation Checks

```python
def validate_data_point(dp: RawDataPoint) -> bool:
    checks = [
        dp.value > 0,
        2010 <= dp.year <= 2030,
        dp.geography in VALID_COUNTRIES,
        dp.confidence >= 0.5,
        dp.unit in VALID_UNITS,
    ]
    return all(checks)
```

## Swarm Coordination

When the orchestrator spawns multiple data collectors:

```python
# Orchestrator divides work
sources = ["gfi", "faostat", "oecd", "national"]
for source in sources:
    orchestrator.spawn(
        agent="data-collector",
        task=f"collect-{source}",
        context={"source": source, "years": range(2015, 2025)},
    )

# Each agent reports back
def on_complete(results):
    save_to_interim(results)
    notify_orchestrator("collection_complete", source=self.source)
```

## Files in This Skill

- `scripts/init_scraper.py` - Initialize scraper with config
- `scripts/validate_raw_data.py` - Validate collected data
- `references/source_urls.md` - All data source URLs
- `references/parsing_rules.md` - Source-specific parsing rules
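## Appendix: Worked Sketches

The first two Core Principles (robots.txt first, minimum spacing between requests) can be sketched with only the standard library. `PoliteFetcher` below is a hypothetical stand-in for what `EthicalScraper` presumably does internally; it is not the project's actual implementation.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit


class PoliteFetcher:
    """Hypothetical sketch: robots.txt checks plus simple rate limiting."""

    def __init__(self, user_agent: str, rate_limit: float = 0.5):
        self.user_agent = user_agent
        self.min_interval = 1.0 / rate_limit  # 0.5 req/s -> 2 s between requests
        self._last_request = 0.0
        self._robots = {}  # cache one RobotFileParser per site root

    def can_fetch(self, url: str) -> bool:
        root = "{0.scheme}://{0.netloc}".format(urlsplit(url))
        if root not in self._robots:
            rp = urllib.robotparser.RobotFileParser(root + "/robots.txt")
            rp.read()  # network call; may raise if the host is unreachable
            self._robots[root] = rp
        return self._robots[root].can_fetch(self.user_agent, url)

    def throttle(self) -> None:
        # Sleep just long enough to keep at most `rate_limit` requests/second.
        elapsed = time.monotonic() - self._last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()
```

Calling `throttle()` before every `fetch` is what guarantees the 2-second spacing regardless of how fast the caller loops.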
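The `.meta.json` layout in the Caching Strategy section makes expiry checks a one-liner. `is_cache_fresh` is a hypothetical helper, assuming only the ISO-8601 `expires_at` field shown in the metadata example:

```python
import json
from datetime import datetime, timezone
from typing import Optional


def is_cache_fresh(meta_json: str, now: Optional[datetime] = None) -> bool:
    """Return True if the cached response has not passed its expires_at."""
    meta = json.loads(meta_json)
    # The metadata stores UTC timestamps with a trailing "Z"; normalize
    # to an offset so datetime.fromisoformat accepts it on older Pythons.
    expires = datetime.fromisoformat(meta["expires_at"].replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return now < expires
```

A stale entry should trigger a conditional refetch (the stored `etag` exists for exactly that purpose) rather than an unconditional re-download.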
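The `@retry` decorator used in the Error Handling section is typically supplied by a third-party package; for reference, a minimal hand-rolled equivalent with the same `tries`/`delay`/`backoff` semantics could look like this:

```python
import functools
import time


def retry(tries: int = 3, delay: float = 2, backoff: float = 2):
    """Re-run the wrapped function up to `tries` times with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:  # broad on purpose for this sketch
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(wait)
                    wait *= backoff  # e.g. 2 s, 4 s, 8 s, ...
        return wrapper
    return decorator
```

With `tries=3, delay=2, backoff=2` a flaky fetch is attempted three times, waiting 2 s and then 4 s between attempts before giving up.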