# scraper-development

> Create property listing scrapers with proxy support and error resilience

- Author: Justinsato
- Repository: Justinsato/brickston-ai
- Version: 20260203201127
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Justinsato/brickston-ai
- Web: https://mule.run/skillshub/@@Justinsato/brickston-ai~scraper-development:20260203201127

---

---
name: scraper-development
description: Create property listing scrapers with proxy support and error resilience
---

# Scraper Development Skill

## Overview

This skill guides you through creating new property scrapers for the brickston-ai competition analysis system.

## File Locations

- **Scrapers**: `apps/api/app/data/scrapers/`
- **Provider Base**: `apps/api/app/data/scrapers/providers/`
- **Orchestrator**: `apps/api/app/data/scrapers/scraper_orchestrator.py`
- **Factory**: `apps/api/app/data/scrapers/scraper_factory.py`
- **Config**: `apps/api/app/core/scraper_config.py`

## Creating a New Provider

### Step 1: Create Provider File

Location: `apps/api/app/data/scrapers/providers/_provider.py`

```python
from typing import List, Optional

import httpx
from httpx import ProxyError

from app.data.scrapers.models import ListingData
from app.data.scrapers.providers.base_provider import BaseProvider


class NewSourceProvider(BaseProvider):
    """Scraper for newsource.com listings."""

    SOURCE_NAME = "newsource"
    BASE_URL = "https://newsource.com"

    async def scrape(self, proxy_url: Optional[str] = None) -> List[ListingData]:
        """Scrape listings from the source."""
        listings = []

        try:
            async with httpx.AsyncClient(proxy=proxy_url, timeout=30.0) as client:
                response = await client.get(f"{self.BASE_URL}/api/listings")
                response.raise_for_status()
                data = response.json()

                for item in data.get("listings", []):
                    listings.append(self._parse_listing(item))

        except ProxyError as e:
            # Without a proxy there is nothing to fall back to, so re-raise
            # instead of recursing forever.
            if proxy_url is None:
                raise
            # Retry once without the proxy
            self.logger.warning(f"Proxy failed, retrying without: {e}")
            return await self.scrape(proxy_url=None)
        except Exception as e:
            self.logger.error(f"Scrape failed: {e}")
            raise

        return listings

    def _parse_listing(self, raw: dict) -> ListingData:
        """Parse raw listing data into standardized format."""
        return ListingData(
            source=self.SOURCE_NAME,
            external_id=raw.get("id"),
            address=raw.get("address"),
            city=raw.get("city"),
            state=raw.get("state"),
            zip_code=raw.get("zip"),
            price=raw.get("rent"),
            bedrooms=raw.get("beds"),
            bathrooms=raw.get("baths"),
            sqft=raw.get("sqft"),
            url=raw.get("url"),
        )
```

### Step 2: Register in Factory

Edit `apps/api/app/data/scrapers/scraper_factory.py`:

```python
from app.data.scrapers.providers.newsource_provider import NewSourceProvider

PROVIDERS = {
    # ... existing providers
    "newsource": NewSourceProvider,
}
```

### Step 3: Add to Orchestrator (Optional)

If the scraper should run in the nightly jobs, add it to the orchestrator config.

## Error Handling Patterns

### Proxy Error Recovery

Always implement a proxy fallback, and retry only when a proxy was actually in use:

```python
except ProxyError as e:
    if proxy_url is None:
        raise  # no proxy in use, so there is nothing to fall back to
    self.logger.warning(f"Proxy failed for {self.SOURCE_NAME}, retrying without proxy: {e}")
    return await self.scrape(proxy_url=None)
```

### Rate Limiting

```python
import asyncio
from typing import List


async def scrape_with_rate_limit(self, urls: List[str]):
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(1.0)  # 1 second delay between requests
        # ... scrape logic
```

### Retry Logic

```python
from tenacity import retry, stop_after_attempt, wait_exponential


# Up to 3 attempts, with exponential backoff bounded between 2 and 10 seconds
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(self, url: str):
    # ... fetch logic
    ...
```

## Testing

```bash
# Test single scraper
cd apps/api
python -m pytest tests/scrapers/test_newsource.py -v

# Test with scripts
python scripts/test_scrapers.py --provider newsource
```

## Data Model

Ensure listings match the `ListingData` schema:

- `source`: Provider name
- `external_id`: Unique ID from source
- `address`, `city`, `state`, `zip_code`: Location
- `price`: Monthly rent
- `bedrooms`, `bathrooms`, `sqft`: Unit specs
- `url`: Link to original listing

## Checklist

- [ ] Provider class created in `providers/`
- [ ] Registered in `scraper_factory.py`
- [ ] Proxy error handling implemented
- [ ] Rate limiting for respectful scraping
- [ ] Standardized `ListingData` output
- [ ] Unit tests written
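
### Example: Unit Test Sketch

The last checklist item can be approached without network access. The block below is a minimal sketch, not the project's actual test suite: `ListingData` here is a hypothetical dataclass stand-in whose fields are taken from the Data Model list, `FakeProxyError` stands in for `httpx.ProxyError`, and `parse_listing`, `scrape`, and `fake_fetch` are illustrative names that mirror the provider's `_parse_listing` and `scrape` methods. It checks two things: the raw-to-`ListingData` field mapping, and that a proxy failure triggers exactly one retry without the proxy.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Optional


class FakeProxyError(Exception):
    """Stand-in for httpx.ProxyError (assumption: keeps the sketch dependency-free)."""


@dataclass
class ListingData:
    """Hypothetical stand-in for app.data.scrapers.models.ListingData."""
    source: str
    external_id: Optional[str] = None
    zip_code: Optional[str] = None
    price: Optional[float] = None
    bedrooms: Optional[float] = None


def parse_listing(raw: dict, source: str = "newsource") -> ListingData:
    """Map raw keys (id, zip, rent, beds) onto the standardized field names."""
    return ListingData(
        source=source,
        external_id=raw.get("id"),
        zip_code=raw.get("zip"),
        price=raw.get("rent"),
        bedrooms=raw.get("beds"),
    )


Fetch = Callable[[Optional[str]], Awaitable[dict]]


async def scrape(fetch: Fetch, proxy_url: Optional[str] = None) -> List[ListingData]:
    """Fetch listings, retrying once without the proxy if the proxy fails."""
    try:
        data = await fetch(proxy_url)
    except FakeProxyError:
        if proxy_url is None:
            raise  # no proxy was in use, so there is nothing to fall back to
        data = await fetch(None)
    return [parse_listing(item) for item in data.get("listings", [])]


async def fake_fetch(proxy_url: Optional[str]) -> dict:
    """Fake transport: fails whenever a proxy is given, succeeds otherwise."""
    if proxy_url is not None:
        raise FakeProxyError("proxy unreachable")
    return {"listings": [{"id": "abc-1", "rent": 1850, "beds": 2, "zip": "78701"}]}


def test_parse_listing_maps_fields() -> None:
    listing = parse_listing({"id": "abc-1", "rent": 1850, "zip": "78701"})
    assert listing.source == "newsource"
    assert listing.external_id == "abc-1"
    assert listing.price == 1850
    assert listing.zip_code == "78701"
    assert listing.bedrooms is None  # missing keys become None


def test_scrape_falls_back_without_proxy() -> None:
    result = asyncio.run(scrape(fake_fetch, proxy_url="http://proxy.invalid:8080"))
    assert [item.external_id for item in result] == ["abc-1"]
```

Both functions follow pytest's `test_*` naming, so a file like this could be collected by the `python -m pytest` command from the Testing section. Injecting the fetch function is a design choice made for the sketch; in the real provider the same effect would more likely come from mocking the HTTP client.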