# scraper-development

> Create property listing scrapers with proxy support and error resilience

- Author: Justinsato
- Repository: Justinsato/brickston-ai
- Version: 20260203201127
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Justinsato/brickston-ai
- Web: https://mule.run/skillshub/@@Justinsato/brickston-ai~scraper-development:20260203201127

---

---
name: scraper-development
description: Create property listing scrapers with proxy support and error resilience
---

# Scraper Development Skill

## Overview
This skill guides you through creating new property scrapers for the brickston-ai competition analysis system.

## File Locations
- **Scrapers**: `apps/api/app/data/scrapers/`
- **Provider Base**: `apps/api/app/data/scrapers/providers/`
- **Orchestrator**: `apps/api/app/data/scrapers/scraper_orchestrator.py`
- **Factory**: `apps/api/app/data/scrapers/scraper_factory.py`
- **Config**: `apps/api/app/core/scraper_config.py`

## Creating a New Provider

### Step 1: Create Provider File
Location: `apps/api/app/data/scrapers/providers/<source>_provider.py`

```python
from typing import List, Optional
import httpx
from app.data.scrapers.providers.base_provider import BaseProvider
from app.data.scrapers.models import ListingData
from httpx import ProxyError

class NewSourceProvider(BaseProvider):
    """Scraper for newsource.com listings."""
    
    SOURCE_NAME = "newsource"
    BASE_URL = "https://newsource.com"
    
    async def scrape(self, proxy_url: Optional[str] = None) -> List[ListingData]:
        """Scrape listings from the source."""
        listings = []
        
        try:
            async with httpx.AsyncClient(proxy=proxy_url, timeout=30.0) as client:
                response = await client.get(f"{self.BASE_URL}/api/listings")
                response.raise_for_status()
                data = response.json()
                
                for item in data.get("listings", []):
                    listings.append(self._parse_listing(item))
                    
        except ProxyError as e:
            if proxy_url is None:
                # No proxy was in use, so there is nothing to fall back to.
                raise
            # Retry once without the proxy.
            self.logger.warning(f"Proxy failed, retrying without: {e}")
            return await self.scrape(proxy_url=None)
        except Exception as e:
            self.logger.error(f"Scrape failed: {e}")
            raise
            
        return listings
    
    def _parse_listing(self, raw: dict) -> ListingData:
        """Parse raw listing data into standardized format."""
        return ListingData(
            source=self.SOURCE_NAME,
            external_id=raw.get("id"),
            address=raw.get("address"),
            city=raw.get("city"),
            state=raw.get("state"),
            zip_code=raw.get("zip"),
            price=raw.get("rent"),
            bedrooms=raw.get("beds"),
            bathrooms=raw.get("baths"),
            sqft=raw.get("sqft"),
            url=raw.get("url"),
        )
```

### Step 2: Register in Factory
Edit `apps/api/app/data/scrapers/scraper_factory.py`:

```python
from app.data.scrapers.providers.newsource_provider import NewSourceProvider

PROVIDERS = {
    # ... existing providers
    "newsource": NewSourceProvider,
}
```
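
To illustrate why registration matters, here is a minimal sketch of how a factory lookup over a `PROVIDERS` mapping could work. The `get_provider` helper and the stand-in classes are hypothetical and exist only for this example; the real `scraper_factory.py` may expose a different interface.

```python
from typing import Dict, Type


class BaseProvider:
    """Stand-in for the project's BaseProvider (assumption for this sketch)."""
    SOURCE_NAME = "base"


class NewSourceProvider(BaseProvider):
    SOURCE_NAME = "newsource"


# Mirrors the shape of the PROVIDERS mapping in the factory.
PROVIDERS: Dict[str, Type[BaseProvider]] = {
    "newsource": NewSourceProvider,
}


def get_provider(name: str) -> BaseProvider:
    """Hypothetical helper: instantiate the provider registered under `name`."""
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        known = ", ".join(sorted(PROVIDERS))
        raise ValueError(f"Unknown provider {name!r}; registered: {known}") from None
    return provider_cls()
```

A provider that is missing from the mapping fails fast with a message listing the registered names, which makes a forgotten registration easy to spot.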

### Step 3: Add to Orchestrator (Optional)
If the scraper should run in the nightly jobs, add it to the orchestrator config.

## Error Handling Patterns

### Proxy Error Recovery
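Always implement a proxy fallback, and only retry when a proxy was actually in use:
```python
except ProxyError as e:
    if proxy_url is None:
        raise  # no proxy was in use, so there is nothing to fall back to
    self.logger.warning(f"Proxy failed for {self.SOURCE_NAME}, retrying without proxy: {e}")
    return await self.scrape(proxy_url=None)
```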
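
The fallback pattern can also be exercised in isolation. The sketch below is dependency-free: `ProxyError` is a local stand-in for `httpx.ProxyError`, and `scrape_with_fallback` and `_demo_fetch` are hypothetical names used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class ProxyError(Exception):
    """Stand-in for httpx.ProxyError so the sketch has no dependencies."""


async def scrape_with_fallback(
    fetch: Callable[[Optional[str]], Awaitable[List[dict]]],
    proxy_url: Optional[str],
) -> List[dict]:
    """Try via the proxy first; on ProxyError retry exactly once without it."""
    try:
        return await fetch(proxy_url)
    except ProxyError:
        if proxy_url is None:
            raise  # no proxy in use, nothing to fall back to
        return await fetch(None)


async def _demo_fetch(proxy_url: Optional[str]) -> List[dict]:
    # Hypothetical fetcher: fails whenever a proxy is configured.
    if proxy_url is not None:
        raise ProxyError("proxy unreachable")
    return [{"id": "1"}]


result = asyncio.run(scrape_with_fallback(_demo_fetch, "http://proxy.example:8080"))
```

Because the retry passes `None` and the guard re-raises when no proxy was configured, the fallback can run at most once per call.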

### Rate Limiting
```python
import asyncio
from typing import List

async def scrape_with_rate_limit(self, urls: List[str]):
    for i, url in enumerate(urls):
        if i > 0:
            await asyncio.sleep(1.0)  # 1 second delay between requests
        # ... scrape logic
```

### Retry Logic
```python
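import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

# Up to 3 attempts, waiting between 2 and 10 seconds with exponential backoff.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
async def fetch_with_retry(self, url: str) -> httpx.Response:
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url)
        response.raise_for_status()  # an HTTP error raises, which triggers a retry
        return response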
```
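
For readers who want to see the mechanics behind retrying with backoff, here is a dependency-free sketch of the general technique. It is not tenacity's implementation, and `retry_async`, its parameters, and `_flaky` are hypothetical names chosen for this example.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    func: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 10.0,
) -> T:
    """Call `func` up to `attempts` times, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return await func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(min(delay, max_delay))
            delay *= 2
    raise RuntimeError("attempts must be at least 1")


calls = {"count": 0}


async def _flaky() -> str:
    # Hypothetical operation that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


outcome = asyncio.run(retry_async(_flaky, base_delay=0.01))
```

In production code a library such as tenacity is usually the better choice, since it also handles jitter, selective exception matching, and logging hooks.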

## Testing
```bash
# Test single scraper
cd apps/api
python -m pytest tests/scrapers/test_newsource.py -v

# Test with scripts
python scripts/test_scrapers.py --provider newsource
```
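
As a starting point for `tests/scrapers/test_newsource.py`, a parsing test could be sketched as follows. To stay self-contained, this example uses a reduced stand-in `ListingData` and a free function `parse_listing` that mirrors the key mapping from Step 1; a real test would presumably import `NewSourceProvider` and call `_parse_listing` instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListingData:
    """Stand-in with a subset of fields; the real model lives in app.data.scrapers.models."""
    source: str
    external_id: Optional[str] = None
    price: Optional[float] = None
    zip_code: Optional[str] = None


def parse_listing(raw: dict) -> ListingData:
    # Mirrors the key mapping used by _parse_listing in Step 1.
    return ListingData(
        source="newsource",
        external_id=raw.get("id"),
        price=raw.get("rent"),
        zip_code=raw.get("zip"),
    )


def test_parse_listing_maps_source_keys():
    listing = parse_listing({"id": "abc-1", "rent": 1850, "zip": "78701"})
    assert listing.external_id == "abc-1"
    assert listing.price == 1850
    assert listing.zip_code == "78701"


def test_parse_listing_tolerates_missing_keys():
    listing = parse_listing({})
    assert listing.external_id is None
    assert listing.price is None
```

Testing the parser against plain dictionaries keeps the unit tests fast and independent of the live site.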

## Data Model
Ensure listings match the `ListingData` schema:
- `source`: Provider name
- `external_id`: Unique ID from source
- `address`, `city`, `state`, `zip_code`: Location
- `price`: Monthly rent
- `bedrooms`, `bathrooms`, `sqft`: Unit specs
- `url`: Link to original listing
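
The field list above could correspond to a model shaped roughly like the sketch below. The field names come from the list; the types, the defaults, and the choice of a dataclass are assumptions, so check `app.data.scrapers.models` for the actual definition.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ListingData:
    """Illustrative only: field names follow the schema list, types are assumed."""
    source: str
    external_id: Optional[str] = None
    address: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    zip_code: Optional[str] = None
    price: Optional[float] = None  # monthly rent
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    sqft: Optional[int] = None
    url: Optional[str] = None


listing = ListingData(source="newsource", external_id="abc-1", city="Austin", price=1850.0)

# List the fields the scraper did not populate, e.g. for a data-quality log.
missing = [name for name, value in asdict(listing).items() if value is None]
```

Checking for unpopulated fields like this is one simple way to confirm that a new provider's key mapping covers the whole schema.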

## Checklist
- [ ] Provider class created in `providers/`
- [ ] Registered in `scraper_factory.py`
- [ ] Proxy error handling implemented
- [ ] Rate limiting for respectful scraping
- [ ] Standardized `ListingData` output
- [ ] Unit tests written