# query-predictor

> Guide for implementing the SQL Execution Time Prediction Service (predict_time).

- Author: sttking
- Repository: sttking/educate_page
- Version: 20260130165115
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/sttking/educate_page
- Web: https://mule.run/skillshub/@@sttking/educate_page~query-predictor:20260130165115

---

---
name: query-predictor
description: Guide for implementing the SQL Execution Time Prediction Service (predict_time).
---

# Query Predictor Development Skill

Use this skill when developing, debugging, or extending the `service/query/predict_time` module.
This module predicts the execution time of SQL queries by simulating a query optimizer's costing model.

## Critical Rules

### 1. Vendor Setup for SQLGlot
**Important**: When using `sqlglot` (e.g., for parsing SQL into AST or converting AST to IR), you **MUST** import `vendor_setup` at the very top of the file, before any other imports.

```python
import vendor_setup  # MUST be the first import
from sqlglot import exp, parse_one
...
```

### 2. Server Management
After making changes, restart the server to test:

```bash
./stop.sh && ./run.sh
```

Run tests against the running server:

```bash
./test/query/predict_time/predict_time_test.sh http://localhost:8099
```

## Architecture & Implementation

### 1. Pipeline Overview
The prediction pipeline (`predict_time_service.py`) follows these steps:
1.  **SQL Parsing**: Use `sqlglot` to parse SQL into AST.
2.  **IR Generation**: Convert AST to internal Logical Plan (IR) nodes (Scan, Filter, Join, Agg).
3.  **Cardinality Estimation**: Estimate rows for each node using stats or default selectivity.
4.  **Physical Plan Generation**: Generate physical plans (SeqScan vs IndexScan, HashJoin vs NestedLoop).
5.  **Costing**: Calculate IO/CPU cost for each node in the plan.
6.  **Time Prediction**: Convert abstract cost to seconds using machine profiles.
7.  **Fallback**: If parsing fails, use heuristic-based estimation with lower confidence.

### 2. File Structure

```
service/query/predict_time/
├── __init__.py
├── cost_model.py              # Cost calculation (IO + CPU → time)
├── risk_analyzer.py           # Risk factor identification
├── ir/
│   ├── __init__.py
│   └── logical_nodes.py       # LogicalNode, ScanNode, JoinNode, etc.
├── parser/
│   ├── __init__.py
│   ├── parser_utils.py        # SQLGlot parsing utilities
│   └── ir_generator.py        # AST → IR transformation
├── estimator/
│   ├── __init__.py
│   └── cardinality_estimator.py  # Selectivity/cardinality estimation
├── planner/
│   ├── __init__.py
│   └── plan_generator.py      # Physical plan generation
└── config/
    ├── __init__.py
    └── machine_profile.py     # Hardware profiles for calibration
```

### 3. Component Responsibility

| File | Responsibility | Status |
|------|----------------|--------|
| `predict_time_service.py` | Orchestrate pipeline, fallback logic | ✅ Complete |
| `cost_model.py` | Calculate IO/CPU cost, convert to time | ✅ Complete |
| `risk_analyzer.py` | Identify reliability risks | ✅ Complete |
| `ir/logical_nodes.py` | IR node dataclasses | ✅ Complete |
| `parser/parser_utils.py` | SQLGlot parsing utilities | ✅ Complete |
| `parser/ir_generator.py` | AST → IR transformation | ✅ Complete |
| `estimator/cardinality_estimator.py` | Cardinality estimation | ✅ Complete |
| `planner/plan_generator.py` | Physical plan generation | ✅ Complete |
| `config/machine_profile.py` | Machine profiles | ✅ Complete |

## Key Design Decisions

### Cardinality Estimation Defaults
When statistics are not available, use these default selectivities:

```python
DEFAULT_SELECTIVITY = {
    "eq": 0.01,           # col = value (1% match)
    "range": 0.33,        # col > value
    "like_prefix": 0.25,  # col LIKE 'prefix%'
    "in": 0.02,           # col IN (...)
    "is_null": 0.01,      # col IS NULL
}
DEFAULT_TABLE_ROWS = 10_000
```

### Cost → Time Conversion
```
time = (io_cost / iops) + (cpu_cost / cpu_speed)
```

Adjusted for cache hits and partial overlap between IO and CPU operations.

### Graceful Degradation
- **Parse failure** → Use heuristic fallback, confidence = 0.3-0.5
- **Missing stats** → Use defaults, reduce confidence by 50%
- **Pipeline error** → Log warning, return fallback result

## Testing

### Running Tests

```bash
# Start server
./run.sh

# Run integration tests
./test/query/predict_time/predict_time_test.sh http://localhost:8099
```

### Test Scenarios
1. Simple SELECT with stats → High confidence (>0.7)
2. Complex JOIN (3+ tables) → Medium confidence (0.5-0.7)
3. No stats provided → Low confidence (<0.5)
4. Invalid SQL → Graceful degradation, uses fallback
5. Large table scan → Risk factors include warning

## Code Style & Conventions
- **Typing**: Use strict type hints (`List`, `Optional`, `Dict`).
- **DTOs**: Use `dto/predict_time_dto.py` for API data exchange.
- **Error Handling**: Fail gracefully with low confidence score rather than crashing.
- **Logging**: Use emoji prefixes (✅ 📊 🔍 ⚠️ ❌) for log categories.