# qdrant-collection-setup

> Create and manage Qdrant collections for textbook RAG pipelines. Use when setting up vector databases, changing embedding models, adding payload indexes, debugging retrieval issues, or managing dev/staging/prod collections. Handles vector configuration, payload schemas, indexes, safe lifecycle operations, and validation.

- Author: Abdullah khalid
- Repository: abduIIahKhaIid/physical-ai-robotics-textbook
- Version: 20260131183041
- Stars: 2
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/abduIIahKhaIid/physical-ai-robotics-textbook
- Web: https://mule.run/skillshub/@@abduIIahKhaIid/physical-ai-robotics-textbook~qdrant-collection-setup:20260131183041

---

---
name: qdrant-collection-setup
description: Create and manage Qdrant collections for textbook RAG pipelines. Use when setting up vector databases, changing embedding models, adding payload indexes, debugging retrieval issues, or managing dev/staging/prod collections. Handles vector configuration, payload schemas, indexes, safe lifecycle operations, and validation.
---

# Qdrant Collection Setup

Provision and manage Qdrant vector collections for RAG-powered textbook chatbots.

## When to Use This Skill

Use this skill when:

- **Initial setup**: Creating Qdrant collections for the first time
- **Model changes**: Switching embedding models (dimension changes)
- **Schema updates**: Adding new payload fields or indexes
- **Environment management**: Setting up dev/staging/prod collections
- **Troubleshooting**: Debugging retrieval latency or filter correctness
- **Migrations**: Moving between embedding models or Qdrant versions

## Quick Start

### Basic Collection Creation

```bash
# Local development (small embedding model)
python scripts/setup_collection.py \
    --collection textbook_chunks_dev \
    --vector-size 384 \
    --recreate

# Production (OpenAI embeddings)
python scripts/setup_collection.py \
    --url https://xyz.cloud.qdrant.io:6333 \
    --api-key $QDRANT_API_KEY \
    --collection textbook_chunks \
    --vector-size 1536 \
    --distance Cosine
```

### Common Vector Dimensions

- `text-embedding-ada-002`: 1536
- `text-embedding-3-small`: 1536  
- `text-embedding-3-large`: 3072
- `all-MiniLM-L6-v2`: 384
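
A mismatch between these dimensions and `--vector-size` only surfaces at insert time, so it can help to check vectors in your own pipeline first. A minimal sketch using the dimensions listed above; `EMBEDDING_DIMS` and `check_vector` are hypothetical names, not part of this skill's scripts:

```python
# Dimensions copied from the list above; the names here are illustrative only.
EMBEDDING_DIMS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "all-MiniLM-L6-v2": 384,
}


def check_vector(model: str, vector: list[float]) -> None:
    """Raise ValueError if the vector length differs from the model's dimension."""
    expected = EMBEDDING_DIMS[model]
    if len(vector) != expected:
        raise ValueError(
            f"{model} should produce {expected}-dimensional vectors, got {len(vector)}"
        )


check_vector("all-MiniLM-L6-v2", [0.0] * 384)  # passes silently
```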

## Core Workflows

### 1. First-Time Setup

For new projects starting from scratch:

```bash
# Step 1: Choose embedding model
# - Development: all-MiniLM-L6-v2 (384D, fast)
# - Production: text-embedding-3-small (1536D) or text-embedding-3-large (3072D, best quality)

# Step 2: Create collection (--vector-size must match the chosen model; 1536 shown here)
python scripts/setup_collection.py \
    --collection textbook_chunks \
    --vector-size 1536 \
    --distance Cosine

# Step 3: Verify setup
python scripts/setup_collection.py \
    --collection textbook_chunks \
    --vector-size 1536 \
    --validate-only

# Collection is ready for embedding insertion
```

### 2. Changing Embedding Models

When switching models (e.g., from ada-002 to text-embedding-3-large):

```bash
# Step 1: Backup current collection
python scripts/migrate_collection.py backup \
    --collection textbook_chunks \
    --backup-name textbook_chunks_backup

# Step 2: Create new collection with new dimensions
python scripts/setup_collection.py \
    --collection textbook_chunks_v2 \
    --vector-size 3072 \
    --recreate

# Step 3: Re-embed content with new model and insert
# (This happens in your embedding pipeline, not this skill)

# Step 4: Validate new collection
python scripts/setup_collection.py \
    --collection textbook_chunks_v2 \
    --vector-size 3072 \
    --validate-only
```

### 3. Adding Payload Indexes

For faster filtered queries, create indexes on frequently filtered fields:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="...", api_key="...")

# Add index for chapter filtering
client.create_payload_index(
    collection_name="textbook_chunks",
    field_name="chapter",
    field_schema=models.PayloadSchemaType.KEYWORD
)

# Add index for page range queries
client.create_payload_index(
    collection_name="textbook_chunks",
    field_name="page",
    field_schema=models.PayloadSchemaType.INTEGER
)
```

**Recommended indexes** (created automatically by the setup script):
- `chapter`: keyword (chapter filtering)
- `section`: keyword (section filtering)  
- `page`: integer (page range queries)
- `type`: keyword (content type filtering)

### 4. Environment Separation

Manage dev/staging/prod collections:

```bash
# Development (local)
python scripts/setup_collection.py \
    --url http://localhost:6333 \
    --collection textbook_chunks_dev \
    --vector-size 384

# Staging (cloud, matches prod config)
python scripts/setup_collection.py \
    --url https://staging.cloud.qdrant.io:6333 \
    --api-key $STAGING_KEY \
    --collection textbook_chunks_staging \
    --vector-size 1536

# Production (cloud)
python scripts/setup_collection.py \
    --url https://prod.cloud.qdrant.io:6333 \
    --api-key $PROD_KEY \
    --collection textbook_chunks \
    --vector-size 1536 \
    --skip-test  # No test data in prod
```

See `references/environment_config.md` for detailed environment management.
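
On the application side, the three environments above can be captured in one lookup so that code never hard-codes a URL or collection name. A minimal sketch, assuming a hypothetical `APP_ENV` variable selects the environment; the URLs, collection names, and key variable names mirror the commands above:

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QdrantEnv:
    url: str
    collection: str
    vector_size: int
    api_key_var: Optional[str]  # name of the env var holding the API key, if any


# Values mirror the commands above; APP_ENV is a hypothetical variable name.
ENVIRONMENTS = {
    "dev": QdrantEnv("http://localhost:6333", "textbook_chunks_dev", 384, None),
    "staging": QdrantEnv(
        "https://staging.cloud.qdrant.io:6333", "textbook_chunks_staging", 1536, "STAGING_KEY"
    ),
    "prod": QdrantEnv(
        "https://prod.cloud.qdrant.io:6333", "textbook_chunks", 1536, "PROD_KEY"
    ),
}


def current_env(name: Optional[str] = None) -> QdrantEnv:
    """Look up settings by name, falling back to APP_ENV and then 'dev'."""
    name = name or os.environ.get("APP_ENV", "dev")
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment {name!r}")
    return ENVIRONMENTS[name]
```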

## Payload Schema Design

### Standard Schema

Every embedded chunk should include:

```python
payload = {
    # Required fields
    "text": "The actual chunk text...",
    "chapter": "chapter-1",
    "section": "1.1", 
    "page": 42,
    "chunk_id": "unique-id",
    "type": "text",  # or "code", "heading", "list", "table"
    
    # Optional but recommended
    "title": "Section heading",
    "module": "module-1-ros",
    "week": 1,
    "has_code": True,
    "keywords": ["robotics", "ROS2"],
}
```

See `references/payload_schema.md` for complete schema reference and filtering examples.
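
Checking payloads against the required fields before upserting catches schema drift early. The following is a sketch under the assumption that the required fields and content types are exactly those shown in the standard schema above; `validate_payload` is a hypothetical helper, not part of this skill's scripts:

```python
# Required fields and their expected Python types, taken from the schema above.
REQUIRED_FIELDS = {
    "text": str,
    "chapter": str,
    "section": str,
    "page": int,
    "chunk_id": str,
    "type": str,
}
CONTENT_TYPES = {"text", "code", "heading", "list", "table"}


def validate_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing required field: {field}")
            continue
        value = payload[field]
        # bool is a subclass of int, so reject it explicitly (matters for "page").
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"{field} must be {expected.__name__}")
    content_type = payload.get("type")
    if isinstance(content_type, str) and content_type not in CONTENT_TYPES:
        problems.append(f"unknown type: {content_type}")
    return problems
```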

## Safety Checks

### Pre-Creation Validation

The setup script automatically runs these checks and setup steps:

1. **Vector dimension**: Confirms the collection's size matches `--vector-size`, which must equal your embedding model's dimension
2. **Collection existence**: Prevents accidental overwrites (unless `--recreate`)
3. **Connection**: Verifies Qdrant is reachable
4. **Indexes**: Creates recommended payload indexes
5. **Test insertion**: Inserts a test vector and searches for it (skipped with `--skip-test`)

### Collection Health Check

```bash
# Validate existing collection
python scripts/setup_collection.py \
    --collection textbook_chunks \
    --vector-size 1536 \
    --validate-only

# Output shows:
# ✓ Collection exists
#   Vector size: 1536
#   Distance: COSINE  
#   Points count: 45231
# ✅ Validation passed
```

## Troubleshooting

### Vector Dimension Mismatch

**Error**: `vector dimension mismatch, expected 1536 got 384`

**Solution**: 
```bash
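# Check collection config (--vector-size is required; pass the dimension you expect)
python scripts/setup_collection.py \
    --collection textbook_chunks \
    --vector-size 1536 \
    --validate-only

# If mismatch, migrate to a new collection sized for your embedding model
# (set --vector-size to the model's output dimension)
python scripts/migrate_collection.py migrate \
    --source textbook_chunks \
    --target textbook_chunks_v2 \
    --vector-size 1536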
```

### Slow Filtered Queries

**Symptom**: Queries with filters take >1s

**Solution**: Add payload indexes
```python
# Index the filtered field
client.create_payload_index(
    collection_name="textbook_chunks",
    field_name="chapter",  # Field being filtered
    field_schema=models.PayloadSchemaType.KEYWORD
)
```
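
To confirm that an index actually helped, it is worth measuring the same filtered query before and after creating it. A small timing sketch using only the standard library; `median_latency` and the `run_query` callable are hypothetical, and `run_query` would wrap whatever filtered search your application issues:

```python
import time
from statistics import median


def median_latency(run_query, repeats: int = 5) -> float:
    """Call run_query() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return median(durations)
```

The median is used rather than the mean so that a single slow first call (for example, connection setup) does not dominate the result.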

### Collection Doesn't Exist

**Error**: `Collection textbook_chunks not found`

**Solution**:
```bash
# Create the collection
python scripts/setup_collection.py \
    --collection textbook_chunks \
    --vector-size 1536
```

### API Key Authentication Failed

**Error**: `Unauthorized` or `403 Forbidden`

**Solution**:
```bash
# Confirm the API key is set in this shell
echo "$QDRANT_API_KEY"

# Test connection (--vector-size is required even with --validate-only)
python scripts/setup_collection.py \
    --url https://xyz.cloud.qdrant.io:6333 \
    --api-key "$QDRANT_API_KEY" \
    --collection textbook_chunks \
    --vector-size 1536 \
```

## Advanced Usage

### Custom Index Configuration

```python
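# Index a custom payload field (keyword index on "difficulty").
# Assumes `client` and `models` from the payload index example earlier.
client.create_payload_index(
    collection_name="textbook_chunks",
    field_name="difficulty",
    field_schema=models.PayloadSchemaType.KEYWORD
)

# Filter on the indexed field at query time.
# `embedding` is the query vector produced by your embedding model.
# Note: recent qdrant-client releases deprecate `search` in favor of `query_points`.
results = client.search(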
    collection_name="textbook_chunks",
    query_vector=embedding,
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="difficulty",
                match=models.MatchValue(value="beginner")
            )
        ]
    )
)
```

### Collection Versioning

For zero-downtime migrations:

```bash
# Create v2 with new schema
python scripts/setup_collection.py \
    --collection textbook_chunks_v2 \
    --vector-size 3072

# Load data into v2
# (Run embedding pipeline)

# Validate v2 works
python scripts/setup_collection.py \
    --collection textbook_chunks_v2 \
    --vector-size 3072 \
    --validate-only

# Switch application to use v2
# Update config: COLLECTION_NAME=textbook_chunks_v2

# After validation, remove v1
# client.delete_collection("textbook_chunks")
```
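
The switch-over step works best when the application reads the collection name from configuration instead of hard-coding it. A minimal sketch, assuming the `COLLECTION_NAME` variable from the comments above; `active_collection` is a hypothetical helper:

```python
import os

DEFAULT_COLLECTION = "textbook_chunks"


def active_collection(environ=os.environ) -> str:
    """Return the collection the application should query.

    Falls back to the default when COLLECTION_NAME is unset or empty, so
    rolling back to v1 only requires removing the variable.
    """
    return environ.get("COLLECTION_NAME") or DEFAULT_COLLECTION
```

With this in place, moving traffic to v2 is a configuration change (`COLLECTION_NAME=textbook_chunks_v2`) rather than a code change.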

## Reference Documentation

- **Payload Schema**: `references/payload_schema.md` - Complete payload field reference, filtering examples, and upsert strategies
- **Environment Config**: `references/environment_config.md` - Dev/staging/prod setup, Qdrant Cloud configuration, backup procedures

## Script Reference

### setup_collection.py

Main collection provisioning script.

**Required arguments:**
- `--vector-size`: Vector dimension (must match embedding model)

**Optional arguments:**
- `--url`: Qdrant server URL (default: http://localhost:6333)
- `--api-key`: API key for cloud deployments
- `--collection`: Collection name (default: textbook_chunks)
- `--distance`: Distance metric (Cosine/Euclidean/Dot, default: Cosine)
- `--recreate`: Delete and recreate if exists
- `--validate-only`: Only validate, don't create
- `--skip-test`: Skip test vector insertion

### migrate_collection.py

Collection migration and backup.

**Commands:**

```bash
# Backup collection
python scripts/migrate_collection.py backup \
    --collection textbook_chunks \
    --backup-name textbook_chunks_backup

# Migrate to new schema
python scripts/migrate_collection.py migrate \
    --source textbook_chunks \
    --target textbook_chunks_v2 \
    --vector-size 3072
```

## Best Practices

1. **Always validate** after creation with `--validate-only`
2. **Use environment suffixes** for non-production collection names (`_dev`, `_staging`); production uses the unsuffixed name
3. **Backup before migrations** using `migrate_collection.py backup`
4. **Match vector dimensions** to your embedding model exactly
5. **Index filtered fields** for query performance
6. **Test in dev first** before applying to production
7. **Use deterministic IDs** to prevent duplicate points
8. **Monitor collection health** with regular validation checks
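
For practice 7, one common way to get deterministic IDs is to derive a UUID from the chunk's own identifier, so re-running the embedding pipeline overwrites points instead of duplicating them. Qdrant accepts UUID strings (and unsigned integers) as point IDs. A sketch using the standard library; `POINT_NAMESPACE` and `point_id` are hypothetical names, and the namespace string is an arbitrary choice that must stay fixed once chosen:

```python
import uuid

# Arbitrary but fixed namespace; changing it changes every derived ID.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "textbook_chunks")


def point_id(chunk_id: str) -> str:
    """Derive a stable UUID string from the payload's chunk_id."""
    return str(uuid.uuid5(POINT_NAMESPACE, chunk_id))
```

The same `chunk_id` always maps to the same point ID, so an upsert of a re-embedded chunk replaces the existing point.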