# data-layer

> Working with OpenBench data layer - vector stores, chunking, embeddings, and RAG patterns. Use when implementing PineconeStore, chunking documents, generating embeddings, or building RAG workflows.

- Author: bejono17
- Repository: ai-kitchen-inc/openbench
- Version: 20260206203508
- Last Updated: 2026-02-07
- Source: https://github.com/ai-kitchen-inc/openbench
- Web: https://mule.run/skillshub/@@ai-kitchen-inc/openbench~data-layer:20260206203508

---

# Data Layer

The OpenBench data layer covers document chunking, embedding generation, vector stores, and RAG patterns.

## Chunking

Split documents into chunks for vector indexing:

```python
from openbench.data.stores import ChunkingConfig, chunk_text, chunk_raw_data, Chunk

# Configure chunking
config = ChunkingConfig(
    chunk_size=1000,      # Max chars per chunk
    chunk_overlap=200,    # Overlap between chunks
    separators=["\n\n", "\n", ". ", ", ", " "]  # Split priority
)

# Chunk plain text (`text` is any str you have already loaded)
chunks = chunk_text(text, config)

# Chunk RawData (preserves metadata)
from openbench.data.sources import PDFSource
raw_data = PDFSource("doc.pdf").extract()
chunks = chunk_raw_data(raw_data, config)  # Returns List[Chunk]
```
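To make `chunk_size` and `chunk_overlap` concrete, here is a minimal standalone sketch of fixed-size character chunking with overlap. It is not OpenBench's implementation: `sliding_chunks` is a hypothetical helper, and it ignores `separators` entirely, whereas `chunk_text` presumably uses them to prefer natural boundaries.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not part of OpenBench.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

With a 10-character window and 4 characters of overlap, each chunk starts 6 characters after the previous one, so the tail of one chunk reappears at the head of the next.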

## PineconeStore

Vector store with semantic search:

```python
from openbench.data.stores import PineconeStore

# Initialize
store = PineconeStore(
    index_name="my-index",
    namespace="documents",
    embedding_model="text-embedding-3-small",  # OpenAI
    dimension=1536,  # Auto-detected if not specified
)

# Index chunks
store.index_chunks(chunks)

# Semantic search
results = store.search(
    query="What is the revenue?",
    top_k=5,
    filter={"source_type": "pdf"}
)

# Access results
for result in results:
    print(f"Score: {result.score}")
    print(f"Content: {result.content}")
    print(f"Metadata: {result.metadata}")
```
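Conceptually, a semantic search ranks stored vectors by similarity to the query vector and applies a metadata filter. The sketch below shows that idea in pure Python with toy 2-D vectors and exact-match filtering. It is an illustration only: `cosine` and `toy_search` are hypothetical names, and how `PineconeStore` scores and filters internally is not described here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_search(query_vec, items, top_k=5, metadata_filter=None):
    """Rank items by cosine similarity after an exact-match metadata filter.

    items: list of dicts with "vector", "content", and "metadata" keys.
    Returns up to top_k (score, content) pairs, best match first.
    """
    wanted = metadata_filter or {}
    candidates = [
        item for item in items
        if all(item["metadata"].get(key) == value for key, value in wanted.items())
    ]
    scored = [(cosine(query_vec, item["vector"]), item["content"]) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: vectors are made up, not real embeddings.
items = [
    {"vector": [1.0, 0.0], "content": "Revenue section", "metadata": {"source_type": "pdf"}},
    {"vector": [0.0, 1.0], "content": "Team section", "metadata": {"source_type": "pdf"}},
    {"vector": [1.0, 0.1], "content": "Revenue page", "metadata": {"source_type": "web"}},
]
hits = toy_search([1.0, 0.0], items, top_k=2, metadata_filter={"source_type": "pdf"})
```

The `web` item is dropped by the filter before scoring, so only the two `pdf` items are ranked.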

## Exception Handling

```python
from openbench.data.exceptions import (
    DataLayerError,      # Base exception
    SourceError,         # Data source errors
    ExtractionError,     # Extraction failed
    ValidationError,     # Validation failed
    FileNotFoundError,   # File not found (note: this import shadows Python's built-in FileNotFoundError)
    UnsupportedFormatError,  # Format not supported
)

from openbench.data.stores import (
    StoreError,          # Base store error
    IndexNotFoundError,  # Index doesn't exist
    StoreConnectionError,  # Connection failed
    DimensionMismatchError,  # Vector dimension mismatch
    QuotaExceededError,  # API quota exceeded
    EmbeddingError,      # Embedding generation failed
    ItemNotFoundError,   # Item not in store
    InvalidQueryError,   # Query format invalid
)

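# Usage (`store` is the PineconeStore created in the previous section)
import logging

logger = logging.getLogger(__name__)

try:
    results = store.search(query="What is the revenue?", top_k=5)
except IndexNotFoundError:
    store.create_index()  # Create the missing index
except EmbeddingError as e:
    logger.error("Embedding failed: %s", e)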
```
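Errors such as `QuotaExceededError` or `StoreConnectionError` are often transient, so one common approach is to retry with exponential backoff. The following is a generic sketch under that assumption: `with_retries` and `TransientStoreError` are hypothetical names used so the snippet runs without OpenBench installed, and whether OpenBench already retries internally is not stated here.

```python
import time


class TransientStoreError(Exception):
    """Stand-in for a retryable error such as QuotaExceededError."""


def with_retries(operation, retry_on, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on retry_on with exponential backoff.

    Re-raises the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, then 1s, then 2s, ...


calls = {"count": 0}


def flaky_search():
    """Fails twice, then succeeds, to exercise the retry path."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientStoreError("quota exceeded")
    return ["result"]


delays = []  # record the delays instead of actually sleeping
outcome = with_retries(flaky_search, TransientStoreError, sleep=delays.append)
```

With the real exceptions imported, the same helper could presumably wrap a store call, for example `with_retries(lambda: store.search(query="revenue"), (QuotaExceededError, StoreConnectionError))`.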

## RAG Pattern

Retrieval-Augmented Generation workflow:

```python
from openbench.data.sources import PDFSource
from openbench.data.stores import PineconeStore, ChunkingConfig, chunk_raw_data

# 1. Extract and chunk
source = PDFSource("documents/report.pdf")
raw_data = source.extract()
chunks = chunk_raw_data(raw_data, ChunkingConfig(chunk_size=500))

# 2. Index
store = PineconeStore(index_name="knowledge", namespace="reports")
store.index_chunks(chunks)

# 3. Retrieve
results = store.search(query="revenue 2024", top_k=5)

# 4. Build context
context = "\n\n".join([r.content for r in results])
```
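The workflow above stops at the context string. One way to finish the "generation" half is to combine that context with the user's question in a prompt while keeping the context within a size budget. The sketch below assumes a simple character budget and a plain instruction template; `build_prompt` and `max_context_chars` are hypothetical, and the passages are placeholder strings rather than real search results.

```python
def build_prompt(question: str, passages: list[str], max_context_chars: int = 2000) -> str:
    """Assemble a grounded prompt from retrieved passages.

    Passages are added in rank order until the character budget is spent.
    """
    selected = []
    used = 0
    for passage in passages:
        if used + len(passage) > max_context_chars:
            break  # keep whole passages only; stop at the first one that does not fit
        selected.append(passage)
        used += len(passage)

    context = "\n\n".join(selected)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder passages; in the workflow above these would be [r.content for r in results].
passages = ["First retrieved passage.", "Second retrieved passage.", "x" * 5000]
prompt = build_prompt("What was revenue in 2024?", passages)
```

Stopping at the first passage that does not fit keeps the highest-ranked results intact instead of truncating one mid-sentence.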

## EmbeddingMixin

Add embedding capabilities to custom stores:

```python
from openbench.data.stores.base import EmbeddingMixin

class MyStore(EmbeddingMixin):
    def __init__(self, embedding_model: str = "text-embedding-3-small"):
        self._embedding_model = embedding_model
        self._dimension = None  # Auto-detect

    def index(self, text: str):
        vector = self._embed(text)  # From mixin
        # Store vector...

    def index_batch(self, texts: list):
        vectors = self._embed_batch(texts, batch_size=100)
        # Store vectors...
```
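The `batch_size=100` argument suggests that `_embed_batch` sends texts to the embedding model in groups rather than one call per text. A minimal sketch of that batching idea follows. It does not reproduce the mixin: `batched`, `embed_in_batches`, and `fake_embed` are hypothetical, and `fake_embed` stands in for a real embedding call.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))  # one call per batch, not per text
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Toy embedding: a one-dimensional vector holding the text length."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
```

Five texts with a batch size of 2 result in three calls to `embed_fn`, and the output order matches the input order.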

## Best Practices

1. **Match chunk size to the task** - 500-1000 chars for Q&A, larger chunks for summarization
2. **Use namespaces** - Keep separate document collections in separate namespaces
3. **Include metadata** - Store source, timestamp, and page number so searches can filter on them
4. **Handle errors** - Wrap store operations in try/except and catch specific exceptions such as `IndexNotFoundError`
5. **Batch operations** - Use batch methods for large datasets instead of indexing items one at a time

For examples, see `examples/workflows/research/hybrid_research_agent.py` in the repository.