# knowledge-base-builder

> This skill should be used when reading files (Markdown, Text, PDF) and creating or updating a knowledge base with semantic indexing using LanceDB for vector search capabilities. Supports hybrid storage with markdown files for human readability and LanceDB for efficient semantic queries.

- Author: netzkontrast
- Repository: netzkontrast/coherence-protocoll
- Version: 20251202121052
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/netzkontrast/coherence-protocoll
- Web: https://mule.run/skillshub/@@netzkontrast/coherence-protocoll~knowledge-base-builder:20251202121052

---

# Knowledge Base Builder Skill

## Purpose

This skill extracts content from multiple file formats (Markdown, Text, PDF) and builds a structured, searchable knowledge base. It combines:

- **Markdown-backed storage** for human-readable, version-controllable entries
- **LanceDB vector database** for semantic search capabilities
- **Metadata tracking** for file management and deduplication
- **Flexible categorization** for organizing knowledge across domains

## When to Use This Skill

Use this skill when:

- Extracting knowledge from documentation files into a searchable system
- Building a project-specific knowledge base from multiple sources
- Creating semantic search capabilities over structured content
- Ingesting technical documentation (code docs, guides, manuals)
- Maintaining a versioned, git-friendly knowledge repository
- Searching content by meaning, not just keywords

## How to Use This Skill

### 1. Understanding the Approach

The skill uses a hybrid architecture:

1. **Source files** → Parsed into structured entries
2. **Markdown storage** → Human-readable `.md` files in `.knowledge_base/entries/`
3. **Vector indexing** → LanceDB stores semantic embeddings for search
4. **Metadata tracking** → A JSON file tracks all entries and file hashes

See `references/kb_schema.md` for detailed schema documentation.

### 2. Core Capabilities

#### Ingest Files

To add content to the knowledge base, use `scripts/kb_manager.py`. The first argument after `ingest` is the file path and the second is the category:

```bash
# Ingest markdown file (split by headers)
python scripts/kb_manager.py ingest /path/to/document.md docs

# Ingest text file (split into chunks)
python scripts/kb_manager.py ingest /path/to/notes.txt general

# Ingest PDF (split by pages)
python scripts/kb_manager.py ingest /path/to/manual.pdf guides
```

**File type handling:**

- **Markdown**: Automatically splits on headers (`#`, `##`, etc.), creating one entry per section
- **Text**: Chunks into 500-character segments with sequential ordering
- **PDF**: Extracts one entry per page with page number tracking

#### Search the Knowledge Base

The command is the same in both cases; which search mode runs depends on whether LanceDB is installed:

```bash
# Semantic search (if LanceDB available)
python scripts/kb_manager.py search "authentication patterns"

# Keyword fallback search (without vector DB)
python scripts/kb_manager.py search "database"
```

#### List Entries

```bash
# List all entries
python scripts/kb_manager.py list

# List by category
python scripts/kb_manager.py list docs
```

#### Retrieve Full Entry

```bash
# Get markdown content of specific entry
python scripts/kb_manager.py get abc123def456
```

### 3. Integration with Claude

When using this skill in Claude Code:

1. **Read the Python script** at `scripts/kb_manager.py` to understand the API
2. **Import the manager** for programmatic use:

   ```python
   # Assumes the scripts/ directory is on sys.path (for example, when run from inside scripts/)
   from kb_manager import KnowledgeBaseManager

   kb = KnowledgeBaseManager()
   entry_ids = kb.ingest_markdown("path/to/file.md", "docs")
   results = kb.search("my query")
   ```

3. **Use CLI mode** for automated workflows with subprocess calls
4. **Reference the schema** in `references/kb_schema.md` for data structure details

### 4. Configuration

Default behavior:

- Knowledge base stored in `.knowledge_base/` directory
- Vector model: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- Text chunk size: 500 characters (configurable in code)
- Deduplication: Based on file content hash

### 5. Dependencies

**Required:**

- Python 3.7+

**Optional (for full functionality):**

- `lancedb` - For semantic search with vectors
- `sentence-transformers` - For generating embeddings
- `pypdf` - For PDF text extraction

Install all optional dependencies:

```bash
pip install lancedb sentence-transformers pypdf
```

## Architecture Details

### Entry ID Generation

Entry IDs are deterministic MD5 hashes that combine the source file path and a section identifier. This ensures that:

- Re-ingesting the same file produces the same IDs
- Entries can be deduplicated and updated in place
- References remain stable across sessions

### Semantic Search

When LanceDB is available:

1. Entry content is encoded as a 384-dimensional vector using `all-MiniLM-L6-v2`
2. Query text is encoded into the same vector space
3. L2 distance is used to find semantically similar entries
4. Results are ranked by similarity score

The skill falls back to keyword search if LanceDB is unavailable.

### File Deduplication

Files are tracked by the MD5 hash of their content:

- The hash is stored in metadata
- Duplicate processing is prevented
- Re-ingestion is safe (a file with the same hash overwrites its existing entries)
- Processed files leave a clean audit trail

## Practical Examples

### Building a Documentation KB

```bash
# Ingest all documentation
python scripts/kb_manager.py ingest docs/getting-started.md docs
python scripts/kb_manager.py ingest docs/api-reference.md docs
python scripts/kb_manager.py ingest docs/troubleshooting.md docs

# Search semantically
python scripts/kb_manager.py search "how do I install the library?"
```

### Technical Knowledge Base

```bash
# Ingest from various sources
python scripts/kb_manager.py ingest design/architecture.md architecture
python scripts/kb_manager.py ingest notes/patterns.txt patterns
python scripts/kb_manager.py ingest manual.pdf reference

# Query by topic
python scripts/kb_manager.py search "microservices patterns"
```

## Output

The knowledge base creates:

- **`.knowledge_base/entries/*.md`** - Individual entry markdown files (git-friendly)
- **`.knowledge_base/metadata.json`** - Index and tracking (git-friendly)
- **`.knowledge_base/lancedb/`** - Vector index (binary, not git-tracked)

All text data is human-readable markdown, suitable for version control and review.

## Troubleshooting

**Vector search not working:**

- Install dependencies: `pip install lancedb sentence-transformers`
- The skill falls back to keyword search automatically if they are unavailable

**PDF extraction issues:**

- Install pypdf: `pip install pypdf`
- Some PDF types (for example, scanned images) contain no extractable text

**Memory issues with large files:**

- Text chunking reduces memory usage (adjust the chunk size in code if needed)
- Markdown headers naturally limit entry size

## See Also

- `references/kb_schema.md` - Complete data structure documentation
- `scripts/kb_manager.py` - Full implementation and Python API
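
The ingestion behavior described above (deterministic entry IDs, content-hash deduplication, header-based Markdown splitting, and 500-character text chunks) can be illustrated with a minimal sketch. This is not the code in `scripts/kb_manager.py`: the function names `entry_id`, `content_hash`, `split_markdown`, and `chunk_text`, and the `path::section` key format, are hypothetical and chosen only for illustration.

```python
import hashlib
import re
from typing import List, Tuple

# ATX-style Markdown header: one to six '#' characters, whitespace, then a title.
HEADER = re.compile(r"^(#{1,6})\s+(.+)$")


def entry_id(source_path: str, section: str) -> str:
    """Deterministic ID: MD5 over the source path plus a section identifier.

    The "path::section" key format is an assumption made for this sketch.
    """
    key = f"{source_path}::{section}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def content_hash(data: bytes) -> str:
    """MD5 of the raw file content, usable as a deduplication key."""
    return hashlib.md5(data).hexdigest()


def split_markdown(text: str) -> List[Tuple[str, str]]:
    """Split Markdown into (title, body) pairs, one per header.

    Lines inside fenced code blocks are never treated as headers, so a
    shell comment such as "# Ingest files" stays in its section body.
    Text before the first header is dropped in this simplified version.
    """
    sections: List[Tuple[str, str]] = []
    title, body, in_fence = None, [], False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER.match(line)
        if match:
            if title is not None:
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        elif title is not None:
            body.append(line)
    if title is not None:
        sections.append((title, "\n".join(body).strip()))
    return sections


def chunk_text(text: str, size: int = 500) -> List[str]:
    """Cut plain text into sequential fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Because `entry_id` depends only on the path and the section identifier, ingesting an unchanged file a second time yields the same IDs, which is the property that makes overwriting on re-ingestion safe. Fixed-size character chunks are simple but can cut words or sentences in half; a real implementation might prefer to break on whitespace or sentence boundaries.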