# nl-sql-query > Query the CDM DuckDB database using natural language. Use this skill when the user wants to query the linkml-coral CDM database with questions in plain English instead of writing SQL. The skill translates natural language to SQL using Claude API and returns formatted results. - Author: marcin p. joachimiak - Repository: realmarcin/linkml-coral - Version: 20260205195021 - Stars: 0 - Forks: 1 - Last Updated: 2026-02-07 - Source: https://github.com/realmarcin/linkml-coral - Web: https://mule.run/skillshub/@@realmarcin/linkml-coral~nl-sql-query:20260205195021 --- --- name: nl-sql-query description: Query the CDM DuckDB database using natural language. Use this skill when the user wants to query the linkml-coral CDM database with questions in plain English instead of writing SQL. The skill translates natural language to SQL using Claude API and returns formatted results. --- # Natural Language SQL Query for CDM Database Query the KBase CDM DuckDB database using natural language questions. This skill translates your questions to SQL and executes them against the database. ## When to Use This Skill Use this skill when: - User asks to query the CDM database with natural language - User wants to explore data without writing SQL - User asks questions like "How many samples are there?" or "Show me reads with high counts" - User wants to invoke `/nl-sql-query` command ## Prerequisites Before using this skill: 1. **CDM Database**: Must exist at `cdm_store.db` (or specify custom path) 2. **API Key**: `ANTHROPIC_API_KEY` environment variable must be set 3. **Load Data**: Run `just load-cdm-store-bricks-64gb` if database doesn't exist ## How It Works 1. **Schema Inspection**: Reads database schema (tables, columns, types, counts) 2. **AI Translation**: Sends question + schema to Claude API 3. **SQL Generation**: Claude generates optimized DuckDB SQL 4. **Execution**: Runs query and formats results 5. **Display**: Returns results in clean table format or JSON ## Usage ### Basic Query ```bash just cdm-nl-query "How many samples are there?" ``` ### JSON Output ```bash just cdm-nl-query-json "List all locations with sample counts" ``` ### Verbose (See Generated SQL) ```bash just cdm-nl-query-verbose "Find reads with read_count over 50000" ``` ### Direct Python Usage ```bash uv run python scripts/cdm_analysis/nl_sql_query.py \ --db cdm_store.db \ "Show me the top 10 samples by depth" ``` ## Example Queries **Quick Examples:** - "How many samples are in the database?" - "Find samples with depth greater than 100" - "Show me the top 10 locations by sample count" - "List reads with read_count over 50000" **For complete examples**: See EXAMPLES.md for 50+ example queries covering: - Basic statistics and counts - Filtered queries (numeric, string, date) - Aggregations and grouping - Joins and relationships - Complex queries with subqueries - Domain-specific analysis (sequencing, taxonomy, geography) - Data quality and validation queries ## Database Schema Overview The CDM database contains these main table types: ### Static Entity Tables (sdt_*) - `sdt_location`: Geographic locations - `sdt_sample`: Sample metadata - `sdt_reads`: Sequencing reads data - `sdt_assembly`: Genome assemblies - `sdt_genome`: Genome annotations - `sdt_gene`: Gene predictions - `sdt_otu` (ASV): 16S amplicon sequence variants ### System Tables (sys_*) - `sys_oterm`: Ontology terms - `sys_type_def`: Type definitions - `sys_process`: Processing records ### Dynamic Tables (ddt_*) - `ddt_brick*`: Measurement array data (21 brick tables) ## Common Foreign Keys - `sample_id`: Links to samples - `location_id`: Links to locations - `reads_id`: Links to reads - `assembly_id`: Links to assemblies - `genome_id`: Links to genomes ## Output Formats ### Text Format (Default) ``` Natural Query: How many samples are there? Generated SQL: SELECT COUNT(*) as total_samples FROM sdt_sample Results (1 rows): total_samples ------------- 1523 ``` ### JSON Format ```json { "natural_query": "How many samples are there?", "sql_query": "SELECT COUNT(*) as total_samples FROM sdt_sample", "result_count": 1, "results": [ {"total_samples": 1523} ] } ``` ## Implementation Steps When user invokes this skill: 1. **Check Prerequisites** ```bash # Verify database exists ls -l cdm_store.db || ls -l cdm_store_bricks_full.db # Check API key echo $ANTHROPIC_API_KEY | grep -q "sk-" && echo "✓ API key set" || echo "✗ Set ANTHROPIC_API_KEY" ``` 2. **Execute Query** ```bash just cdm-nl-query "USER_QUESTION_HERE" ``` 3. **Handle Results** - Display results to user - If query fails, show generated SQL and error - Suggest fixes if SQL is invalid ## Error Handling **Database Not Found:** ```bash # Load the database first just load-cdm-store-bricks-64gb ``` **API Key Missing:** ```bash export ANTHROPIC_API_KEY="your-api-key-here" ``` **Invalid SQL Generated:** - Run with `--verbose` to see generated SQL - Check if table/column names are correct - Rephrase the question more specifically ## Advanced Options ### Custom Database Path ```bash uv run python scripts/cdm_analysis/nl_sql_query.py \ --db /path/to/custom.db \ "your question" ``` ### Custom API Key ```bash uv run python scripts/cdm_analysis/nl_sql_query.py \ --api-key "sk-ant-..." \ "your question" ``` ### Save Results to File ```bash just cdm-nl-query-json "your question" > results.json ``` ## Tips for Better Results 1. **Be Specific**: "Show me samples with depth > 100" instead of "Show me deep samples" 2. **Use Table Names**: Reference actual table names when known (sdt_sample, sdt_reads) 3. **Specify Limits**: "Top 10 locations" instead of "Show locations" 4. **Use Field Names**: "samples with read_count > 50000" uses actual column names 5. **Ask for Context**: "Show samples with their locations" triggers JOINs automatically ## Troubleshooting **Issue**: Query returns no results - Check if data actually exists: `just cdm-store-stats` - Verify table/column names with verbose mode **Issue**: SQL syntax error - Run with `--verbose` to see generated SQL - Check DuckDB documentation for syntax - Rephrase question using simpler terms **Issue**: Query too slow - Add LIMIT clauses: "Show first 100 samples..." - Use indexed columns when possible - Consider using brick tables with caution (large datasets) ## Related Commands ```bash # Show database statistics just cdm-store-stats # Pre-defined queries just cdm-find-samples Location0000001 just cdm-search-oterm "soil" just cdm-lineage Assembly Assembly0000001 # View documentation cat docs/CDM_PARQUET_STORE_GUIDE.md ``` ## Technical Details **Script Location**: `scripts/cdm_analysis/nl_sql_query.py` **Dependencies**: - `anthropic>=0.39.0` - Claude API client - `duckdb` - Database engine - `linkml-store>=0.2.0` - Data management **Model**: Claude Sonnet 4 (claude-sonnet-4-20250514) **Token Limits**: - Schema description: ~1-2K tokens - Query translation: ~1K tokens max - Results returned up to 100 rows by default