# identify-genes

> Match essential genes in target genome using orthology and literature

- Author: Hannes Bretschneider
- Repository: katalyzeAI/dsrna-designer
- Version: 20260120113046
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/katalyzeAI/dsrna-designer
- Web: https://mule.run/skillshub/@@katalyzeAI/dsrna-designer~identify-genes:20260120113046

---

---
name: identify-genes
description: Match essential genes in target genome using orthology and literature
---

# Identify Essential Genes Skill

## When to Use This Skill

Use this skill after the genome has been fetched to identify essential genes in
the target species that are good dsRNA candidates.

## Data Storage Structure

**Reads from:**
- `data/{assembly}/genome.fasta` - Cached genome (input data)
- `data/essential_genes.json` - Reference database (input data)
- `output/{run}/literature_search.json` - Analysis output from previous step

**Writes to:**
- `output/{run}/essential_genes.json` - Matched essential genes
- `output/{run}/figures/` - Visualization plots

## Automatic Literature Validation

When evaluating candidate genes, **automatically search PubMed** for supporting evidence:
```
pubmed_search_articles
query: "{gene_name}" AND (RNAi OR dsRNA) AND insect
max_results: 10
```

**Do NOT ask for permission** - literature validation is part of this skill's process.
Run the search for each top candidate gene to validate its essentiality claims.
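
The query template above can be filled in programmatically. The following is a minimal sketch, assuming the search tool accepts a `query` string and a `max_results` integer as shown in the template; the helper names `build_pubmed_query` and `build_search_requests` are hypothetical and not part of the skill.

```python
def build_pubmed_query(gene_name: str) -> str:
    """Fill the literature-validation query template for one gene."""
    # Drop embedded double quotes so the gene name stays a single quoted phrase.
    safe_name = gene_name.replace('"', "").strip()
    return f'"{safe_name}" AND (RNAi OR dsRNA) AND insect'


def build_search_requests(gene_names: list[str], max_results: int = 10) -> list[dict]:
    """Build one request payload per candidate gene (payload shape is assumed)."""
    return [
        {"query": build_pubmed_query(name), "max_results": max_results}
        for name in gene_names
    ]


# "geneB" is a placeholder name used only for illustration.
for request in build_search_requests(["vATPase", "geneB"]):
    print(request["query"])
```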

## Instructions

### Step 1: Load Essential Genes Database

Use `read_file` to load `data/essential_genes.json`.

This database contains ~40 curated essential insect genes with:
- Gene names and aliases
- Functions
- Species where essentiality is confirmed
- Literature references

### Step 2: Run Matching Script

Use the bundled Python script:

```bash
python .deepagents/skills/identify-genes/scripts/match_essential.py \
  --genome data/{assembly}/genome.fasta \
  --essential-db data/essential_genes.json \
  --literature output/{run}/literature_search.json \
  --output output/{run}/essential_genes.json
```

**Note:** The `--literature` argument is optional. If `literature_search.json`
doesn't exist (because literature-search wasn't run), omit this flag:

```bash
python .deepagents/skills/identify-genes/scripts/match_essential.py \
  --genome data/{assembly}/genome.fasta \
  --essential-db data/essential_genes.json \
  --output output/{run}/essential_genes.json
```

The script:
1. Parses FASTA annotations
2. Matches gene names/aliases against annotations
3. Scores by orthology + literature support (if available)
4. Returns top 20 with sequences
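
To make the parsing and matching steps concrete, here is a simplified sketch. It is not the bundled `match_essential.py`: the database field names `name` and `aliases`, the whole-word, case-insensitive matching rule, and the functions `parse_fasta` and `match_genes` are all assumptions made for illustration.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def match_genes(fasta_text, essential_db):
    """Return one record per FASTA entry whose header mentions a known gene name or alias."""
    matches = []
    for header, sequence in parse_fasta(fasta_text):
        for entry in essential_db:
            names = [entry["name"], *entry.get("aliases", [])]
            # Whole-word, case-insensitive match against the annotation header.
            if any(re.search(rf"\b{re.escape(n)}\b", header, re.IGNORECASE) for n in names):
                matches.append({
                    "gene_id": header.split()[0],
                    "gene_name": entry["name"],
                    "sequence": sequence,
                    "sequence_length": len(sequence),
                })
                break  # one database entry per FASTA record is enough
    return matches


# Toy input; the header layout and the alias are illustrative only.
fasta = ">lcl|NC_XXX_cds_XP_XXX [gene=vATPase]\nATGC\nGT\n>other_id [gene=unrelated]\nAAAA\n"
db = [{"name": "vATPase", "aliases": ["V-ATPase"]}]
print(match_genes(fasta, db))
```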

### Step 3: Verify Results

Use `shell`:

```bash
jq 'length' output/{run}/essential_genes.json
```

This should report about 20 genes (fewer if the genome is poorly annotated).
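
Beyond counting records, the structure of the file can also be checked. The sketch below assumes the record keys listed under Output Format and a limit of 20 genes; `check_results` is a hypothetical helper, not part of the skill.

```python
import json

REQUIRED_KEYS = {
    "gene_id", "gene_name", "function", "score",
    "evidence", "sequence", "sequence_length",
}


def check_results(genes, max_genes=20):
    """Return a list of human-readable problems; an empty list means the data looks sane."""
    if not isinstance(genes, list):
        return ["top-level JSON value is not a list"]
    problems = []
    if not genes:
        problems.append("no genes were matched")
    if len(genes) > max_genes:
        problems.append(f"expected at most {max_genes} genes, found {len(genes)}")
    for index, gene in enumerate(genes):
        missing = REQUIRED_KEYS - gene.keys()
        if missing:
            problems.append(f"record {index} is missing keys: {sorted(missing)}")
        elif not 0.0 <= gene["score"] <= 1.0:
            problems.append(f"record {index} has out-of-range score {gene['score']}")
    return problems


# In-memory example; a real check would parse output/{run}/essential_genes.json instead.
sample = json.loads('[{"gene_id": "id1", "gene_name": "vATPase"}]')
print(check_results(sample))
```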

### Step 4: Generate Visualization

Create plots showing gene rankings:

```bash
python .deepagents/skills/identify-genes/scripts/plot_genes.py \
  --genes output/{run}/essential_genes.json \
  --output-dir output/{run}/figures/
```

This creates:
- `gene_ranking.png` - Horizontal bar chart of top 10 genes with scores
- `gene_evidence_breakdown.png` - Stacked bar showing evidence sources (orthology/literature)
- `gene_length_distribution.png` - CDS lengths for identified genes

### Step 5: Present Results

Output this summary to the user:

```
## Identify Genes Complete

**Summary:**
- {gene_count} essential genes identified in genome
- Top gene: {top_gene} (score: {score})
- {literature_supported} genes have literature support

**Top 5 Genes:**
| Rank | Gene | Score | Literature | Species Evidence |
|------|------|-------|------------|------------------|
| 1 | ... | ... | ... | ... |

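**Files Created:**
- `output/{run}/essential_genes.json`
- `output/{run}/figures/gene_ranking.png`
- `output/{run}/figures/gene_evidence_breakdown.png`
- `output/{run}/figures/gene_length_distribution.png`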

**Figures:** [Show gene ranking plot]

---
Proceed to design-dsrna for top 5 genes? (yes/no)
```
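
The placeholders in the summary template can be derived from the matched gene records. The following sketch assumes the record layout shown under Output Format; `build_summary` is a hypothetical helper, and the yes/no rendering of the Literature column is an illustrative choice.

```python
def build_summary(genes, top_n=5):
    """Render the summary header and top-gene table from matched gene records."""
    ranked = sorted(genes, key=lambda g: g["score"], reverse=True)
    literature_supported = sum(1 for g in ranked if g["evidence"]["literature_support"])

    lines = [
        "## Identify Genes Complete",
        "",
        "**Summary:**",
        f"- {len(ranked)} essential genes identified in genome",
    ]
    if ranked:
        top = ranked[0]
        lines.append(f"- Top gene: {top['gene_name']} (score: {top['score']:.2f})")
    lines += [
        f"- {literature_supported} genes have literature support",
        "",
        f"**Top {top_n} Genes:**",
        "| Rank | Gene | Score | Literature | Species Evidence |",
        "|------|------|-------|------------|------------------|",
    ]
    for rank, gene in enumerate(ranked[:top_n], start=1):
        evidence = gene["evidence"]
        literature = "yes" if evidence["literature_support"] else "no"
        species = ", ".join(evidence["essential_in_species"])
        lines.append(f"| {rank} | {gene['gene_name']} | {gene['score']:.2f} | {literature} | {species} |")
    return "\n".join(lines)


# Toy records; "geneB" is a placeholder name.
example = [
    {"gene_name": "geneB", "score": 0.55,
     "evidence": {"literature_support": False, "essential_in_species": ["D. melanogaster"]}},
    {"gene_name": "vATPase", "score": 0.90,
     "evidence": {"literature_support": True,
                  "essential_in_species": ["D. melanogaster", "T. castaneum"]}},
]
print(build_summary(example))
```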

## Scoring Logic

Each matched gene receives a score between 0.50 and 1.0: a base score for the match plus bonuses for supporting evidence.

| Component | Points | Condition |
|-----------|--------|-----------|
| Base (ortholog match) | 0.50 | Gene name/alias found in genome |
| Literature support | +0.30 | Gene mentioned in PubMed results |
| Multi-species essential | +0.05 per species (capped at +0.20) | Essentiality confirmed in that species |

**Maximum score: 1.0**
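
The table can be expressed as a short function. This is a sketch of the rules as written here, not the bundled script's code; `score_gene` is a hypothetical name, and returning 0.0 for an unmatched gene is an assumption, since unmatched genes are presumably not scored at all.

```python
def score_gene(ortholog_match: bool, literature_support: bool, essential_species_count: int) -> float:
    """Combine the base score and the evidence bonuses from the scoring table."""
    if not ortholog_match:
        return 0.0  # assumption: without a genome match there is nothing to score
    score = 0.50                                       # base: name/alias found in genome
    if literature_support:
        score += 0.30                                  # gene mentioned in PubMed results
    score += min(0.05 * essential_species_count, 0.20)  # per-species bonus, capped
    return round(score, 2)


print(score_gene(True, False, 0))   # base score only
print(score_gene(True, True, 6))    # all bonuses, species bonus capped
```

Rounding to two decimals keeps floating-point noise out of the reported scores.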

## Output Format

`output/{run}/essential_genes.json`:
```json
[
  {
    "gene_id": "lcl|NC_XXX_cds_XP_XXX",
    "gene_name": "vATPase",
    "function": "Vacuolar proton pump - essential for pH homeostasis",
    "score": 0.90,
    "evidence": {
      "ortholog_match": true,
      "literature_support": true,
      "essential_in_species": ["D. melanogaster", "T. castaneum"]
    },
    "sequence": "ATGCGT...",
    "sequence_length": 1842
  }
]
```

## Expected Output

All outputs go in `output/{run}/`:
- `essential_genes.json`
- `figures/gene_ranking.png`
- `figures/gene_evidence_breakdown.png`
- `figures/gene_length_distribution.png`

## Available Tools

- `read_file` - Load databases
- `shell` - Run the matching and plotting scripts
- `write_file` - Save results