# bio-fasta

> Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration.

- Author: dakesan
- Repository: dakesan/cc-dnawork-plugin
- Version: 20251227151335
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/dakesan/cc-dnawork-plugin
- Web: https://mule.run/skillshub/@@dakesan/cc-dnawork-plugin~bio-fasta:20251227151335

---

---
name: bio-fasta
description: "Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration."
user_invocable: true
---

# Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

## When to Use This Skill

This skill should be used when:

- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)

## When NOT to Use This Skill

- **NGS alignment files (SAM/BAM/VCF)** → Use `pysam`
- **BLAST searches** → Use `gget` (quick) or `blat-integration` (large-scale)
- **Multiple sequence alignment** → Use `msa-advanced`
- **Phylogenetic analysis** → Use `etetoolkit`
- **NCBI database queries** → Use `pubmed-database` or `gene-database`

## Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO.convert()` | `biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` + faidx | `faidx.md` |
| GC%, Tm, molecular weight | `Bio.SeqUtils` | `utilities.md` |

## Quick Start

### Installation

```bash
uv pip install biopython pysam
```

### Read FASTA

```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

### Convert GenBank to FASTA

```python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

### Random Access with faidx

```python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based, half-open: 0-based positions 1000-1999 (1,000 bases)
fasta.close()
```

### Sequence Operations

```python
from Bio.Seq import Seq

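# Length is a multiple of 3, so translate() sees only complete codons
seq = Seq("ATGCGATCGATC")
print(seq.complement())          # TACGCTAGCTAG
print(seq.reverse_complement())  # GATCGATCGCAT
print(seq.translate())           # MRSI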
```

## Reference Documentation

Consult the appropriate reference file for detailed documentation:

### `references/biopython_seqio.md`

- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns
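
As a minimal sketch of the writing side, the example below builds two `SeqRecord` objects in memory and writes them as FASTA. The record IDs, descriptions, and sequences are invented for illustration, and a `StringIO` handle stands in for a real output file.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records, made up for this example.
records = [
    SeqRecord(Seq("ATGCGATCGATC"), id="seq1", description="example one"),
    SeqRecord(Seq("GGGCCCAAATTT"), id="seq2", description="example two"),
]

handle = StringIO()  # replace with a file path such as "output.fasta" in real use
# SeqIO.write() returns the number of records written.
written = SeqIO.write(records, handle, "fasta")
fasta_text = handle.getvalue()

print(written)
print(fasta_text)
```

Passing a different format string (for example `"genbank"`) uses the same call, although some formats require extra annotations on each record before they can be written.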

### `references/faidx.md`

- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)

### `references/utilities.md`

- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites
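
The first three utilities can be sketched as follows. This assumes a Biopython release that provides `gc_fraction` (1.80 or later, to the best of my knowledge); the 12-mer is a made-up sequence, and the melting temperature uses the nearest-neighbor method with the library's default salt and oligo concentrations, which may not match your experimental conditions.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import gc_fraction, molecular_weight

seq = Seq("ATGCGATCGATC")  # made-up 12-mer for illustration

# Fraction of G and C bases, between 0 and 1 (not a percentage).
gc = gc_fraction(seq)

# Molecular weight of the sequence treated as single-stranded DNA.
mw = molecular_weight(seq, seq_type="DNA")

# Nearest-neighbor melting temperature estimate in degrees Celsius,
# using Biopython's default concentrations.
tm = mt.Tm_NN(seq)

print(f"GC fraction: {gc:.2f}")
print(f"Molecular weight: {mw:.1f}")
print(f"Tm (nearest neighbor): {tm:.1f}")
```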

### `references/formats.md`

- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation
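
To make the FASTQ quality line concrete, here is a small pure-Python sketch of Phred+33 decoding (the Sanger encoding, also used by Illumina 1.8+). The helper names `decode_phred33` and `error_probability` are hypothetical; when parsing with `Bio.SeqIO`, the same integer scores are available on each record as `record.letter_annotations["phred_quality"]`.

```python
def decode_phred33(quality_line: str) -> list[int]:
    """Convert a Phred+33 quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality_line]


def error_probability(score: int) -> float:
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# 'I' is ASCII 73, '5' is ASCII 53, '!' is ASCII 33.
scores = decode_phred33("II5!")
print(scores)  # [40, 40, 20, 0]
print([error_probability(s) for s in scores])
```

Older Illumina pipelines used a +64 offset instead, so the offset is an assumption to check against how the file was produced.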

## Coordinate Systems

**Biopython**: Uses Python-style 0-based, half-open intervals for slicing (`seq[0:3]` returns the first three bases).

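**pysam.FastaFile.fetch()**:
- Numeric arguments: 0-based, half-open (`fetch("chr1", 999, 2000)` returns 0-based positions 999-1999, i.e. 1-based positions 1000-2000)
- Region strings: 1-based, inclusive (`fetch(region="chr1:1000-2000")` returns 1-based positions 1000-2000, the same 1,001 bases)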
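
The relationship between the two conventions can be sketched with a small converter. `region_to_zero_based` is a hypothetical helper, not part of pysam or Biopython, and it assumes simple `contig:start-end` strings with no commas in the numbers and no colons in the contig name.

```python
import re

# Matches "contig:start-end" with plain integer coordinates.
_REGION = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def region_to_zero_based(region: str) -> tuple[str, int, int]:
    """Convert a 1-based inclusive region string to (contig, start, end),
    where start/end form a 0-based, half-open interval."""
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"Unrecognized region string: {region!r}")
    start = int(match["start"])
    end = int(match["end"])
    if start < 1 or end < start:
        raise ValueError(f"Invalid coordinates in region: {region!r}")
    # Only the start shifts by one; the inclusive 1-based end equals
    # the exclusive 0-based end.
    return match["contig"], start - 1, end


contig, start, end = region_to_zero_based("chr1:1000-2000")
print(contig, start, end)  # chr1 999 2000
print(end - start)         # 1001 bases in the interval
```

The resulting tuple can be passed as numeric arguments to a fetch call or used directly as Python slice bounds.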

## Common Pitfalls

1. **Coordinate confusion**: Biopython slicing and numeric `fetch()` arguments are 0-based, half-open; `fetch()` region strings are 1-based, inclusive
2. **Missing faidx index**: Random access requires a `.fai` file
3. **Format mismatch**: Verify file format matches the format string in `SeqIO.parse()`
4. **Iterator exhaustion**: `SeqIO.parse()` returns a single-pass iterator; for small files, convert to a list if multiple passes are needed
5. **Large files**: Iterate over records instead of calling `list()` to keep memory use low; call `SeqIO.parse()` again if another pass is needed
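
Pitfalls 4 and 5 can be illustrated with a short sketch. The two-record FASTA text is made up, and a `StringIO` handle stands in for a file on disk.

```python
from io import StringIO

from Bio import SeqIO

fasta_text = ">a\nACGT\n>b\nGGCC\n"  # made-up input for illustration

# A single parse() call yields each record exactly once.
records = SeqIO.parse(StringIO(fasta_text), "fasta")
first_pass = [rec.id for rec in records]
second_pass = [rec.id for rec in records]  # iterator is already consumed
print(first_pass)   # ['a', 'b']
print(second_pass)  # []

# Small input: materialize once, then reuse the list freely.
all_records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Large input: stream records and start a fresh parse() for each pass,
# so only one record is held in memory at a time.
total_bases = sum(
    len(rec.seq) for rec in SeqIO.parse(StringIO(fasta_text), "fasta")
)
print(len(all_records), total_bases)
```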