# bio-fasta

> Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration.

- Author: dakesan
- Repository: dakesan/cc-dnawork-plugin
- Version: 20251227151335
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/dakesan/cc-dnawork-plugin
- Web: https://mule.run/skillshub/@@dakesan/cc-dnawork-plugin~bio-fasta:20251227151335

---

---
name: bio-fasta
description: "Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration."
user_invocable: true
---

# Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

## When to Use This Skill

This skill should be used when:

- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)

## When NOT to Use This Skill

- **NGS alignment files (SAM/BAM/VCF)** → Use `pysam`
- **BLAST searches** → Use `gget` (quick) or `blat-integration` (large-scale)
- **Multiple sequence alignment** → Use `msa-advanced`
- **Phylogenetic analysis** → Use `etetoolkit`
- **NCBI database queries** → Use `pubmed-database` or `gene-database`

## Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO.convert()` | `biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` + faidx | `faidx.md` |
| GC%, Tm, molecular weight | `Bio.SeqUtils` | `utilities.md` |

## Quick Start

### Installation

```bash
uv pip install biopython pysam
```

### Read FASTA

```python
from Bio import SeqIO

# Stream records one at a time; the file is never fully loaded into memory
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

### Convert GenBank to FASTA

```python
from Bio import SeqIO

# Arguments: input path, input format, output path, output format
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

### Random Access with faidx

```python
import pysam

# Create index (once); writes reference.fasta.fai next to the FASTA
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based, half-open coordinates
fasta.close()
```

### Sequence Operations

```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
print(seq.complement())
print(seq.reverse_complement())
# 13 nt is not a multiple of three, so translate() warns about the trailing partial codon
print(seq.translate())
```

## Reference Documentation

Consult the appropriate reference file for detailed documentation:

### `references/biopython_seqio.md`

- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns

### `references/faidx.md`

- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)

### `references/utilities.md`

- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites

### `references/formats.md`

- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation

## Coordinate Systems

**Biopython**: Uses Python-style 0-based, half-open intervals for slicing.

**pysam.FastaFile.fetch()**:

- Numeric arguments: 0-based, half-open (`fetch("chr1", 999, 2000)` = 0-based positions 999-1999)
- Region strings: 1-based, inclusive (`fetch(region="chr1:1000-2000")` = 1-based positions 1000-2000)

Both calls return the same 1001 bases.

## Common Pitfalls

1. **Coordinate confusion**: Remember which tool uses 0-based vs 1-based coordinates; `fetch()` numeric arguments are 0-based and half-open, while region strings are 1-based and inclusive
2. **Missing faidx index**: Random access requires a `.fai` index file
3. **Format mismatch**: Verify that the file format matches the format string passed to `SeqIO.parse()`
4. **Iterator exhaustion**: `SeqIO.parse()` returns a single-pass iterator; convert it to a list only if multiple passes are needed and the file fits in memory
5. **Large files**: Iterate record by record instead of calling `list()`, which loads every record into memory
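
To make the coordinate pitfall concrete, here is a minimal sketch that converts a 1-based, inclusive region string into the 0-based, half-open numeric arguments described under Coordinate Systems. The helper name `region_to_zero_based` is hypothetical (it is not part of pysam or Biopython), and it assumes plain `contig:start-end` regions without thousands separators.

```python
from typing import Tuple


def region_to_zero_based(region: str) -> Tuple[str, int, int]:
    """Convert 'contig:start-end' (1-based, inclusive) to (contig, start, end)
    in 0-based, half-open form. Hypothetical helper for illustration only."""
    # Split on the last colon so contig names that contain ':' still work
    contig, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start_1based = int(start_text)
    end_1based = int(end_text)
    if not contig or start_1based < 1 or end_1based < start_1based:
        raise ValueError(f"invalid region: {region!r}")
    # Shift only the start: a 1-based inclusive end equals the 0-based exclusive end
    return contig, start_1based - 1, end_1based


contig, start, end = region_to_zero_based("chr1:1000-2000")
print(contig, start, end)  # chr1 999 2000
print(end - start)         # 1001 bases
```

The resulting tuple matches the numeric form of the call shown earlier, so it could be passed as `fasta.fetch(contig, start, end)`.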