# bulk-rnaseq-counts-to-de-deseq2

> Perform differential expression analysis using DESeq2 on RNA-seq raw count data. Use when you have integer count matrices with biological replicates (n≥2 per group), need log fold change shrinkage for gene ranking, or want conservative p-value estimates. Best for medium to large sample sizes (n≥4 recommended). Creates DESeqDataSet objects, performs size factor normalization, estimates dispersions, and tests for differential expression using the Wald test or likelihood ratio test.

- Author: gm-phylo
- Repository: phylo-dev/workflows
- Version: 20260129135929
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/phylo-dev/workflows
- Web: https://mule.run/skillshub/@@phylo-dev/workflows~bulk-rnaseq-counts-to-de-deseq2:20260129135929

---

---
name: bulk-rnaseq-counts-to-de-deseq2
description: Perform differential expression analysis using DESeq2 on RNA-seq raw count data. Use when you have integer count matrices with biological replicates (n≥2 per group), need log fold change shrinkage for gene ranking, or want conservative p-value estimates. Best for medium to large sample sizes (n≥4 recommended). Creates DESeqDataSet objects, performs size factor normalization, estimates dispersions, and tests for differential expression using the Wald test or likelihood ratio test.
---

# DESeq2 Differential Expression Analysis

Core DESeq2 workflow for RNA-seq differential expression analysis with count data.

## When to Use This Skill

Use DESeq2 when you have:
- ✅ **Raw integer count data** (not normalized TPM/FPKM)
- ✅ **Biological replicates** (≥2 per condition, ≥4 recommended)
- ✅ Need for **log fold change shrinkage** (ranking/visualization)
- ✅ **Medium to large sample sizes** (DESeq2's strength)

**Don't use DESeq2 for:**
- ❌ Normalized data (TPM/FPKM) → use limma-voom instead
- ❌ Very small samples (n=2-3) → consider edgeR quasi-likelihood

## Installation

```r
if (!require('BiocManager', quietly = TRUE))
    install.packages('BiocManager')

BiocManager::install(c('DESeq2', 'apeglm'))
```

**License:** LGPL (>= 3) (commercial use permitted)

## Inputs

**Required:**
- **Count matrix**: Raw integer counts (genes × samples)
  - Rows = genes (any identifier: Ensembl, symbols, etc.)
  - Columns = samples
  - Values = non-negative integers
- **Sample metadata**: Data frame with sample information
  - Row names must match count matrix column names
  - Required column: `condition` (factor with 2+ levels)
  - Optional: batch, covariates for complex designs

**Alternative inputs:**
- Salmon/Kallisto output (via tximport)
- SummarizedExperiment object
- featureCounts/HTSeq output

**See [references/deseq2-reference.md](references/deseq2-reference.md#alternative-input-formats) for code examples.**

## Outputs

**Primary results:**
- `deseq2_results.csv` - Full differential expression table (baseMean, log2FC, lfcSE, pvalue, padj)
- `deseq2_results_shrunk.csv` - Shrunken LFC for visualization/ranking
- `dds_object.rds` - DESeqDataSet for further analysis

**Normalized data:**
- `normalized_counts.csv` - Size-factor normalized counts
- `vst_transformed.csv` / `rlog_transformed.csv` - Variance-stabilized values

**QC plots (SVG, 300 DPI):**
- `dispersion_plot.svg` - Dispersion estimates vs mean
- `pca_plot.svg` - Principal component analysis
- `ma_plot.svg` - Mean-average plot

## Clarification Questions

Before starting, gather:

1. **Data Format**:
   - Raw count matrix (CSV/TSV: genes × samples)?
   - SummarizedExperiment object?
   - Salmon/Kallisto output?
   - Bioconductor data package (airway, pasilla)?
   - ⚠️ **Normalized data (TPM/FPKM)**: Use limma-voom instead

2. **Sample Metadata**:
   - What is the primary comparison (e.g., treated vs control)?
   - Which group is the control (reference level)?
   - Any covariates to adjust for (batch, sex, sequencing run)?

3. **Experimental Design** (CRITICAL):
   - **Simple two-group** (`~ condition`): No covariates
   - **Multi-factor** (`~ batch + condition`): Adjust for batch effects
   - **Paired samples** (`~ individual + condition`): Matched/paired
   - **Interaction** (`~ genotype * treatment`): Test if effect differs by group

4. **Sample Size**:
   - n ≥ 4 per group recommended
   - n = 2-3: Consider edgeR instead
   - n < 2: Insufficient for testing

5. **Significance Thresholds**:
   - Standard: padj < 0.05, |log2FC| > 1
   - Relaxed: padj < 0.1
   - Stringent: padj < 0.01, |log2FC| > 2

6. **Analysis Goals**:
   - Simple pairwise comparison?
   - Multiple comparisons?
   - Need LFC shrinkage for visualization?

## Standard Workflow

### Complete Analysis in 6 Steps

```r
library(DESeq2)
library(apeglm)

# 1. Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
    countData = counts,        # Matrix: genes × samples (integers)
    colData = coldata,          # Data frame: sample metadata
    design = ~ condition        # Formula: experimental design
)

# 2. Pre-filter low count genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# 3. Set reference level (control as baseline)
dds$condition <- relevel(dds$condition, ref = 'control')

# 4. Run DESeq pipeline
dds <- DESeq(dds)

# 5. Extract results with shrinkage (recommended)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')

# 6. View and filter results
summary(resLFC)
sig <- subset(resLFC, padj < 0.05 & abs(log2FoldChange) > 1)
```

## Design Formulas

Common patterns:

```r
# Simple two-group
design = ~ condition

# Control for batch
design = ~ batch + condition

# Paired samples
design = ~ individual + condition

# Interaction (test if treatment effect differs by genotype)
design = ~ genotype * treatment
```

**⚠️ Design must not be confounded** - ensure batches exist in both conditions.

**See [references/deseq2-reference.md#design-formulas](references/deseq2-reference.md#design-formulas) for comprehensive examples.**

## Extracting Results

### By Coefficient Name

```r
# See available coefficients
resultsNames(dds)

# Extract specific comparison
res <- results(dds, name = 'condition_treated_vs_control')
```

### By Contrast

```r
res <- results(dds, contrast = c('condition', 'treated', 'control'))
# Format: c(factor, numerator, denominator)
```

**Script:** Use [scripts/extract_results.R](scripts/extract_results.R) for standard extraction patterns.

## Log Fold Change Shrinkage

**Use shrinkage for:**
- ✅ Ranking genes by effect size
- ✅ Visualization (volcano, MA plots)
- ✅ Selecting top genes for validation

**Don't use for:**
- ❌ Hypothesis testing (use unshrunk `results()`)

```r
# apeglm (recommended)
resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
```

**Script:** [scripts/extract_results.R](scripts/extract_results.R) - `apply_lfc_shrinkage()` function

## Normalization & Transformations

```r
# Get normalized counts
normalized_counts <- counts(dds, normalized = TRUE)

# For visualization - auto-select transformation
source("scripts/transformations.R")
transformed <- transform_counts(dds, method = "auto")
# Uses vst() for >30 samples, rlog() for <30 samples
```

**Script:** [scripts/transformations.R](scripts/transformations.R) - All transformation functions

**Decision:** vst() for >30 samples (fast), rlog() for <30 samples (accurate)

## Quality Control

```r
# Run all QC checks
source("scripts/qc_plots.R")
run_all_qc(dds, res, output_dir = "qc_plots")
```

**Script:** [scripts/qc_plots.R](scripts/qc_plots.R) - Generates dispersion, PCA, and MA plots

**Key checks:**
- **Dispersion plot**: Points should cluster around fitted trend
- **PCA plot**: Samples should cluster by condition
- **MA plot**: Should be symmetric around zero

## Exporting Results

```r
# Export all standard outputs
source("scripts/export_results.R")
export_all(dds, res, res_shrunk, output_dir = "deseq2_results")
```

**Script:** [scripts/export_results.R](scripts/export_results.R) - All export functions

Exports:
- All results and shrunk LFC
- Significant genes (customizable thresholds)
- Normalized and transformed counts
- Top genes by significance

## Decision Points

### Decision 1: Transformation Method

**When:** Before creating PCA plots and heatmaps

**Options:**
- **vst()**: Use for >30 samples (fast, suitable for large datasets)
- **rlog()**: Use for <30 samples (better for small samples, slower)

**See [references/decision-guide.md#decision-point-1](references/decision-guide.md#decision-point-1-transformation-method) for detailed guidance.**

### Decision 2: LFC Shrinkage Method

**When:** Before ranking genes or creating MA/volcano plots

**Options:**
- **apeglm** (recommended): Best shrinkage, preserves large LFC
- **ashr**: Good for large datasets or when apeglm is slow
- **normal**: Legacy method, not recommended

**See [references/decision-guide.md#decision-point-2](references/decision-guide.md#decision-point-2-lfc-shrinkage-method) for detailed guidance.**

### Decision 3: Design Formula

**When:** Before creating DESeqDataSet

**Options:**
- **~ condition**: Simple design, no known batch effects
- **~ batch + condition**: Known batch effects (requires balanced design)
- **~ individual + condition**: Paired samples
- **~ genotype * treatment**: Test interactions

**Check PCA first** - if samples cluster by batch, add batch to design.

**See [references/decision-guide.md#decision-point-3](references/decision-guide.md#decision-point-3-design-formula) for detailed guidance.**

## Common Issues

| Issue | Solution | Details |
|-------|----------|---------|
| "design matrix not full rank" | Remove confounded variables or combine into single factor | [Troubleshooting](references/troubleshooting.md#error-the-model-matrix-is-not-full-rank) |
| "counts should be integers" | Use `DESeqDataSetFromTximport()` for tximport data | [Troubleshooting](references/troubleshooting.md#error-counts-matrix-should-contain-integer-values) |
| "factor levels not in colData" | Check spelling in design formula vs colData columns | [Troubleshooting](references/troubleshooting.md#error-factor-levels-not-in-coldata) |
| NA values in padj | Normal - independent filtering removes low-count genes | [Troubleshooting](references/troubleshooting.md#too-many-na-values-in-padj-column) |
| No significant genes | Check PCA for batch effects, verify reference level | [Troubleshooting](references/troubleshooting.md#no-significant-genes-found) |

**See [references/troubleshooting.md](references/troubleshooting.md) for comprehensive troubleshooting guide.**

## Best Practices

1. ✅ **Always pre-filter** low-count genes before `DESeq()`
2. ✅ **Set reference level** explicitly with `relevel()`
3. ✅ **Use padj** (not pvalue) for significance
4. ✅ **Use shrinkage** for visualization, unshrunk for testing
5. ✅ **Check QC plots** before trusting results
6. ✅ **Use vst()** for >30 samples, rlog() for <30 samples
7. ✅ **Document design formula** and report DESeq2 version

## Related Workflows

**Alternative methods (use instead of this skill):**
- **[edger](../edger/)** - Use for small samples (n=2-3) or many contrasts
- **[limma-voom](../limma-voom/)** - Use for normalized data (TPM/FPKM)

**Prerequisites (optional, run before):**
- **[batch-correction](../batch-correction/)** - Remove batch effects before DE

**Downstream (run after this skill):**
- **[de-visualization](../de-visualization/)** - Volcano plots, MA plots, heatmaps
- **[de-results](../de-results/)** - Filter and annotate significant genes
- **pathway-analysis** - GO/KEGG enrichment
- **gsea** - Gene set enrichment analysis

## References

**Detailed documentation:**
- [references/deseq2-reference.md](references/deseq2-reference.md) - Complete code patterns and examples
- [references/decision-guide.md](references/decision-guide.md) - Detailed decision-making guidance
- [references/troubleshooting.md](references/troubleshooting.md) - Comprehensive error solutions

**Scripts:**
- [scripts/basic_workflow.R](scripts/basic_workflow.R) - Complete example workflow
- [scripts/qc_plots.R](scripts/qc_plots.R) - Quality control functions
- [scripts/extract_results.R](scripts/extract_results.R) - Results extraction functions
- [scripts/export_results.R](scripts/export_results.R) - Export functions
- [scripts/transformations.R](scripts/transformations.R) - Transformation functions

**Official documentation:**
- DESeq2 Bioconductor: http://bioconductor.org/packages/DESeq2
- DESeq2 paper: Love et al. (2014) Genome Biology

**License:** LGPL (>= 3) (commercial use permitted)