# bulk-rnaseq-counts-to-de-deseq2 > Perform differential expression analysis using DESeq2 on RNA-seq raw count data. Use when you have integer count matrices with biological replicates (n≥2 per group), need log fold change shrinkage for gene ranking, or want conservative p-value estimates. Best for medium to large sample sizes (n≥4 recommended). Creates DESeqDataSet objects, performs size factor normalization, estimates dispersions, and tests for differential expression using the Wald test or likelihood ratio test. - Author: gm-phylo - Repository: phylo-dev/workflows - Version: 20260129135929 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/phylo-dev/workflows - Web: https://mule.run/skillshub/@@phylo-dev/workflows~bulk-rnaseq-counts-to-de-deseq2:20260129135929 --- --- name: bulk-rnaseq-counts-to-de-deseq2 description: Perform differential expression analysis using DESeq2 on RNA-seq raw count data. Use when you have integer count matrices with biological replicates (n≥2 per group), need log fold change shrinkage for gene ranking, or want conservative p-value estimates. Best for medium to large sample sizes (n≥4 recommended). Creates DESeqDataSet objects, performs size factor normalization, estimates dispersions, and tests for differential expression using the Wald test or likelihood ratio test. --- # DESeq2 Differential Expression Analysis Core DESeq2 workflow for RNA-seq differential expression analysis with count data. ## When to Use This Skill Use DESeq2 when you have: - ✅ **Raw integer count data** (not normalized TPM/FPKM) - ✅ **Biological replicates** (≥2 per condition, ≥4 recommended) - ✅ Need for **log fold change shrinkage** (ranking/visualization) - ✅ **Medium to large sample sizes** (DESeq2's strength) **Don't use DESeq2 for:** - ❌ Normalized data (TPM/FPKM) → use limma-voom instead - ❌ Very small samples (n=2-3) → consider edgeR quasi-likelihood ## Installation ```r if (!require('BiocManager', quietly = TRUE)) install.packages('BiocManager') BiocManager::install(c('DESeq2', 'apeglm')) ``` **License:** LGPL (>= 3) (commercial use permitted) ## Inputs **Required:** - **Count matrix**: Raw integer counts (genes × samples) - Rows = genes (any identifier: Ensembl, symbols, etc.) - Columns = samples - Values = non-negative integers - **Sample metadata**: Data frame with sample information - Row names must match count matrix column names - Required column: `condition` (factor with 2+ levels) - Optional: batch, covariates for complex designs **Alternative inputs:** - Salmon/Kallisto output (via tximport) - SummarizedExperiment object - featureCounts/HTSeq output **See [references/deseq2-reference.md](references/deseq2-reference.md#alternative-input-formats) for code examples.** ## Outputs **Primary results:** - `deseq2_results.csv` - Full differential expression table (baseMean, log2FC, lfcSE, pvalue, padj) - `deseq2_results_shrunk.csv` - Shrunken LFC for visualization/ranking - `dds_object.rds` - DESeqDataSet for further analysis **Normalized data:** - `normalized_counts.csv` - Size-factor normalized counts - `vst_transformed.csv` / `rlog_transformed.csv` - Variance-stabilized values **QC plots (SVG, 300 DPI):** - `dispersion_plot.svg` - Dispersion estimates vs mean - `pca_plot.svg` - Principal component analysis - `ma_plot.svg` - Mean-average plot ## Clarification Questions Before starting, gather: 1. **Data Format**: - Raw count matrix (CSV/TSV: genes × samples)? - SummarizedExperiment object? - Salmon/Kallisto output? - Bioconductor data package (airway, pasilla)? - ⚠️ **Normalized data (TPM/FPKM)**: Use limma-voom instead 2. **Sample Metadata**: - What is the primary comparison (e.g., treated vs control)? - Which group is the control (reference level)? - Any covariates to adjust for (batch, sex, sequencing run)? 3. **Experimental Design** (CRITICAL): - **Simple two-group** (`~ condition`): No covariates - **Multi-factor** (`~ batch + condition`): Adjust for batch effects - **Paired samples** (`~ individual + condition`): Matched/paired - **Interaction** (`~ genotype * treatment`): Test if effect differs by group 4. **Sample Size**: - n ≥ 4 per group recommended - n = 2-3: Consider edgeR instead - n < 2: Insufficient for testing 5. **Significance Thresholds**: - Standard: padj < 0.05, |log2FC| > 1 - Relaxed: padj < 0.1 - Stringent: padj < 0.01, |log2FC| > 2 6. **Analysis Goals**: - Simple pairwise comparison? - Multiple comparisons? - Need LFC shrinkage for visualization? ## Standard Workflow ### Complete Analysis in 6 Steps ```r library(DESeq2) library(apeglm) # 1. Create DESeqDataSet dds <- DESeqDataSetFromMatrix( countData = counts, # Matrix: genes × samples (integers) colData = coldata, # Data frame: sample metadata design = ~ condition # Formula: experimental design ) # 2. Pre-filter low count genes keep <- rowSums(counts(dds)) >= 10 dds <- dds[keep,] # 3. Set reference level (control as baseline) dds$condition <- relevel(dds$condition, ref = 'control') # 4. Run DESeq pipeline dds <- DESeq(dds) # 5. Extract results with shrinkage (recommended) resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm') # 6. View and filter results summary(resLFC) sig <- subset(resLFC, padj < 0.05 & abs(log2FoldChange) > 1) ``` ## Design Formulas Common patterns: ```r # Simple two-group design = ~ condition # Control for batch design = ~ batch + condition # Paired samples design = ~ individual + condition # Interaction (test if treatment effect differs by genotype) design = ~ genotype * treatment ``` **⚠️ Design must not be confounded** - ensure batches exist in both conditions. **See [references/deseq2-reference.md#design-formulas](references/deseq2-reference.md#design-formulas) for comprehensive examples.** ## Extracting Results ### By Coefficient Name ```r # See available coefficients resultsNames(dds) # Extract specific comparison res <- results(dds, name = 'condition_treated_vs_control') ``` ### By Contrast ```r res <- results(dds, contrast = c('condition', 'treated', 'control')) # Format: c(factor, numerator, denominator) ``` **Script:** Use [scripts/extract_results.R](scripts/extract_results.R) for standard extraction patterns. ## Log Fold Change Shrinkage **Use shrinkage for:** - ✅ Ranking genes by effect size - ✅ Visualization (volcano, MA plots) - ✅ Selecting top genes for validation **Don't use for:** - ❌ Hypothesis testing (use unshrunk `results()`) ```r # apeglm (recommended) resLFC <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm') ``` **Script:** [scripts/extract_results.R](scripts/extract_results.R) - `apply_lfc_shrinkage()` function ## Normalization & Transformations ```r # Get normalized counts normalized_counts <- counts(dds, normalized = TRUE) # For visualization - auto-select transformation source("scripts/transformations.R") transformed <- transform_counts(dds, method = "auto") # Uses vst() for >30 samples, rlog() for <30 samples ``` **Script:** [scripts/transformations.R](scripts/transformations.R) - All transformation functions **Decision:** vst() for >30 samples (fast), rlog() for <30 samples (accurate) ## Quality Control ```r # Run all QC checks source("scripts/qc_plots.R") run_all_qc(dds, res, output_dir = "qc_plots") ``` **Script:** [scripts/qc_plots.R](scripts/qc_plots.R) - Generates dispersion, PCA, and MA plots **Key checks:** - **Dispersion plot**: Points should cluster around fitted trend - **PCA plot**: Samples should cluster by condition - **MA plot**: Should be symmetric around zero ## Exporting Results ```r # Export all standard outputs source("scripts/export_results.R") export_all(dds, res, res_shrunk, output_dir = "deseq2_results") ``` **Script:** [scripts/export_results.R](scripts/export_results.R) - All export functions Exports: - All results and shrunk LFC - Significant genes (customizable thresholds) - Normalized and transformed counts - Top genes by significance ## Decision Points ### Decision 1: Transformation Method **When:** Before creating PCA plots and heatmaps **Options:** - **vst()**: Use for >30 samples (fast, suitable for large datasets) - **rlog()**: Use for <30 samples (better for small samples, slower) **See [references/decision-guide.md#decision-point-1](references/decision-guide.md#decision-point-1-transformation-method) for detailed guidance.** ### Decision 2: LFC Shrinkage Method **When:** Before ranking genes or creating MA/volcano plots **Options:** - **apeglm** (recommended): Best shrinkage, preserves large LFC - **ashr**: Good for large datasets or when apeglm is slow - **normal**: Legacy method, not recommended **See [references/decision-guide.md#decision-point-2](references/decision-guide.md#decision-point-2-lfc-shrinkage-method) for detailed guidance.** ### Decision 3: Design Formula **When:** Before creating DESeqDataSet **Options:** - **~ condition**: Simple design, no known batch effects - **~ batch + condition**: Known batch effects (requires balanced design) - **~ individual + condition**: Paired samples - **~ genotype * treatment**: Test interactions **Check PCA first** - if samples cluster by batch, add batch to design. **See [references/decision-guide.md#decision-point-3](references/decision-guide.md#decision-point-3-design-formula) for detailed guidance.** ## Common Issues | Issue | Solution | Details | |-------|----------|---------| | "design matrix not full rank" | Remove confounded variables or combine into single factor | [Troubleshooting](references/troubleshooting.md#error-the-model-matrix-is-not-full-rank) | | "counts should be integers" | Use `DESeqDataSetFromTximport()` for tximport data | [Troubleshooting](references/troubleshooting.md#error-counts-matrix-should-contain-integer-values) | | "factor levels not in colData" | Check spelling in design formula vs colData columns | [Troubleshooting](references/troubleshooting.md#error-factor-levels-not-in-coldata) | | NA values in padj | Normal - independent filtering removes low-count genes | [Troubleshooting](references/troubleshooting.md#too-many-na-values-in-padj-column) | | No significant genes | Check PCA for batch effects, verify reference level | [Troubleshooting](references/troubleshooting.md#no-significant-genes-found) | **See [references/troubleshooting.md](references/troubleshooting.md) for comprehensive troubleshooting guide.** ## Best Practices 1. ✅ **Always pre-filter** low-count genes before `DESeq()` 2. ✅ **Set reference level** explicitly with `relevel()` 3. ✅ **Use padj** (not pvalue) for significance 4. ✅ **Use shrinkage** for visualization, unshrunk for testing 5. ✅ **Check QC plots** before trusting results 6. ✅ **Use vst()** for >30 samples, rlog() for <30 samples 7. ✅ **Document design formula** and report DESeq2 version ## Related Workflows **Alternative methods (use instead of this skill):** - **[edger](../edger/)** - Use for small samples (n=2-3) or many contrasts - **[limma-voom](../limma-voom/)** - Use for normalized data (TPM/FPKM) **Prerequisites (optional, run before):** - **[batch-correction](../batch-correction/)** - Remove batch effects before DE **Downstream (run after this skill):** - **[de-visualization](../de-visualization/)** - Volcano plots, MA plots, heatmaps - **[de-results](../de-results/)** - Filter and annotate significant genes - **pathway-analysis** - GO/KEGG enrichment - **gsea** - Gene set enrichment analysis ## References **Detailed documentation:** - [references/deseq2-reference.md](references/deseq2-reference.md) - Complete code patterns and examples - [references/decision-guide.md](references/decision-guide.md) - Detailed decision-making guidance - [references/troubleshooting.md](references/troubleshooting.md) - Comprehensive error solutions **Scripts:** - [scripts/basic_workflow.R](scripts/basic_workflow.R) - Complete example workflow - [scripts/qc_plots.R](scripts/qc_plots.R) - Quality control functions - [scripts/extract_results.R](scripts/extract_results.R) - Results extraction functions - [scripts/export_results.R](scripts/export_results.R) - Export functions - [scripts/transformations.R](scripts/transformations.R) - Transformation functions **Official documentation:** - DESeq2 Bioconductor: http://bioconductor.org/packages/DESeq2 - DESeq2 paper: Love et al. (2014) Genome Biology **License:** LGPL (>= 3) (commercial use permitted)