# nf-pipeline-to-galaxy-workflow > Convert a complete Nextflow pipeline to Galaxy - Author: Anton Nekrutenko - Repository: galaxyproject/skills - Version: 20260204091402 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-07 - Source: https://github.com/galaxyproject/skills - Web: https://mule.run/skillshub/@@galaxyproject/skills~nf-pipeline-to-galaxy-workflow:20260204091402 --- --- name: nf-pipeline-to-galaxy-workflow description: Convert a complete Nextflow pipeline to Galaxy --- # Nextflow Pipeline to Galaxy Workflow ## When to Use Use this skill when: - Converting a complete Nextflow pipeline (e.g., nf-core pipeline) - Creating a full Galaxy solution with multiple workflows - Planning a large-scale conversion project **Don't use this skill if**: - Converting a single process (use `nf-process-to-galaxy-tool` instead) - Converting a single subworkflow (use `nf-subworkflow-to-galaxy-workflow` instead) --- ## Step-by-Step Process ### Step 1: Clarify Workflow Scope with User **REQUIRED before starting conversion** - ask the user: 1. **What parts of the Nextflow pipeline to convert?** - Full pipeline end-to-end? - Specific subworkflow only? - Exclude certain optional steps? - Example: "Do you want preprocessing + alignment + variant calling + annotation, or just variant calling?" 2. **Quality control and reporting requirements:** - Include QC steps (FastQC, etc.)? - Include aggregated reporting (MultiQC)? - Include intermediate QC checks? 3. **Best-practice scientific / bioinformatics sanity check (REQUIRED):** - If the requested scope has conceptual issues (e.g., missing required preprocessing steps like sort/index, omitting essential QC, incompatible inputs/reference/annotation, or a too-small subset that would violate common analysis best practices), **flag the issue**. - Ask the user whether they want to: - Expand/adjust the scope to address the best-practice issue(s), or - Proceed with the user’s requested scope as-is (explicitly confirmed). 4. **Workflow metadata:** - **Workflow name** (descriptive, user-facing) - **Author/Creator name(s)** - **License** (e.g., MIT, Apache-2.0, GPL-3.0) - **Description/Annotation** - **Tags** for categorization **NEVER use placeholder values** - always get real information from the user. **NEVER assume scope** - explicitly confirm what to include/exclude. ### Step 2: Analyze Pipeline Structure Read the main Nextflow file and all imported workflows/subworkflows. **Document**: - All processes used - Data flow between processes - Conditionals and branches - Input/output patterns ### Step 3: Check Tool Availability and Requirements (CRITICAL) **For each process**, verify a corresponding Galaxy tool exists: - **Search tools-iuc repository** and verify tool XML exists - **Read the tool XML** to confirm: - Exact tool ID and current version - Actual input/output parameter names (for connections) - Conditional parameter structures - Whether tool requires pre-built databases/indices - **NEVER assume a tool exists** without verification - **NEVER use placeholder or non-existent tools** - If a tool doesn't exist, inform user and discuss alternatives **CRITICAL: Verify tool owner/repository** (COMMON ERROR): - Tools can exist under **multiple owners** (e.g., `devteam`, `iuc`, `bgruening`) - **Check the actual ToolShed** at https://toolshed.g2.bx.psu.edu/ to confirm correct owner - Example: `freebayes` is under `devteam`, NOT `iuc` - Example: `samtools_sort` is under `iuc`, NOT `devteam` - **Tool ID format**: `toolshed.g2.bx.psu.edu/repos/{owner}/{repo}/{tool}/{version}` - Wrong owner = "tool not available" error even if tool exists - **Always verify owner** before using tool in workflow **Check for tool dependencies:** - Does the tool need a database built first? (e.g., SnpEff needs database, not just GTF) - Does the tool need sorted/indexed input? (e.g., variant callers need sorted BAM + index) - Are there helper tools needed? (e.g., samtools sort/index after alignment) Use the **`check-tool-availability.md`** reference. Prefer a **splitter approach** when the Nextflow workflow: - Has multiple “modes” controlled by flags (`--mode`, `--skip_*`, etc.) - Has large optional branches triggered by optional inputs - Exposes a very large parameter surface (many knobs that overwhelm a Galaxy form) - Produces materially different outputs depending on configuration **Splitter pattern**: - Create 2–N intent-focused Galaxy workflows with simpler inputs - Keep an “advanced” variant only when needed - Minimize workflow conditionals; prefer separate workflows when branches are substantial **Deliverable**: your plan should explicitly state whether you are producing: - One Galaxy workflow - Multiple related Galaxy workflows (recommended for complex pipelines) - Optional “meta-workflow” that chains subworkflows (only if it improves usability) **Example** (CAPHEINE): ``` CAPHEINE Pipeline: ├── PIPELINE_INITIALISATION (setup) ├── CAPHEINE (main analysis) │ ├── PROCESS_VIRAL_NONRECOMBINANT (preprocessing subworkflow) │ ├── HYPHY_ANALYSES (analyses subworkflow) │ └── DRHIP (aggregation) └── PIPELINE_COMPLETION (reporting) ``` **CAPHEINE splitter example (foreground branch)**: CAPHEINE has an optional branch controlled by providing a foreground regexp/list. In Galaxy, it may be clearer to publish two workflows: - **Workflow A: CAPHEINE (no foreground)** - Inputs: reference genes, unaligned sequences - Runs the baseline HyPhy analyses - Outputs: baseline result set - **Workflow B: CAPHEINE (with foreground)** - Inputs: reference genes, unaligned sequences, foreground regexp/list - Runs additional foreground-dependent HyPhy analyses - Outputs: baseline + additional foreground-dependent results This keeps each workflow form smaller and avoids large conditional sections. ### Step 2: Check Tool Availability for All Processes **Use**: `../check-tool-availability.md` and `../scripts/check_tool.sh` Create comprehensive tool inventory: ```bash cd ../ # List all unique tools used in pipeline for tool in tool1 tool2 tool3; do ./scripts/check_tool.sh $tool done ``` **Document all findings** in a table: ``` | Process | Tool | Status | Action | |---------|------|--------|--------| | PROCESS_A | tool_a | ✅ Exists | Use existing | | PROCESS_B | tool_b | ❌ Missing | Create tool | ``` ### Step 3: Create Conversion Plan **Plan must include**: 1. **Tool inventory** (from Step 2) - Existing tools and their locations - Missing tools that need creation 2. **Tool creation strategy** (for missing tools) - Which should go in tools-iuc? - Which should be custom? - Ask user about tools-iuc access 3. **Workflow structure** - Single workflow or multiple related workflows? - If multiple: how do they differ by intent/inputs/outputs? - Nested subworkflows? - Where do you intentionally split to reduce conditionals/knobs? 4. **Implementation order** - Which tools to create first - Which workflows to build first - Testing strategy **Present plan to user and wait for approval before implementing.** **See**: `../tool-sources.md` for tool placement decisions ### Step 4: Create Missing Tools **For each missing tool**, use the **`nf-process-to-galaxy-tool` skill**. **Wait for all tools to be created and validated before continuing.** ### Step 5: Build Workflows **For each subworkflow**, use the **`nf-subworkflow-to-galaxy-workflow` skill**. **Caveats (important for most pipelines)**: - **UUIDs must be in proper UUID4 format**: - Galaxy validates that all `uuid` fields are valid UUID4 strings (e.g., `550e8400-e29b-41d4-a716-446655440000`). - **UUIDs must be unique across the entire workflow JSON**: - Galaxy rejects workflows if any `uuid` is duplicated (common failure: copying the workflow `uuid` into a step `uuid`, or reusing the same step UUID across multiple steps). - Check uniqueness for: - Workflow-level `uuid` - Every `steps..uuid` - Every `steps..workflow_outputs[*].uuid` - Validate before import (e.g., parse JSON and ensure the set of UUIDs has no duplicates). - **Interpret Galaxy warnings correctly**: The user may provide a list of warnings produced by Galaxy for a workflow they try to import. - **Benign (often OK)**: - Warnings for optional advanced parameters (e.g. “Disable grouping… Using default False”). - These usually just indicate the workflow JSON omitted optional parameters and Galaxy filled defaults. - **Usually indicates a real workflow bug**: - Warnings where the missing value is a *dataset input* (e.g. “BAM file … default ''”, “VCF Data … default ''”, “GFF dataset … default ''”). - This typically means: - `input_connections` uses the wrong parameter name, **or** - a conditional selector in `tool_state` was not set, so Galaxy expects a different input branch. - **Environment mismatch (action required)**: - “Tool is not installed” → the target Galaxy instance doesn’t have that exact `tool_id`. - “Using version X instead of version Y specified” → Galaxy substituted an installed version. - This is often OK, but you must re-check parameter names/outputs against the installed tool. - **These two mistakes are extremely common — preempt them**: - **Tool ID / owner / version mismatch**: - The workflow may reference a real tool, but under the wrong owner or an uninstalled version. - Symptoms: “Tool is not installed”, “tool not available”, or silent version substitution. - Mitigation: verify ToolShed owner + inspect the tool XML for the exact `tool_id` + version. - **Conditional branch / input key mismatch**: - The tool may have the correct upstream dataset available, but Galaxy still warns that the dataset input is empty. - Symptoms: “No value found for 'BAM file'… default ''” / “No value found for 'GFF dataset'… default ''”. - Mitigation: set the correct selector in `tool_state` *and* use the exact conditional parameter path in `input_connections`. - **Tool forms often require setting conditional selectors in `tool_state`**: - Many tools expose inputs only after a selector is set (e.g. “use cached genome vs history dataset”, “single vs collection”, “region mode on/off”). - Even if you wire `input_connections` correctly, Galaxy may still treat the input as missing unless the selector branch is chosen in `tool_state`. - **Validation checkpoint (recommended)**: - After generating the `.ga`, import into the target Galaxy and scan for: - Any “Tool is not installed” messages (fix tool IDs/owners/versions) - Any dataset-input warnings defaulting to empty (fix `input_connections` key names and/or conditional selectors) - Any version substitutions that might change parameter/output names - If any dataset-input warnings remain, treat the workflow as **not runnable** until resolved. - **Iterate using Galaxy’s import warnings report**: - Do not assume the user will proactively provide these warnings. - If your first draft doesn’t import cleanly, ask the user to: - Import the `.ga` into their Galaxy instance - Copy/paste the resulting import warnings report - Use that report to quickly identify which failures are: - Missing tools (instance mismatch) - Wrong tool IDs/owners/versions - Wrong `input_connections` keys - Missing conditional selectors in `tool_state` - **Tool existence and repository owner must be verified**: - Tools can exist under multiple owners; wrong owner produces “tool not available” errors. - Confirm tool IDs via ToolShed URLs and/or tool XML in the authoritative repository. - **Tool existence must be verified** (CRITICAL): - **NEVER reference a tool without verifying it exists** in tools-iuc or target repository. - Check the actual tool XML file to confirm tool ID and versions. - **NEVER use placeholder or assumed tool names**. - If a tool doesn't exist, inform user and discuss alternatives - **Tool semantics must be validated (tool may exist but still be wrong)** (CRITICAL): - Do not stop at “the tool exists” — confirm it performs the intended transformation. - If semantics are unclear, **ask the user** what behavior is expected before drafting the step. - Example (CAPHEINE): `seqkit_split2` exists, but it splits into parts/chunks; for “one FASTA record → one dataset in a collection”, use an appropriate splitter tool (e.g. ToolShed `rnateam/splitfasta` / `rbc_splitfasta`). - **Tool input/output connections are REQUIRED** (CRITICAL): - **Every tool step must have `input_connections`** that wire it to upstream steps (except workflow inputs). - **Tools without connections will not receive data and will fail at runtime**. - Each connection must specify: upstream step `id` + exact `output_name` from that step. - **Read the actual tool XML** to get exact input parameter names - do not guess. - Parameter names often use conditional paths (e.g., `reference_cond|reference_history` not `reference`). - Input names vary by tool (e.g., CAwlign uses `fasta` not `query`). - Output names must match tool definitions (e.g., `labeled_tree` not `output`). - **Incorrect connections cause execution failures** even if import succeeds. - Explicitly tell user which connections need verification. - **Dataset collections vs individual datasets**: - Paired FASTQ inputs should use `data_collection_input` type with `collection_type: "paired"`. - Tools that process collections need collection-aware parameter names (check tool XML). - Some tools iterate over collections, others process them as a unit. - Collection outputs have `type: "input"` to preserve collection structure. - **Tool dependencies and helper steps**: - Some tools require pre-built databases (e.g., SnpEff needs database built from GTF, not raw GTF). - Some tools require sorted/indexed input (e.g., variant callers need sorted BAM + BAI index). - Add helper tools as needed (e.g., samtools sort/index after alignment). - Check Nextflow workflow for these preprocessing steps. - **Tool versions in drafted `.ga` files are often placeholders**: - If you have access to the target Galaxy instance (UI or API), **resolve each step's `tool_id`/`tool_version` to what is actually installed** (tool revisions and `+galaxyN` suffixes differ). - If you cannot check the instance, use the most recent version of each tool and treat any `tool_id`/`tool_version` you emit as a **placeholder** and explicitly tell the user they must verify/adjust versions against their Galaxy. **Recommended structure**: - Break pipeline into logical workflow units - Create subworkflows first - Compose into main workflow(s) **Example** (CAPHEINE): ``` Workflow 1: CAPHEINE_Preprocessing - Input: Raw sequences - Output: Alignment + Tree Workflow 2: CAPHEINE_Analyses - Input: Alignment + Tree - Output: Analysis results + DRHIP report ``` ### Step 6: Integration Testing Use the canonical testing docs: - Tool testing (Planemo): `../../tool-dev/shared/testing.md` - Workflow testing/validation (Galaxy instance): `../../galaxy-integration/galaxy-integration.md` `../testing-and-validation.md` is a short routing page that links to these. At minimum: 1. Upload pipeline test data 2. Run workflows in sequence 3. Compare outputs to Nextflow results (structural/semantic match) 4. Document differences ### Step 7: Documentation Create documentation for users: - Which workflows to run in what order - Input data requirements - Expected outputs - Tool versions used - Differences from Nextflow version (if any) --- ## Quick Reference **Nextflow pipeline = Multiple Galaxy workflows + tools** **Decomposition strategy**: 1. Identify logical workflow boundaries 2. Check all tool availability 3. Create missing tools (use `nf-process-to-galaxy-tool`) 4. Build subworkflows (use `nf-subworkflow-to-galaxy-workflow`) 5. Compose and test --- ## Resources All detailed guides are in parent directory (`../`): - `check-tool-availability.md` - Tool availability checking - `tool-sources.md` - Where to create tools - `workflow-to-ga.md` - Workflow creation guide - `nextflow-galaxy-terminology.md` - Conceptual mappings - `testing-and-validation.md` - Routing page to canonical testing docs **Related skills**: - `nf-process-to-galaxy-tool` - For creating missing tools - `nf-subworkflow-to-galaxy-workflow` - For building workflow components --- ## Complete Example: CAPHEINE Pipeline See `../examples/capheine-mapping.md` for full CAPHEINE conversion. **Key findings**: - ✅ 15/15 tools exist in tools-iuc (100% coverage) - ✅ No tool creation needed - ✅ Conversion is purely workflow assembly - Recommended: 2 workflows (preprocessing + analyses) **This demonstrates the ideal case**: Most established pipelines use common tools that already exist in Galaxy.