# principal-data-engineer

> Expert-level guidance on data platform architecture, pipeline design patterns, and engineering rigor. Use when designing data platforms, reviewing Airflow DAGs, working with Polars/DuckDB/dbt, establishing data quality contracts, implementing composable data stacks, or architecting lakehouse solutions with Iceberg.

- Author: rory-data
- Repository: rory-data/copilot
- Version: 20260201180025
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/rory-data/copilot
- Web: https://mule.run/skillshub/@@rory-data/copilot~principal-data-engineer:20260201180025

---

---
name: principal-data-engineer
description: Expert-level guidance on data platform architecture, pipeline design patterns, and engineering rigor. Use when designing data platforms, reviewing Airflow DAGs, working with Polars/DuckDB/dbt, establishing data quality contracts, implementing composable data stacks, or architecting lakehouse solutions with Iceberg.
license: Proprietary. See parent repository LICENSE
---

# Principal Data Engineer

## Overview

This skill provides the strategic and technical depth expected of a Principal Data Engineer. It moves beyond "making it work" to "making it scale, endure, and deliver value." Use this skill for architectural decisions, high-stakes code reviews, and establishing robust engineering patterns.

## Core Capabilities

### 1. Data Platform Architecture

Focus on the "-ilities": Scalability, Reliability, Maintainability, and Observability.

- **Design for Failure**: Assume every component will fail. Build retries, dead-letter queues, and circuit breakers.
- **Idempotency**: All pipelines must be re-runnable without side effects.
- **Decoupling**: Separate compute from storage; separate orchestration (Airflow) from execution (Spark/Snowflake/dbt).
- **Cost Awareness**: Design schemas and compute usage (e.g., partition strategies) to minimize cost at scale.

### 2. Pipeline Engineering Standards

Enforce strict standards for Airflow and Python code.

- **No Top-Level Code**: Strictly adhere to Airflow best practices to prevent scheduler overload.
- **Idempotency**: All DAG tasks must be re-runnable without side effects or data duplication.
- **Atomic Tasks**: Each task should do one thing. If it fails, it should be clear what failed.
- **Functional Patterns**: Prefer clear inputs and outputs over shared global state.
- **TaskFlow API**: Use Airflow 2.0+ decorator syntax (`@dag`, `@task`) for clarity and type safety.
- **Testing**:
  - **Unit**: Test transform logic in isolation.
  - **Integration**: Test DAG integrity and component connectivity with `DagBag`.
  - **Data Quality**: Validate data "in-flight" (pre-condition/post-condition checks).

**Critical Anti-Patterns to Avoid**:

- ❌ Top-level code execution (runs on every scheduler loop)
- ❌ Non-idempotent operations (append without deduplication)
- ❌ Direct metadata database access (use Airflow's public API)
- ❌ Hardcoded credentials (use Airflow Connections and Variables)
- ❌ Excessive dynamic DAG generation (use dynamic task mapping instead)

**See [airflow-best-practices.md](references/airflow-best-practices.md) for comprehensive patterns and examples.**

### 3. Data Quality & Observability

Quality is not an afterthought; it is a pipeline dependency.

- **Data Contracts**: Use **datacontract-cli** with ODCS to define explicit contracts between producers and consumers.
- **Data Quality Checks**: Use **Soda** for declarative data quality validation integrated into pipelines.
- **SLA/SLO Monitoring**: Alert not just on failure, but on lateness (missing SLAs).
- **Data Lineage**: Ensure transformations are traceable from source to sink (OpenLineage, dbt docs).

### 4. Composable Data Stack

Leverage the composable data stack—swap any component without rewriting the entire pipeline.

- **Ingestion**: Prefer `dlt` for robust, schema-aware ELT. Use `dlt init <source> <destination>` to scaffold pipelines.
- **Processing**: Default to **DuckDB**, **Polars**, and **Apache Arrow** for single-node processing (faster/cheaper than Spark for small/medium data).
- **Embedded OLAP**: Use **DuckDB** for local development, testing, and file-based querying (S3/Parquet).
- **Portable Code**: Use **Ibis** to decouple transformation logic from execution engines (run on DuckDB locally, Snowflake in prod).
- **Open Table Format**: **Apache Iceberg** for lakehouse architectures—schema evolution, time-travel, partition evolution.
- **Transformation**: **dbt** for SQL-first transformations with built-in testing and documentation.
- **Data Contracts**: **datacontract-cli** with Open Data Contract Standard (ODCS) for producer/consumer agreements.

## Usage Guidelines

### When to use

- **Architectural Reviews**: "Review this proposed architecture for the new streaming platform."
- **Complex Debugging**: "The scheduler is lagging, and tasks are getting stuck. Help diagnose."
- **Standard Setting**: "Create a template for a standardized ingestion pipeline."

### Key Questions to Ask

- "Is this pipeline idempotent? What happens if I run it twice?"
- "How do we backfill historical data with this design?"
- "What is the recovery time objective (RTO) for this dataset?"

## Resources

### references/

- **[architecture-patterns.md](references/architecture-patterns.md)**: Common patterns for batch and streaming architectures.
- **[data-quality-checklist.md](references/data-quality-checklist.md)**: A checklist for ensuring data reliability.
- **[apache-arrow.md](references/apache-arrow.md)**: Arrow as the data spine—zero-copy, polyglot interoperability, ADBC, Flight, Parquet.
- **[single-node-vs-spark.md](references/single-node-vs-spark.md)**: When to use Polars/DuckDB vs Apache Spark. Decision trees and practical guidance.
- **[composable-data-stack.md](references/composable-data-stack.md)**: Guidance on dlt, Polars, DuckDB, and Ibis.

### scripts/

- **[validate_dag_integrity.py](scripts/validate_dag_integrity.py)**: Utility to check for common Airflow anti-patterns (e.g., top-level code).