# blockchain-data-collection-validation

> Empirical validation workflow for blockchain data collection pipelines before production implementation. Use when validating data sources, testing DuckDB integration, building POC collectors, or verifying complete fetch-to-storage pipelines for blockchain data.

- Author: terrylica
- Repository: terrylica/gapless-network-data
- Version: 20251211121504
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/terrylica/gapless-network-data
- Web: https://mule.run/skillshub/@@terrylica/gapless-network-data~blockchain-data-collection-validation:20251211121504

---

---
name: blockchain-data-collection-validation
description: Empirical validation workflow for blockchain data collection pipelines before production implementation. Use when validating data sources, testing DuckDB integration, building POC collectors, or verifying complete fetch-to-storage pipelines for blockchain data.
---

# Blockchain Data Collection Validation

## Overview

This skill provides a systematic, test-driven workflow for validating blockchain data collection pipelines before production implementation. Use it when building POC collectors, validating new data sources, testing DuckDB integration, or verifying complete fetch-to-storage workflows.

**Key principle**: Validate every component empirically before implementation, covering connectivity, schema, rate limits, storage, and the complete pipeline.

## Validation Workflow

This skill follows a 5-step empirical validation workflow:

| Step                | Purpose                      | Output                          | Success Criteria                       |
| ------------------- | ---------------------------- | ------------------------------- | -------------------------------------- |
| **1. Connectivity** | Test basic RPC access        | Block fetch confirmed           | Response <500ms, no errors             |
| **2. Schema**       | Validate all required fields | Field validation report         | All fields present, types correct      |
| **3. Rate Limits**  | Find sustainable RPS         | Empirical rate (e.g., 5.79 RPS) | 100% success over 50+ blocks           |
| **4. Pipeline**     | Test fetch→DuckDB flow       | Complete pipeline working       | Data persisted, constraints pass       |
| **5. Decision**     | Document findings            | Go/No-Go recommendation         | All steps passed, timeline calculated  |

**Detailed workflow**: See `references/validation-workflow.md` for the complete step-by-step guide with code templates, testing patterns, and success criteria for each step.

**Quick start**: Create `01_single_block_fetch.py` from the template in `scripts/`, then iterate through steps 2-5.

## DuckDB Integration Patterns

**Critical patterns for data integrity**:

- CHECKPOINT requirement (crash-tested, prevents data loss)
- Batch INSERT from DataFrame (124K blocks/sec performance)
- CHECK constraints for schema validation
- Storage estimates (76-100 bytes/block, empirically validated)

A minimal sketch combining these patterns appears after the Common Pitfalls section below.

**Full guide**: See `references/duckdb-patterns.md` for the complete DuckDB integration guide with code examples, crash test results, and performance benchmarks.

## Common Pitfalls

**Critical mistakes to avoid**: skipping empirical rate validation, testing fewer than 50 blocks, forgetting DuckDB CHECKPOINT (data loss), ignoring CHECK constraints, and parallel fetching on free tiers.

**Real-world examples**: LlamaRPC documents 50 RPS but sustains only 1.37 RPS (2.7% of the documented maximum); a parallel fetch that worked for 20 blocks failed at 50.

**Full guide**: See `references/common-pitfalls.md` for detailed anti-patterns in problem/reality/solution format, with code examples.
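Pulling these pieces together, the sketch below shows what a Step 4 fetch→DuckDB pipeline can look like while avoiding the pitfalls above: sequential rate-limited fetching, CHECK constraints enforced at insert time, batch INSERT from a DataFrame, and a final CHECKPOINT. This is a minimal illustration, not the skill's template: the endpoint URL, target rate, block range, and column set are placeholder assumptions, and it presumes `requests`, `pandas`, and `duckdb` are installed. See `scripts/poc_complete_pipeline.py` for the actual template.

```python
"""Minimal Step 4 sketch: rate-limited fetch -> DuckDB, with CHECK
constraints and a final CHECKPOINT. All endpoint/rate/range values
below are illustrative placeholders, not validated figures."""
import time

import duckdb
import pandas as pd
import requests

RPC_URL = "https://eth.llamarpc.com"       # placeholder endpoint
TARGET_RPS = 1.0                           # use the Step 3 empirical rate here
START_BLOCK, NUM_BLOCKS = 19_000_000, 50   # >=50 blocks per the Step 3 criteria


def fetch_block(number: int) -> dict:
    """Fetch one block header via eth_getBlockByNumber (no transaction bodies)."""
    payload = {"jsonrpc": "2.0", "method": "eth_getBlockByNumber",
               "params": [hex(number), False], "id": 1}
    resp = requests.post(RPC_URL, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]


# Sequential, rate-limited fetch: parallel fetching on free tiers is the
# documented pitfall, so hold a fixed per-request time budget instead.
rows = []
for n in range(START_BLOCK, START_BLOCK + NUM_BLOCKS):
    started = time.monotonic()
    block = fetch_block(n)
    rows.append({
        "number": int(block["number"], 16),
        "timestamp": int(block["timestamp"], 16),
        "gas_used": int(block["gasUsed"], 16),
    })
    time.sleep(max(0.0, 1.0 / TARGET_RPS - (time.monotonic() - started)))

con = duckdb.connect("blocks.duckdb")
# CHECK constraints reject malformed rows at insert time rather than
# letting bad data surface later at query time.
con.execute("""
    CREATE TABLE IF NOT EXISTS blocks (
        number    BIGINT PRIMARY KEY CHECK (number >= 0),
        timestamp BIGINT NOT NULL CHECK (timestamp > 0),
        gas_used  BIGINT NOT NULL CHECK (gas_used >= 0)
    )
""")
# Batch INSERT from a DataFrame via DuckDB's replacement scan (the SQL
# references the local variable `df` by name).
df = pd.DataFrame(rows)
con.execute("INSERT INTO blocks SELECT number, timestamp, gas_used FROM df")
# CHECKPOINT flushes the write-ahead log into the database file;
# skipping it risks losing the batch if the process crashes.
con.execute("CHECKPOINT")
con.close()
print(f"Persisted {len(rows)} blocks to blocks.duckdb")
```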
## Scripts

POC template scripts for empirical validation:

- `poc_single_block.py` - Connectivity and schema validation (Steps 1-2)
- `poc_batch_parallel_fetch.py` - Parallel fetch testing (Step 3, expect failures)
- `poc_rate_limited_fetch.py` - Rate-limited sequential fetch (Step 3, find sustainable rate)
- `poc_complete_pipeline.py` - Complete fetch→DuckDB pipeline (Step 4)

**Templates and usage**: See `scripts/README.md` for complete code templates, usage examples, and the testing progression guide.

## References

### Workflow Documentation

- `references/validation-workflow.md` - Complete 5-step workflow with detailed guidance, code examples, and success criteria
- `references/common-pitfalls.md` - Anti-patterns to avoid, in problem/reality/solution format
- `references/example-workflow.md` - Complete case study: validating Alchemy for Ethereum collection

### Technical Patterns

- `references/duckdb-patterns.md` - DuckDB integration patterns (CHECKPOINT, batch INSERT, constraints, performance)
- `references/ethereum-collector-poc-findings.md` - Ethereum collector POC case study with rate limit discovery

### Scripts

- `scripts/README.md` - Complete script templates and testing progression guide
- `scripts/poc_single_block.py` - Connectivity and schema validation template
- `scripts/poc_batch_parallel_fetch.py` - Parallel fetch testing template
- `scripts/poc_rate_limited_fetch.py` - Rate-limited fetch template
- `scripts/poc_complete_pipeline.py` - Complete pipeline template

## Example Workflow

**Case study**: Validating Alchemy for Ethereum collection → ✅ GO at 5.79 RPS sustained (26 days for 13M blocks, HIGH confidence). The timeline arithmetic behind the 26-day figure is sketched at the end of this document.

**Full walkthrough**: See `references/example-workflow.md` for the complete step-by-step case study showing all 5 validation steps with actual test results and the final decision.

## When to Use This Skill

Invoke this skill when:

- Validating a new blockchain RPC provider before implementation
- Testing DuckDB integration for blockchain data
- Building a POC collector for a new blockchain
- Verifying a complete fetch-to-storage pipeline
- Investigating data quality issues
- Planning a production collector implementation
- Needing empirical validation before committing to an architecture

## Related Patterns

This skill pairs well with:

- `blockchain-rpc-provider-research` - For comparing multiple providers before validation
- Project scratch investigations in `scratch/ethereum-collector-poc/` and `scratch/duckdb-batch-validation/`
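To make the Step 5 timeline arithmetic from the case study concrete, here is a minimal worked sketch. The function name is illustrative (it is not one of the skill's scripts); the 13M-block and 5.79 RPS inputs are the case-study figures quoted above.

```python
def collection_timeline_days(total_blocks: int, sustainable_rps: float) -> float:
    """Estimate wall-clock days to fetch total_blocks at a sustained rate,
    assuming one RPC request per block."""
    seconds = total_blocks / sustainable_rps
    return seconds / 86_400  # 86,400 seconds per day

# Case-study figures: 13M Ethereum blocks at the empirically validated 5.79 RPS.
print(f"{collection_timeline_days(13_000_000, 5.79):.1f} days")  # -> 26.0 days
```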