# parallel-dataset-generation

> Generate large training datasets (1000+ items) using parallel subagents with rate limit handling. Use when: (1) need to generate >1000 phrases/questions/examples, (2) Task tool hits 429 errors with concurrent agents, (3) need topic-specific variety in generated content, (4) working with multilingual datasets. Covers batch sizing (6 agents per batch), topic distribution strategy, and verification patterns.

- Author: sani
- Repository: khursanirevo/claude-config-sync
- Version: 20260205210002
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/khursanirevo/claude-config-sync
- Web: https://mule.run/skillshub/@@khursanirevo/claude-config-sync~parallel-dataset-generation:20260205210002

---

---
name: parallel-dataset-generation
description: |
  Generate large training datasets (1000+ items) using parallel subagents with rate limit handling.
  Use when: (1) need to generate >1000 phrases/questions/examples, (2) Task tool hits 429 errors
  with concurrent agents, (3) need topic-specific variety in generated content, (4) working with
  multilingual datasets. Covers batch sizing (6 agents per batch), topic distribution strategy,
  and verification patterns.
author: Claude Code
version: 1.0.0
date: 2026-01-27
---

# Parallel Dataset Generation with Rate Limit Handling

## Problem

Generating large training datasets (1000+ items) with Claude Code's Task tool often hits API rate limits (429 errors) when too many parallel agents are launched at once. Sequential generation avoids the errors but is too slow for datasets of this size.

## Context / Trigger Conditions

- Need to generate 500+ items (phrases, questions, examples, etc.)
- Task tool returns "429 High concurrency usage" error
- Dataset needs diverse topic coverage
- Manual generation would take hours
- Using Claude Code's Task tool with general-purpose subagents

## Solution

### Step 1: Plan Topic Distribution

Divide the target count by the number of topics to get the per-topic quota. The example in this document requests 115-120 items per topic; with 12 topics that request alone totals at most 1440, so ask for at least `per_topic` items when the target is a hard minimum:

```python
target = 1500                     # Total items needed
num_topics = 12                   # Number of categories
per_topic = target // num_topics  # 125 items per topic
```

### Step 2: Launch in Batches (Not All at Once)

**Critical**: Launch agents in batches of 5-6, NOT all simultaneously:

```python
# Pseudocode: each Task(...) line represents one Task tool invocation.

# ❌ WRONG - All at once causes 429 errors:
for i in range(12):
    Task(subagent_type="general-purpose", prompt=...)

# ✅ CORRECT - Batches of 6:
# Batch 1: Topics 1-6
Task(..., prompt="Topic 1")
Task(..., prompt="Topic 2")
...
Task(..., prompt="Topic 6")

# Wait for Batch 1 completion, then:
# Batch 2: Topics 7-12
Task(..., prompt="Topic 7")
...
```

**Why batches of 6?** Empirically tested: 6 agents work reliably, while 7 or more may trigger rate limits.

### Step 3: Structured Prompt Template

Give every agent a prompt with the same structure:

```
Generate 115-120 [items] as a Python list.

Distribution:
- Category A: 60%
- Category B: 30%
- Category C: 10%

Requirements:
1. [Quality criteria]
2. [Authenticity requirements]
3. [Format specifications]

Output format: Return ONLY the Python list, no explanations.
```

### Step 4: Collect and Verify

After the agents complete:

```python
import re

# Count generated items. Caveat: the pattern treats every quote character as a
# delimiter, so phrases containing an apostrophe or embedded quote are mis-split.
with open('generated_file.py') as f:
    content = f.read()

items = re.findall(r'["\']([^"\']+)["\']', content)
print(f"Generated: {len(items)} items")

# Verify all topics present
topics = ['PHRASES_TOPIC1', 'PHRASES_TOPIC2']  # ... add the remaining topic names
for topic in topics:
    if topic in content:
        print(f"✅ {topic} found")
    else:
        print(f"❌ {topic} MISSING")
```

### Step 5: Merge and Validate

Create the combined dataset with both the topic-specific lists and the legacy combined list, so code written against the old structure keeps working:

```python
# Topic-specific lists (new)
PHRASES_TOPIC1 = [...]
PHRASES_TOPIC2 = [...]

# Legacy structure (backward compatible)
PHRASES_ALL = (
    PHRASES_TOPIC1
    + PHRASES_TOPIC2
    # + ... remaining topic lists
)
```

## Verification

```bash
# Run this to verify:
python3 << 'EOF'
import re

TARGET = 1500
with open('your_file.py') as f:
    items = re.findall(r'["\']([^"\']+)["\']', f.read())

print(f"Total: {len(items)} items")
print(f"Target: {TARGET} items")
print(f"Achieved: {len(items) >= TARGET}")
EOF
```

Example output: `Total: 1570 items`, `Target: 1500 items`, `Achieved: True`

## Example: Full Workflow

**Scenario**: Generate 1500 Malay TTS phrases in 12 categories

```python
# Pseudocode: each Task(...) line represents one Task tool invocation.

# Define 12 topics
topics = [
    ("Daily Conversations", "greetings, small talk"),
    ("Food & Dining", "meals, cooking"),
    ("Family", "relationships, kinship"),
    # ... 9 more topics
]

# Batch 1: First 6 topics
Task(prompt="Generate 115-120 phrases for Daily Conversations...")
Task(prompt="Generate 115-120 phrases for Food & Dining...")
Task(prompt="Generate 115-120 phrases for Family...")
Task(prompt="Generate 115-120 phrases for Work...")
Task(prompt="Generate 115-120 phrases for Education...")
Task(prompt="Generate 115-120 phrases for Shopping...")

# Wait for completion, check results

# Batch 2: Last 6 topics
Task(prompt="Generate 115-120 phrases for Travel...")
# ... remaining 5 topics

# Merge all files into final dataset
```

**Result**: 1569 phrases generated (69 above the 1500 target)

## Notes

**Batch Size Tuning**:
- Safe: 5-6 agents per batch
- Risky: 7-10 agents (may hit 429)
- Dangerous: 12+ agents (almost guaranteed 429)

**Handling Interruptions**:
- If agents get interrupted, relaunch them individually
- Generated content is preserved in agent outputs
- Extract it from the displayed outputs using the regex pattern below

**Quality vs Speed**:
- Batches of 6 = ~5-10 minutes per batch
- Sequential (1 at a time) = ~30 minutes for the same 6 topics
- Parallel batches = at least ~3x faster with the same quality

**Regex for Quote-Agnostic Extraction**:

```python
# Captures both single and double quoted strings
r'["\']([^"\']+)["\']'
```

## Common Pitfalls

❌ **Launching all agents at once** → 429 rate limit errors
✅ **Launch in batches of 5-6** → Reliable execution

❌ **Vague prompts** → Inconsistent quality/style
✅ **Structured prompts with distribution** → Consistent output

❌ **Counting only one quote style** → Miscounted phrases
✅ **Use quote-agnostic regex** → Accurate counts

❌ **Shipping only topic-specific lists** → Breaks existing code that uses the combined list
✅ **Dual structure (topics + legacy)** → Backward compatible
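
**Stricter Counting (sketch)**: The regex used in Step 4 and in Verification treats every quote character as a delimiter, so a phrase that contains an apostrophe is extracted incorrectly. As a hedged alternative, and assuming the generated file is valid Python with top-level `PHRASES_*` list assignments as in Step 5, the standard-library `ast` module can count items per topic. The helper name `count_list_items` and the sample phrases are illustrative and not part of the original workflow.

```python
import ast


def count_list_items(source: str, prefix: str = "PHRASES_") -> dict:
    """Return {variable_name: number_of_string_items} for top-level list literals."""
    counts = {}
    for node in ast.parse(source).body:
        # Only plain assignments of a list literal, e.g. NAME = ["a", "b"].
        # Expressions such as PHRASES_ALL = A + B are skipped on purpose.
        if not isinstance(node, ast.Assign) or not isinstance(node.value, ast.List):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id.startswith(prefix):
                # Count only string constants inside the list.
                counts[target.id] = sum(
                    isinstance(elt, ast.Constant) and isinstance(elt.value, str)
                    for elt in node.value.elts
                )
    return counts


# Illustrative sample standing in for a generated file (hypothetical data).
sample = '''
PHRASES_TOPIC1 = ["Selamat pagi", 'Apa khabar?', "It's late"]
PHRASES_TOPIC2 = ["Terima kasih"]
PHRASES_ALL = PHRASES_TOPIC1 + PHRASES_TOPIC2
'''

counts = count_list_items(sample)
total = sum(counts.values())
print(counts)
print(f"Total: {total} items")
```

Comparing `total` with the target, and checking that every expected topic name is a key in `counts`, covers both checks from Step 4 without relying on quote matching.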