# teacher-data-distillation

> Generate high-quality training data using powerful LLMs (Teacher Models) to train smaller models (Student Models). This is data-centric knowledge distillation - the teacher generates labeled data, not logits. Use this skill when the user needs to: (1) Generate NER/entity annotation data using LLM, (2) Create embedding training pairs (query-positive-negative) with LLM, (3) Generate text classification datasets, (4) Create instruction-tuning data for fine-tuning, (5) Synthesize domain-specific training corpora, (6) Augment existing datasets with LLM, (7) Quality control and filtering of generated data. Supports OpenAI GPT-4, Claude, and local LLMs as teacher models.

- Author: Ryan
- Repository: YENSTDi/agent_skills
- Version: 20260203094911
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/YENSTDi/agent_skills
- Web: https://mule.run/skillshub/@@YENSTDi/agent_skills~teacher-data-distillation:20260203094911

---

# Teacher Data Distillation

Generate high-quality training data using powerful LLMs to train smaller, deployable models.
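The idea of distilling through data rather than logits can be illustrated with a small sketch. It assumes the teacher is any callable that maps a prompt string to a JSON string. `distill_records`, `stub_teacher`, and the `text`/`label` record schema are hypothetical names chosen for illustration (they are not part of this skill's scripts), and the stub stands in for a real GPT-4 or Claude call so the example runs offline.

```python
import json
from typing import Callable, Iterable

REQUIRED_KEYS = {"text", "label"}  # assumed record schema for this sketch


def distill_records(teacher: Callable[[str], str], prompts: Iterable[str]) -> list[dict]:
    """Collect one labeled record per prompt, dropping malformed teacher outputs."""
    records = []
    for prompt in prompts:
        raw = teacher(prompt)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # teacher returned non-JSON text; skip it
        if isinstance(record, dict) and REQUIRED_KEYS <= record.keys():
            records.append(record)
    return records


def stub_teacher(prompt: str) -> str:
    """Offline stand-in for an LLM call (hypothetical); fails on purpose for one prompt."""
    if "broken" in prompt:
        return "Sorry, I cannot do that."
    return json.dumps({"text": prompt, "label": "tech_job"})


train_set = distill_records(stub_teacher, ["Python developer wanted", "broken prompt"])
print(train_set)
```

Only the parsed, schema-complete records reach the student's training file; the student never sees the teacher's probabilities, only its labeled outputs.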
## Core Philosophy

```
┌─────────────────┐    Generate Data     ┌─────────────────┐
│  Teacher Model  │ ───────────────────▶ │  Training Data  │
│ (GPT-4/Claude)  │                      │ (High Quality)  │
└─────────────────┘                      └────────┬────────┘
                                                  │
                                                  ▼ Train
                                         ┌─────────────────┐
                                         │  Student Model  │
                                         │ (BERT/Small LM) │
                                         └─────────────────┘
```

**Why this approach?**

- No need for expensive manual annotation
- The teacher's knowledge is encoded in the generated data
- The student can be deployed cheaply (small, fast, private)
- Applies to most NLP tasks that can be framed as labeled examples

## Quick Start

### Prerequisites

```bash
# --break-system-packages overrides pip's externally-managed-environment guard;
# inside a virtual environment the flag can be dropped.
pip install openai anthropic langchain tqdm --break-system-packages
```

### Environment Setup

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

## Supported Generation Tasks

| Task | Script | Output Format |
|------|--------|---------------|
| NER Annotation | `scripts/generate_ner_data.py` | JSON with tokens + BIO tags |
| Embedding Pairs | `scripts/generate_embedding_pairs.py` | Query-Positive-Negative triplets |
| Classification | `scripts/generate_classification_data.py` | Text + Label |
| Instruction Tuning | `scripts/generate_instruction_data.py` | Instruction-Input-Output |

## Task-Specific Workflows

### 1. NER Data Generation

Generate entity-annotated data for training NER models.

```python
from scripts.generate_ner_data import NERDataGenerator

generator = NERDataGenerator(
    teacher_model="gpt-4",
    entity_types=["JOB_TITLE", "SKILL", "COMPANY", "SALARY"],
    domain="HR/Job Posting",
    language="zh-TW",
)

# Generate from seed examples (kept in Traditional Chinese to match language="zh-TW")
data = generator.generate(
    num_samples=1000,
    seed_examples=[
        # "We need a software engineer who is familiar with Python"
        "我們需要一位熟悉 Python 的軟體工程師",
        # "TSMC is hiring an AI researcher, monthly salary from 150,000"
        "台積電徵求 AI 研究員,月薪 15 萬起",
    ],
)

# Save for training
generator.save("data/ner_train.json")
```

**Generation Strategy:**

1. Provide entity definitions and examples to the teacher
2. Ask the teacher to generate diverse sentences containing those entities
3. The teacher outputs structured JSON with BIO tags
4. Validate and filter the generated data

### 2. Embedding Pairs Generation

Generate training pairs for contrastive learning.

```python
from scripts.generate_embedding_pairs import EmbeddingPairGenerator

generator = EmbeddingPairGenerator(
    teacher_model="claude-3-opus",
    domain="Job Matching",
)

# Generate query-positive-negative triplets
pairs = generator.generate_triplets(
    num_samples=500,
    pair_types=[
        "job_title_synonym",  # e.g. 軟體工程師 (software engineer) ↔ Python Developer
        "job_skill_match",    # job posting ↔ related skills
        "resume_job_match",   # resume ↔ suitable job posting
    ],
)

generator.save("data/embedding_triplets.json")
```

### 3. Classification Data Generation

Generate labeled data for text classification.

```python
from scripts.generate_classification_data import ClassificationGenerator

generator = ClassificationGenerator(
    teacher_model="gpt-4",
    # Label name -> label description shown to the teacher (Traditional Chinese)
    labels={
        "tech_job": "技術類職缺",      # technical jobs
        "sales_job": "業務類職缺",     # sales jobs
        "admin_job": "行政類職缺",     # administrative jobs
        "creative_job": "創意類職缺",  # creative jobs
    },
)

data = generator.generate(
    num_per_class=200,
    style="job_posting",
)

generator.save("data/job_classification.json")
```

### 4. Instruction Data Generation

Generate instruction-following data for fine-tuning chatbots.

```python
from scripts.generate_instruction_data import InstructionGenerator

generator = InstructionGenerator(
    teacher_model="claude-3-opus",
    domain="HR Chatbot",
)

data = generator.generate(
    num_samples=1000,
    task_types=[
        "job_recommendation",  # recommend jobs based on given criteria
        "resume_feedback",     # suggestions for revising a resume
        "salary_negotiation",  # salary negotiation advice
        "interview_tips",      # interview tips
    ],
)

generator.save("data/hr_instructions.json")
```

## Prompt Engineering for Data Generation

### Key Principles

1. **Clear Task Definition**: Tell the teacher exactly what to generate
2. **Output Format Specification**: Use a JSON schema for structured output
3. **Few-shot Examples**: Provide 3-5 high-quality examples
4. **Diversity Instructions**: Explicitly ask for variety
5. **Domain Context**: Provide domain-specific knowledge

### Example: NER Generation Prompt

See `references/prompt_templates.md` for complete templates.

```
You are a data annotation expert. Generate training data for NER.

Entity Types:
- JOB_TITLE: 職稱 / job title (e.g., 軟體工程師、產品經理)
- SKILL: 技能 / skill (e.g., Python、機器學習)
- COMPANY: 公司 / company (e.g., 台積電、Google)

Output Format (JSON):
{"tokens": ["我", "在", "台積電", "工作"], "ner_tags": ["O", "O", "B-COMPANY", "O"]}

Generate 10 diverse job posting sentences with entity annotations.
Vary: sentence length, entity combinations, writing style.
```

## Quality Control

### Validation Pipeline

```python
from scripts.quality_control import QualityController

qc = QualityController()

# Validate NER data
validated_data = qc.validate_ner(
    data,
    checks=[
        "format_valid",         # JSON parses correctly
        "bio_consistency",      # BIO tags are consistent
        "entity_coverage",      # entity types are covered
        "length_distribution",  # length distribution is reasonable
    ],
)

# Deduplicate
deduplicated = qc.deduplicate(validated_data, similarity_threshold=0.9)

# Filter by confidence (only if the teacher provides confidence scores)
high_quality = qc.filter_by_confidence(deduplicated, min_confidence=0.8)
```

### Quality Metrics

| Metric | Target | Description |
|--------|--------|-------------|
| Format Validity | 100% | All outputs parse correctly |
| Entity Coverage | >90% | All entity types represented |
| Diversity Score | >0.7 | Few duplicate or near-duplicate samples |
| Length Variance | σ>10 | Sample lengths are well spread (σ = standard deviation) |

## Cost Optimization

### Batch Generation

```python
# Generate in batches to reduce the number of API calls
generator.generate_batch(
    num_samples=1000,
    samples_per_call=10,  # generate 10 samples per API call
    parallel_calls=5,     # 5 concurrent requests
)
```

### Model Selection

| Teacher Model | Quality | Cost | Speed | Best For |
|---------------|---------|------|-------|----------|
| GPT-4 | ★★★★★ | $$$$ | Slow | Complex tasks, high quality |
| GPT-4o-mini | ★★★★ | $$ | Fast | Balanced cost/quality |
| Claude 3 Opus | ★★★★★ | $$$$ | Slow | Nuanced understanding |
| Claude 3 Haiku | ★★★ | $ | Fast | Large volume generation |
| Local LLM | ★★★ | No API fees | Varies | Privacy, no API limits |

### Token Estimation

```python
from scripts.utils import estimate_cost

cost = estimate_cost(
    num_samples=1000,
    avg_tokens_per_sample=200,
    model="gpt-4",
)
print(f"Estimated cost: ${cost:.2f}")
```

## Integration with Training

After generating data, train the student model with the `huggingface-nlp-trainer` skill:

```python
# 1. Generate NER data with the teacher
from scripts.generate_ner_data import NERDataGenerator

generator = NERDataGenerator(
    teacher_model="gpt-4",
    # ... remaining arguments elided in the original
)
generator.generate(num_samples=5000)
generator.save("data/ner_train.json")

# 2. Train the student model
from huggingface_nlp_trainer.scripts.train_ner import NERTrainer

trainer = NERTrainer(
    model_name="ckiplab/bert-base-chinese-ner",
    train_file="data/ner_train.json",
)
trainer.train()
```

## Additional Resources

- **Prompt templates**: See `references/prompt_templates.md`
- **Quality control guide**: See `references/quality_control.md`
- **Distillation strategies**: See `references/distillation_strategies.md`
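
As a supplement to the Quality Control section, which names a `bio_consistency` check without spelling it out, here is a minimal sketch of what such a check might do. It assumes records use the `tokens`/`ner_tags` format from the NER prompt. `is_bio_consistent` is a hypothetical helper, not the implementation in `scripts/quality_control.py`, and it takes a strict reading of BIO in which an `I-` tag must directly follow a `B-` or `I-` tag of the same type.

```python
def is_bio_consistent(tokens: list[str], ner_tags: list[str]) -> bool:
    """Return True if tags align with tokens and every I-X continues a B-X or I-X span."""
    if len(tokens) != len(ner_tags):
        return False
    previous_type = None  # entity type of the span we are currently inside, if any
    for tag in ner_tags:
        if tag == "O":
            previous_type = None
            continue
        prefix, sep, entity_type = tag.partition("-")
        if prefix not in ("B", "I") or not sep or not entity_type:
            return False  # not of the form B-TYPE or I-TYPE
        if prefix == "I" and previous_type != entity_type:
            return False  # I- tag with no matching B-/I- tag right before it
        previous_type = entity_type
    return True


# Hypothetical records: the first is well formed, the second opens a span with I-.
good = {"tokens": ["我", "在", "台積電", "工作"], "ner_tags": ["O", "O", "B-COMPANY", "O"]}
bad = {"tokens": ["資深", "工程師"], "ner_tags": ["O", "I-JOB_TITLE"]}

print(is_bio_consistent(**good))  # True
print(is_bio_consistent(**bad))   # False
```

Records that fail the check can be dropped or sent back to the teacher for regeneration. Which is preferable depends on API cost and on how often the teacher produces malformed spans.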