# teacher-data-distillation

> Generate high-quality training data using powerful LLMs (Teacher Models) to train smaller models (Student Models).
This is data-centric knowledge distillation - the teacher generates labeled data, not logits.
Use this skill when the user needs to: (1) Generate NER/entity annotation data using LLM,
(2) Create embedding training pairs (query-positive-negative) with LLM,
(3) Generate text classification datasets, (4) Create instruction-tuning data for fine-tuning,
(5) Synthesize domain-specific training corpora, (6) Augment existing datasets with LLM,
(7) Quality control and filtering of generated data.
Supports OpenAI GPT-4, Claude, and local LLMs as teacher models.

- Author: Ryan
- Repository: YENSTDi/agent_skills
- Version: 20260203094911
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/YENSTDi/agent_skills
- Web: https://mule.run/skillshub/@@YENSTDi/agent_skills~teacher-data-distillation:20260203094911

---

---
name: teacher-data-distillation
description: |
  Generate high-quality training data using powerful LLMs (Teacher Models) to train smaller models (Student Models).
  This is data-centric knowledge distillation - the teacher generates labeled data, not logits.
  Use this skill when the user needs to: (1) Generate NER/entity annotation data using LLM,
  (2) Create embedding training pairs (query-positive-negative) with LLM,
  (3) Generate text classification datasets, (4) Create instruction-tuning data for fine-tuning,
  (5) Synthesize domain-specific training corpora, (6) Augment existing datasets with LLM,
  (7) Quality control and filtering of generated data.
  Supports OpenAI GPT-4, Claude, and local LLMs as teacher models.
---

# Teacher Data Distillation

Generate high-quality training data using powerful LLMs to train smaller, deployable models.

## Core Philosophy

```
┌─────────────────┐     Generate Data      ┌─────────────────┐
│  Teacher Model  │ ───────────────────▶   │  Training Data  │
│  (GPT-4/Claude) │                        │  (High Quality) │
└─────────────────┘                        └────────┬────────┘
                                                    │
                                                    ▼ Train
                                           ┌─────────────────┐
                                           │  Student Model  │
                                           │ (BERT/Small LM) │
                                           └─────────────────┘
```

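**Why this approach?**
- Reduces the need for expensive manual annotation
- The teacher's knowledge is encoded in the generated data
- The student can be deployed cheaply (small, fast, private)
- Applies to most NLP tasks that can be framed as labeled text (NER, retrieval, classification, instruction following)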

## Quick Start

### Prerequisites

```bash
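# --break-system-packages overrides pip's externally-managed-environment check (PEP 668).
# Inside a virtual environment the flag can be dropped.
pip install openai anthropic langchain tqdm --break-system-packages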
```

### Environment Setup

```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
```

## Supported Generation Tasks

| Task | Script | Output Format |
|------|--------|---------------|
| NER Annotation | `scripts/generate_ner_data.py` | JSON with tokens + BIO tags |
| Embedding Pairs | `scripts/generate_embedding_pairs.py` | Query-Positive-Negative triplets |
| Classification | `scripts/generate_classification_data.py` | Text + Label |
| Instruction Tuning | `scripts/generate_instruction_data.py` | Instruction-Input-Output |

## Task-Specific Workflows

### 1. NER Data Generation

Generate entity-annotated data for training NER models.

```python
from scripts.generate_ner_data import NERDataGenerator

generator = NERDataGenerator(
    teacher_model="gpt-4",
    entity_types=["JOB_TITLE", "SKILL", "COMPANY", "SALARY"],
    domain="HR/Job Posting",
    language="zh-TW",
)

# Generate from seed examples
data = generator.generate(
    num_samples=1000,
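    seed_examples=[
        "我們需要一位熟悉 Python 的軟體工程師",   # "We need a software engineer familiar with Python"
        "台積電徵求 AI 研究員，月薪 15 萬起",      # "TSMC is hiring an AI researcher, monthly salary from 150,000"
    ],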
)

# Save for training
generator.save("data/ner_train.json")
```

**Generation Strategy:**
1. Provide entity definitions and examples to the teacher
2. Ask the teacher to generate diverse sentences containing those entities
3. Have the teacher output structured JSON with BIO tags
4. Validate and filter the generated data
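The four steps above can be sketched as a small loop. This is a minimal illustration, not the implementation in `scripts/generate_ner_data.py`: `generate_ner_samples`, `call_teacher`, and `fake_teacher` are hypothetical names, and the teacher is assumed to return one JSON object per line in the `tokens` / `ner_tags` format.

```python
import json
from typing import Callable


def generate_ner_samples(
    call_teacher: Callable[[str], str],  # hypothetical: takes a prompt, returns raw text
    prompt: str,
    num_samples: int,
    max_calls: int = 50,
) -> list[dict]:
    """Query the teacher until enough well-formed records are collected."""
    samples: list[dict] = []
    for _ in range(max_calls):
        if len(samples) >= num_samples:
            break
        raw = call_teacher(prompt)  # steps 1-3: prompt in, JSON lines out
        for line in raw.splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # step 4: drop lines that do not parse
            if not isinstance(record, dict):
                continue
            tokens, tags = record.get("tokens"), record.get("ner_tags")
            # Keep only records whose tokens and tags line up one-to-one.
            if (
                isinstance(tokens, list)
                and isinstance(tags, list)
                and tokens
                and len(tokens) == len(tags)
            ):
                samples.append(record)
    return samples[:num_samples]


def fake_teacher(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return "\n".join([
        '{"tokens": ["我", "在", "台積電", "工作"], "ner_tags": ["O", "O", "B-COMPANY", "O"]}',
        '{"tokens": ["徵", "工程師"], "ner_tags": ["O"]}',  # length mismatch, rejected
        "not json at all",  # unparseable, rejected
        '{"tokens": ["熟悉", "Python"], "ner_tags": ["O", "B-SKILL"]}',
    ])


samples = generate_ner_samples(fake_teacher, "prompt text", num_samples=3)
```

Passing the teacher as a callable keeps the loop independent of any particular API client; a real `call_teacher` would wrap an OpenAI, Anthropic, or local-model request.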

### 2. Embedding Pairs Generation

Generate training pairs for contrastive learning.

```python
from scripts.generate_embedding_pairs import EmbeddingPairGenerator

generator = EmbeddingPairGenerator(
    teacher_model="claude-3-opus",
    domain="Job Matching",
)

# Generate query-positive-negative triplets
pairs = generator.generate_triplets(
    num_samples=500,
    pair_types=[
        "job_title_synonym",      # e.g., 軟體工程師 (software engineer) ↔ Python Developer
        "job_skill_match",        # job posting ↔ related skills
        "resume_job_match",       # resume ↔ suitable job postings
    ],
)

generator.save("data/embedding_triplets.json")
```
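Generated triplets are only useful for contrastive learning if the negative really differs from the query and the positive. The sketch below shows one possible sanity filter; the `Triplet` class and the `query` / `positive` / `negative` field names are assumptions based on the output format listed in the task table, not the script's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    query: str
    positive: str
    negative: str


def is_valid_triplet(t: Triplet) -> bool:
    """Reject triplets with empty fields or repeated texts."""
    texts = [t.query.strip(), t.positive.strip(), t.negative.strip()]
    if not all(texts):
        return False
    # All three texts must be distinct: a negative equal to the positive
    # (or the query) gives the student a contradictory training signal.
    return len(set(texts)) == 3


def filter_triplets(records: list[dict]) -> list[Triplet]:
    """Convert raw dicts to Triplet objects, dropping invalid ones."""
    triplets = []
    for r in records:
        try:
            t = Triplet(r["query"], r["positive"], r["negative"])
        except KeyError:
            continue  # missing field
        if is_valid_triplet(t):
            triplets.append(t)
    return triplets


raw = [
    {"query": "軟體工程師", "positive": "Python Developer", "negative": "會計師"},
    {"query": "軟體工程師", "positive": "Python Developer", "negative": "Python Developer"},
    {"query": "軟體工程師", "positive": "Python Developer"},
]
clean = filter_triplets(raw)
```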

### 3. Classification Data Generation

Generate labeled data for text classification.

```python
from scripts.generate_classification_data import ClassificationGenerator

generator = ClassificationGenerator(
    teacher_model="gpt-4",
    labels={
        "tech_job": "技術類職缺",      # technical roles
        "sales_job": "業務類職缺",     # sales roles
        "admin_job": "行政類職缺",     # administrative roles
        "creative_job": "創意類職缺",  # creative roles
    },
)

data = generator.generate(
    num_per_class=200,
    style="job_posting",
)

generator.save("data/job_classification.json")
```
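Since the request asks for a fixed number of samples per class, it may be worth checking the result for missing or unexpected labels before training. A minimal sketch, assuming each record is a dict with `text` and `label` keys (the helper names are hypothetical):

```python
from collections import Counter


def class_counts(records: list[dict], expected_labels: list[str]) -> dict[str, int]:
    """Count samples per expected label, including labels with zero samples."""
    counts = Counter(r["label"] for r in records)
    return {label: counts.get(label, 0) for label in expected_labels}


def unexpected_labels(records: list[dict], expected_labels: list[str]) -> set[str]:
    """Labels the teacher produced that are not in the label set."""
    return {r["label"] for r in records} - set(expected_labels)


records = [
    {"text": "徵 Python 後端工程師", "label": "tech_job"},
    {"text": "徵資深前端工程師", "label": "tech_job"},
    {"text": "徵業務代表", "label": "sales_job"},
    {"text": "徵門市人員", "label": "other"},
]
expected = ["tech_job", "sales_job", "admin_job"]
counts = class_counts(records, expected)
extra = unexpected_labels(records, expected)
```

A class with a zero count can be regenerated on its own, and records with unexpected labels can be dropped or relabeled.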

### 4. Instruction Data Generation

Generate instruction-following data for fine-tuning chatbots.

```python
from scripts.generate_instruction_data import InstructionGenerator

generator = InstructionGenerator(
    teacher_model="claude-3-opus",
    domain="HR Chatbot",
)

data = generator.generate(
    num_samples=1000,
    task_types=[
        "job_recommendation",     # recommend jobs that match given criteria
        "resume_feedback",        # resume revision suggestions
        "salary_negotiation",     # salary negotiation advice
        "interview_tips",         # interview tips
    ],
)

generator.save("data/hr_instructions.json")
```
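Instruction-Input-Output records usually have to be flattened into a prompt and a completion before fine-tuning. The layout below is one common convention, shown as an assumption; the section markers and the `to_training_pair` helper are illustrative and not taken from `scripts/generate_instruction_data.py`.

```python
def to_training_pair(record: dict) -> dict:
    """Flatten an instruction record into prompt/completion text."""
    instruction = record["instruction"].strip()
    context = record.get("input", "").strip()
    prompt = f"### Instruction:\n{instruction}\n\n"
    if context:  # the input section is optional
        prompt += f"### Input:\n{context}\n\n"
    prompt += "### Response:\n"
    return {"prompt": prompt, "completion": record["output"].strip()}


with_input = to_training_pair({
    "instruction": "Suggest improvements to this resume summary.",
    "input": "I am a hard worker.",
    "output": "Replace the generic claim with a concrete achievement.",
})
without_input = to_training_pair({
    "instruction": "Give one interview tip.",
    "input": "",
    "output": "Prepare a short example for each skill you list.",
})
```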

## Prompt Engineering for Data Generation

### Key Principles

1. **Clear Task Definition**: Tell the teacher exactly what to generate
2. **Output Format Specification**: Specify a JSON schema or a concrete JSON example for structured output
3. **Few-shot Examples**: Provide 3-5 high-quality examples
4. **Diversity Instructions**: Explicitly ask for variety (length, style, entity combinations)
5. **Domain Context**: Provide domain-specific knowledge
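These principles map naturally onto a prompt builder with one argument per principle. The function below is a hedged sketch: `build_generation_prompt` and its parameters are hypothetical, and the real templates live in `references/prompt_templates.md`.

```python
import json


def build_generation_prompt(
    task: str,                    # 1. clear task definition
    entity_types: dict[str, str],
    few_shot: list[dict],         # 2 + 3. format shown through examples
    vary: list[str],              # 4. diversity instructions
    domain: str,                  # 5. domain context
    num_samples: int,
) -> str:
    """Assemble a data-generation prompt from its parts."""
    lines = [f"You are a data annotation expert for the {domain} domain. {task}", ""]
    lines.append("Entity Types:")
    lines += [f"- {name}: {desc}" for name, desc in entity_types.items()]
    lines += ["", "Output Format (one JSON object per line), for example:"]
    # ensure_ascii=False keeps non-ASCII text readable in the prompt.
    lines += [json.dumps(ex, ensure_ascii=False) for ex in few_shot]
    lines += ["", f"Generate {num_samples} diverse samples."]
    lines.append("Vary: " + ", ".join(vary) + ".")
    return "\n".join(lines)


prompt = build_generation_prompt(
    task="Generate training data for NER.",
    entity_types={"COMPANY": "公司", "SKILL": "技能"},
    few_shot=[{"tokens": ["我", "在", "台積電", "工作"],
               "ner_tags": ["O", "O", "B-COMPANY", "O"]}],
    vary=["sentence length", "entity combinations", "writing style"],
    domain="HR/Job Posting",
    num_samples=10,
)
```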

### Example: NER Generation Prompt

See `references/prompt_templates.md` for complete templates.

```
You are a data annotation expert. Generate training data for NER.

Entity Types:
- JOB_TITLE: 職稱 (e.g., 軟體工程師、產品經理)
- SKILL: 技能 (e.g., Python、機器學習)
- COMPANY: 公司 (e.g., 台積電、Google)

Output Format (JSON):
{"tokens": ["我", "在", "台積電", "工作"], "ner_tags": ["O", "O", "B-COMPANY", "O"]}

Generate 10 diverse job posting sentences with entity annotations.
Vary: sentence length, entity combinations, writing style.
```
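Teacher output in this format can contain tag sequences that are not valid BIO, for example an `I-` tag that does not continue a span of the same type. A minimal checker, assuming tags follow the `B-TYPE` / `I-TYPE` / `O` scheme shown in the prompt (`bio_errors` is a hypothetical helper):

```python
def bio_errors(tags: list[str], entity_types: set[str]) -> list[str]:
    """Return a list of human-readable BIO violations (empty if valid)."""
    errors: list[str] = []
    prev = "O"
    for i, tag in enumerate(tags):
        if tag == "O":
            prev = tag
            continue
        prefix, _, etype = tag.partition("-")
        if prefix not in {"B", "I"} or etype not in entity_types:
            errors.append(f"position {i}: unknown tag {tag!r}")
            prev = "O"  # treat an unknown tag as a span break
            continue
        # An I- tag must directly follow B- or I- of the same entity type.
        if prefix == "I" and prev not in {f"B-{etype}", f"I-{etype}"}:
            errors.append(f"position {i}: {tag} does not continue a {etype} span")
        prev = tag
    return errors


types = {"JOB_TITLE", "SKILL", "COMPANY"}
ok = bio_errors(["O", "O", "B-COMPANY", "O"], types)
orphan = bio_errors(["O", "I-SKILL"], types)
mixed = bio_errors(["B-SKILL", "I-SKILL", "I-COMPANY"], types)
unknown = bio_errors(["B-SALARY"], types)
```

Records with a non-empty error list can be dropped or sent back to the teacher for correction.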

## Quality Control

### Validation Pipeline

```python
from scripts.quality_control import QualityController

qc = QualityController()

# Validate NER data
validated_data = qc.validate_ner(
    data,
    checks=[
        "format_valid",           # JSON parses correctly
        "bio_consistency",        # BIO tags are consistent
        "entity_coverage",        # entity types are covered
        "length_distribution",    # length distribution is reasonable
    ],
)

# Deduplicate
deduplicated = qc.deduplicate(validated_data, similarity_threshold=0.9)

# Filter by confidence (only applicable if the teacher returns confidence scores)
high_quality = qc.filter_by_confidence(deduplicated, min_confidence=0.8)
```
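To illustrate what a `similarity_threshold` of 0.9 can mean, here is a small near-duplicate filter built on the standard library's `difflib.SequenceMatcher`. It is a sketch under assumptions: the actual `QualityController.deduplicate` may use a different similarity measure (embeddings, n-gram overlap, hashing), and this pairwise version is quadratic, so it suits small datasets only.

```python
from difflib import SequenceMatcher


def deduplicate(texts: list[str], similarity_threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each group of near-identical texts."""
    kept: list[str] = []
    for text in texts:
        is_duplicate = any(
            SequenceMatcher(None, text, other).ratio() >= similarity_threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


texts = [
    "senior python engineer wanted",
    "senior python engineer wanted!",  # differs by one character
    "sales manager in taipei",
]
unique = deduplicate(texts)
```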

### Quality Metrics

| Metric | Target | Description |
|--------|--------|-------------|
| Format Validity | 100% | All outputs parse correctly |
| Entity Coverage | >90% | Share of target entity types that appear in the data |
| Diversity Score | >0.7 | Few duplicate or near-duplicate samples |
| Length Spread | σ>10 | Standard deviation of sample length; samples should vary in length |
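The table does not define how each metric is computed, so the sketch below picks simple definitions as assumptions: format validity is the fraction of well-formed records, entity coverage is the fraction of target entity types that appear at least once, diversity is the ratio of unique texts to all texts, and length spread is the population standard deviation of token counts. `quality_metrics` is a hypothetical helper.

```python
import statistics


def is_well_formed(record: dict) -> bool:
    tokens, tags = record.get("tokens"), record.get("ner_tags")
    return (
        isinstance(tokens, list)
        and isinstance(tags, list)
        and len(tokens) > 0
        and len(tokens) == len(tags)
    )


def quality_metrics(records: list[dict], entity_types: list[str]) -> dict[str, float]:
    valid = [r for r in records if is_well_formed(r)]
    # Entity types seen in any B-/I- tag of a valid record.
    seen = {t.split("-", 1)[1] for r in valid for t in r["ner_tags"] if "-" in t}
    texts = ["".join(r["tokens"]) for r in valid]
    lengths = [len(r["tokens"]) for r in valid]
    return {
        "format_validity": len(valid) / len(records) if records else 0.0,
        "entity_coverage": len(seen & set(entity_types)) / len(entity_types)
        if entity_types else 0.0,
        "diversity": len(set(texts)) / len(texts) if texts else 0.0,
        "length_std": statistics.pstdev(lengths) if lengths else 0.0,
    }


records = [
    {"tokens": ["a", "b"], "ner_tags": ["B-SKILL", "O"]},
    {"tokens": ["a", "b"], "ner_tags": ["B-SKILL", "O"]},  # exact duplicate
    {"tokens": ["c", "d", "e", "f"], "ner_tags": ["O", "B-COMPANY", "I-COMPANY", "O"]},
    {"tokens": ["x"], "ner_tags": []},  # malformed
]
metrics = quality_metrics(records, ["SKILL", "COMPANY", "SALARY"])
```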

## Cost Optimization

### Batch Generation

```python
# Generate in batches to reduce API calls:
# 1000 samples at 10 per call means 100 calls instead of 1000.
generator.generate_batch(
    num_samples=1000,
    samples_per_call=10,  # generate 10 samples per API call
    parallel_calls=5,     # up to 5 concurrent requests
)
```
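The interaction between `samples_per_call` and `parallel_calls` can be sketched with a thread pool. This is an illustration only: `generate_in_batches` and `fake_teacher` are hypothetical, and the real `generate_batch` may schedule requests differently.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_in_batches(
    call_teacher: Callable[[int], list[str]],  # hypothetical: n -> n samples
    num_samples: int,
    samples_per_call: int,
    parallel_calls: int,
) -> list[str]:
    """Issue ceil(num_samples / samples_per_call) calls, a few at a time."""
    num_calls = math.ceil(num_samples / samples_per_call)
    with ThreadPoolExecutor(max_workers=parallel_calls) as pool:
        batches = list(pool.map(call_teacher, [samples_per_call] * num_calls))
    samples = [s for batch in batches for s in batch]
    return samples[:num_samples]  # trim the overshoot from the last call


call_log: list[int] = []


def fake_teacher(n: int) -> list[str]:
    """Offline stand-in for an API request that returns n samples."""
    call_log.append(n)
    return [f"sample-{i}" for i in range(n)]


result = generate_in_batches(
    fake_teacher, num_samples=25, samples_per_call=10, parallel_calls=5
)
```

A production version would also need retries and rate-limit handling around each call, which this sketch leaves out.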

### Model Selection

| Teacher Model | Quality | Cost | Speed | Best For |
|---------------|---------|------|-------|----------|
| GPT-4 | ★★★★★ | $$$$ | Slow | Complex tasks, high quality |
| GPT-4o-mini | ★★★★ | $$ | Fast | Balanced cost/quality |
| Claude 3 Opus | ★★★★★ | $$$$ | Slow | Nuanced understanding |
| Claude 3 Haiku | ★★★ | $ | Fast | Large volume generation |
| Local LLM | ★★★ | No API fees | Varies | Privacy, no API limits |

### Token Estimation

```python
from scripts.utils import estimate_cost

cost = estimate_cost(
    num_samples=1000,
    avg_tokens_per_sample=200,
    model="gpt-4",
)
print(f"Estimated cost: ${cost:.2f}")
```
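A cost estimate of this kind reduces to token counts multiplied by per-token prices. The sketch below makes the arithmetic explicit; it is not the implementation of `scripts.utils.estimate_cost`, and the prices passed in are placeholders, not real rates for any provider.

```python
import math


def estimate_cost_usd(
    num_samples: int,
    avg_tokens_per_sample: int,
    prompt_tokens_per_call: int,
    samples_per_call: int,
    input_price_per_1k: float,   # placeholder, look up the current rate
    output_price_per_1k: float,  # placeholder, look up the current rate
) -> float:
    """Estimate cost as input tokens plus output tokens, priced separately."""
    num_calls = math.ceil(num_samples / samples_per_call)
    input_tokens = num_calls * prompt_tokens_per_call    # the prompt is resent on every call
    output_tokens = num_samples * avg_tokens_per_sample  # generated samples
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


cost = estimate_cost_usd(
    num_samples=1000,
    avg_tokens_per_sample=200,
    prompt_tokens_per_call=500,
    samples_per_call=10,
    input_price_per_1k=0.01,   # illustrative value only
    output_price_per_1k=0.03,  # illustrative value only
)
```

With these placeholder values, 100 calls send 50,000 prompt tokens and receive 200,000 generated tokens. Batching more samples per call lowers the input share because the prompt is sent fewer times.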

## Integration with Training

After generating data, train the student model with the `huggingface-nlp-trainer` skill:

```python
# 1. Generate NER data with teacher
from scripts.generate_ner_data import NERDataGenerator
generator = NERDataGenerator(teacher_model="gpt-4")  # plus entity_types, domain, language as in the NER example
generator.generate(num_samples=5000)
generator.save("data/ner_train.json")

# 2. Train student model
from huggingface_nlp_trainer.scripts.train_ner import NERTrainer
trainer = NERTrainer(
    model_name="ckiplab/bert-base-chinese-ner",
    train_file="data/ner_train.json",
)
trainer.train()
```
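Token-classification students are trained on integer label IDs, so the string BIO tags in the generated file have to be mapped at some point. Whether `NERTrainer` does this internally is not stated here; the sketch below shows the mapping itself, with `build_label_map` and `encode_tags` as hypothetical helpers.

```python
def build_label_map(entity_types: list[str]) -> dict[str, int]:
    """Map "O" plus B-/I- tags for each entity type to consecutive IDs."""
    labels = ["O"]
    for etype in entity_types:
        labels += [f"B-{etype}", f"I-{etype}"]
    return {label: i for i, label in enumerate(labels)}


def encode_tags(tags: list[str], label2id: dict[str, int]) -> list[int]:
    """Convert a BIO tag sequence to label IDs; raises KeyError on unknown tags."""
    return [label2id[tag] for tag in tags]


label2id = build_label_map(["COMPANY", "SKILL"])
ids = encode_tags(["O", "O", "B-COMPANY", "O"], label2id)
```

Building the map from the same `entity_types` list used for generation keeps the student's label set aligned with what the teacher was asked to produce.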

## Additional Resources

- **Prompt templates**: See `references/prompt_templates.md`
- **Quality control guide**: See `references/quality_control.md`
- **Distillation strategies**: See `references/distillation_strategies.md`