# huggingface-nlp-trainer

> Train and fine-tune NLP models using Hugging Face Transformers for Chinese/Traditional Chinese tasks.
Use this skill when the user needs to: (1) Train or fine-tune NER (Named Entity Recognition) models,
(2) Train or fine-tune embedding models for semantic search/similarity, (3) Train word segmentation models,
(4) Fine-tune BERT/RoBERTa for text classification, (5) Prepare datasets in proper format for NLP training,
(6) Evaluate NLP model performance, or (7) Export models to ONNX/production formats.
Supports Traditional Chinese with CKIP Lab models and multilingual embeddings with BGE models.

- Author: Ryan
- Repository: YENSTDi/agent_skills
- Version: 20260203094911
- Last Updated: 2026-02-07
- Source: https://github.com/YENSTDi/agent_skills
- Web: https://mule.run/skillshub/@@YENSTDi/agent_skills~huggingface-nlp-trainer:20260203094911

---

---
name: huggingface-nlp-trainer
description: |
  Train and fine-tune NLP models using Hugging Face Transformers for Chinese/Traditional Chinese tasks.
  Use this skill when the user needs to: (1) Train or fine-tune NER (Named Entity Recognition) models,
  (2) Train or fine-tune embedding models for semantic search/similarity, (3) Train word segmentation models,
  (4) Fine-tune BERT/RoBERTa for text classification, (5) Prepare datasets in proper format for NLP training,
  (6) Evaluate NLP model performance, or (7) Export models to ONNX/production formats.
  Supports Traditional Chinese with CKIP Lab models and multilingual embeddings with BGE models.
---

# Hugging Face NLP Trainer

Train and fine-tune NLP models for production use, optimized for Traditional Chinese and Taiwan-specific tasks.

## Quick Start

### Prerequisites

```bash
# Create a virtual environment (recommended)
python -m venv .venv && source .venv/bin/activate

# Or use uv for faster installs
# uv venv && source .venv/bin/activate

pip install transformers datasets accelerate evaluate seqeval sentence-transformers

# Optional: LoRA fine-tuning support
pip install peft

# Optional: ONNX export and quantization
pip install optimum onnxruntime
```

## Supported Tasks

| Task | Recommended Base Model | Script |
|------|------------------------|--------|
| NER | `ckiplab/bert-base-chinese-ner` | `scripts/train_ner.py` |
| Word Segmentation | `ckiplab/bert-base-chinese-ws` | `scripts/train_word_segmentation.py` |
| Embedding | `BAAI/bge-large-zh-v1.5` or `BAAI/bge-m3` | `scripts/train_embedding.py` |
| Text Classification | `ckiplab/bert-base-chinese` | `scripts/train_classifier.py` |

## Task-Specific Workflows

### 1. NER Training

**Dataset format**: See `references/dataset_formats.md` for the BIO/BIOES tagging format.

```python
from scripts.train_ner import NERTrainer

trainer = NERTrainer(
    model_name="ckiplab/bert-base-chinese-ner",
    train_file="data/train.json",
    eval_file="data/eval.json",
    label_list=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    output_dir="./ner_model"
)
trainer.train(epochs=3, batch_size=16, learning_rate=2e-5)
trainer.evaluate()
```

**Custom entity types for HR domain**:
- `JOB_TITLE`: 職稱 (e.g., 軟體工程師、產品經理)
- `SKILL`: 技能 (e.g., Python、機器學習)
- `COMPANY`: 公司名稱
- `EDUCATION`: 學歷 (e.g., 台大資工所)
- `CERT`: 證照 (e.g., AWS SAA、PMP)
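
To use custom entity types like the ones above, each type needs a `B-` and an `I-` tag in `label_list`. The following is a minimal sketch of that expansion under the BIO scheme; `build_bio_labels` and the `label2id`/`id2label` names are illustrative helpers, not part of the skill's scripts.

```python
def build_bio_labels(entity_types):
    """Expand entity types into a BIO label list: "O" first, then B-/I- pairs."""
    labels = ["O"]
    for entity in entity_types:
        labels.append(f"B-{entity}")
        labels.append(f"I-{entity}")
    return labels


# Entity types taken from the HR-domain list above
hr_entity_types = ["JOB_TITLE", "SKILL", "COMPANY", "EDUCATION", "CERT"]
hr_label_list = build_bio_labels(hr_entity_types)

# Integer mappings of the kind token-classification models usually need
label2id = {label: i for i, label in enumerate(hr_label_list)}
id2label = {i: label for label, i in label2id.items()}

print(hr_label_list)  # 11 labels: "O" plus a B-/I- pair for each of the 5 types
```

The resulting `hr_label_list` could then be passed as `label_list` in place of the PER/ORG/LOC list shown earlier.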

### 2. Embedding Model Training

**Two training approaches**:

1. **Contrastive Learning** (recommended for semantic search):
```python
from scripts.train_embedding import EmbeddingTrainer

trainer = EmbeddingTrainer(
    model_name="BAAI/bge-large-zh-v1.5",
    train_file="data/pairs.json",  # {"query": "...", "positive": "...", "negative": "..."}
    output_dir="./embedding_model"
)
trainer.train_contrastive(epochs=3, batch_size=32)
```

2. **Fine-tune for a specific domain**:
```python
# Reuses the `trainer` instance created in the contrastive learning example
trainer.train_domain_adaptation(
    corpus_file="data/job_descriptions.txt",
    epochs=1
)
```
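
The loss that `train_contrastive` optimizes is not shown in this document. A common objective for query/positive/negative triples is InfoNCE, which is sketched below in pure Python for readability. The function names are illustrative assumptions, and real training would operate on batched tensors rather than Python lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE: cross-entropy where the positive is the correct 'class'."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


q = [1.0, 0.0]
# Positive close to the query, negative far away: loss near zero
good = info_nce_loss(q, positive=[0.9, 0.1], negatives=[[0.0, 1.0]])
# Roles swapped: the "positive" is far from the query, so the loss is large
bad = info_nce_loss(q, positive=[0.0, 1.0], negatives=[[0.9, 0.1]])
print(good < bad)  # True
```

A lower temperature sharpens the softmax, so hard negatives that score close to the positive are penalized more strongly.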

### 3. Word Segmentation

```python
from scripts.train_word_segmentation import WSTrainer

trainer = WSTrainer(
    model_name="ckiplab/bert-base-chinese-ws",
    train_file="data/ws_train.txt",  # Space-separated words, e.g. 我 愛 台灣
    output_dir="./ws_model"
)
trainer.train(epochs=5)
```
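
Word segmentation is usually trained as character-level token classification, so space-separated text has to be converted into per-character tags first. The sketch below assumes a two-label B/I scheme (B = first character of a word, I = continuation); the helper names are hypothetical and the actual preprocessing in `scripts/train_word_segmentation.py` may differ.

```python
def words_to_bi_tags(line):
    """Convert a space-separated sentence into characters and B/I tags."""
    chars, tags = [], []
    for word in line.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            tags.append("B" if i == 0 else "I")
    return chars, tags


def bi_tags_to_words(chars, tags):
    """Inverse operation: rebuild words from characters and B/I tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


chars, tags = words_to_bi_tags("我 愛 台灣")
print(chars)  # ['我', '愛', '台', '灣']
print(tags)   # ['B', 'B', 'B', 'I']
print(bi_tags_to_words(chars, tags))  # ['我', '愛', '台灣']
```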

## Model Selection Guide

### For Traditional Chinese (繁體中文)

| Use Case | Model | Size | Notes |
|----------|-------|------|-------|
| General NER | `ckiplab/bert-base-chinese-ner` | 400MB | Best for Traditional Chinese |
| Fast NER | `ckiplab/bert-tiny-chinese-ner` | 50MB | 4x faster, slight accuracy drop |
| Word Seg | `ckiplab/bert-base-chinese-ws` | 400MB | CKIP standard |
| POS Tagging | `ckiplab/bert-base-chinese-pos` | 400MB | Part-of-speech |

### For Embeddings

| Use Case | Model | Dim | Notes |
|----------|-------|-----|-------|
| Chinese only | `BAAI/bge-large-zh-v1.5` | 1024 | Best Chinese performance |
| Multilingual | `BAAI/bge-m3` | 1024 | Supports 100+ languages |
| Fast/Small | `BAAI/bge-small-zh-v1.5` | 512 | 6x faster |
| Instruction-tuned | `intfloat/multilingual-e5-large-instruct` | 1024 | Instruction-following embeddings |
| Latest multilingual | `jinaai/jina-embeddings-v3` | 1024 | Task-specific LoRA heads |
| Reranking | `BAAI/bge-reranker-v2-m3` | - | Two-stage retrieval |

## Training Best Practices

### Hardware Requirements

| Task | Min GPU VRAM | Recommended |
|------|--------------|-------------|
| NER/WS (bert-base) | 8GB | 16GB |
| NER/WS (bert-tiny) | 4GB | 8GB |
| Embedding (bge-large) | 16GB | 24GB |
| Embedding (bge-small) | 8GB | 16GB |

### Hyperparameters

```python
# NER/Token Classification
default_ner_config = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 3,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_length": 512
}

# Embedding Contrastive Learning
default_embedding_config = {
    "learning_rate": 1e-5,
    "batch_size": 32,
    "epochs": 3,
    "temperature": 0.05,
    "max_length": 256
}
```
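
To make `warmup_ratio` concrete, the sketch below computes how many optimizer steps a run takes and how many of them are warmup, then evaluates a linear warmup/linear decay schedule. It assumes a single device, no gradient accumulation, and a hypothetical dataset of 10,000 examples; the function names are illustrative.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio):
    """Return (total optimizer steps, warmup steps) for a simple training run."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


def linear_warmup_lr(step, peak_lr, total_steps, warmup_steps):
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


# NER defaults from the config above, with an assumed dataset size
total_steps, warmup_steps = schedule_summary(
    num_examples=10_000, batch_size=16, epochs=3, warmup_ratio=0.1
)
print(total_steps, warmup_steps)  # 1875 188

halfway_lr = linear_warmup_lr(94, 2e-5, total_steps, warmup_steps)
final_lr = linear_warmup_lr(total_steps, 2e-5, total_steps, warmup_steps)
```

With these numbers the learning rate reaches its 2e-5 peak after 188 steps, so very small datasets may spend only a handful of steps in warmup.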

### LoRA Fine-Tuning

Use LoRA (Low-Rank Adaptation) to fine-tune large models with significantly less GPU memory:

```python
from scripts.train_ner import NERTrainer, NERConfig

config = NERConfig(
    model_name="ckiplab/bert-base-chinese-ner",
    use_lora=True,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)

trainer = NERTrainer(
    model_name=config.model_name,
    train_file="data/train.json",
    output_dir="./ner_lora_model",
    config=config,
)
trainer.train()
```

LoRA benefits:
- Lower GPU memory usage (roughly 70% less; the exact saving depends on the model, batch size, and sequence length)
- Faster training with fewer trainable parameters
- Supported for NER, Word Segmentation, and Text Classification tasks
- Requires: `pip install peft`
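
The "fewer trainable parameters" point can be checked with simple arithmetic. For a frozen weight matrix of shape `d_out x d_in`, LoRA trains two small matrices of rank `r`, adding `r * (d_in + d_out)` parameters. The sketch below assumes a BERT-base shape (12 layers, hidden size 768) and that only the query and value projections are adapted; which modules the skill's scripts actually target is not stated here, and biases and the classification head are ignored.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


hidden_size = 768        # BERT-base hidden dimension
num_layers = 12          # BERT-base encoder layers
adapted_per_layer = 2    # assumption: query and value projections only
r, alpha = 8, 16         # lora_r and lora_alpha from the config above

lora_params = num_layers * adapted_per_layer * lora_param_count(hidden_size, hidden_size, r)
frozen_params = num_layers * adapted_per_layer * hidden_size * hidden_size
scaling = alpha / r      # factor applied to the low-rank update

print(lora_params)                  # 294912
print(frozen_params)                # 14155776
print(lora_params / frozen_params)  # about 0.0208, i.e. 1/48 of the adapted weights
print(scaling)                      # 2.0
```

Under these assumptions the adapters hold about 2% as many parameters as the matrices they modify, which is where most of the optimizer-state savings come from.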

### Data Augmentation

For NER with limited data:
```python
from scripts.utils import augment_ner_data

# train_data is assumed to be the already-loaded list of training examples
augmented = augment_ner_data(
    train_data,
    techniques=["synonym_replace", "random_swap", "entity_replace"]
)
```
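
As an illustration of what an entity-replacement technique involves, the sketch below swaps each entity span for another mention of the same type and regenerates the BIO tags so they stay aligned. It is not the implementation of `augment_ner_data`; the function name, the character-level tokens, and the replacement dictionary are all assumptions made for this example.

```python
import random


def entity_replace(tokens, tags, replacements, rng):
    """Replace each BIO entity span with a same-type mention and re-tag it."""
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            entity_type = tag[2:]
            # Find the end of the current entity span
            j = i + 1
            while j < len(tokens) and tags[j] == f"I-{entity_type}":
                j += 1
            candidates = replacements.get(entity_type)
            # Keep the original mention if no replacement is available
            mention = rng.choice(candidates) if candidates else tokens[i:j]
            new_tokens.extend(mention)
            new_tags.extend(
                [f"B-{entity_type}"] + [f"I-{entity_type}"] * (len(mention) - 1)
            )
            i = j
        else:
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


tokens = list("王小明在台積電工作")
tags = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O"]
replacements = {"PER": ["陳怡君"], "ORG": ["鴻海"]}  # hypothetical mention pool

new_tokens, new_tags = entity_replace(tokens, tags, replacements, random.Random(0))
print("".join(new_tokens))  # 陳怡君在鴻海工作
print(new_tags)
```

Because replacement mentions can differ in length from the originals, the tags are rebuilt per span rather than copied.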

## Evaluation Metrics

| Task | Primary Metric | Secondary |
|------|----------------|-----------|
| NER | F1 (entity-level) | Precision, Recall |
| Word Seg | F1 (token-level) | Accuracy |
| Embedding | NDCG@10, MRR | Recall@K |
| Classification | Accuracy, Macro-F1 | Confusion Matrix |
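
Entity-level F1 counts a prediction as correct only when the entity type and both span boundaries match, which makes it stricter than per-token accuracy. The sketch below shows the idea for a single BIO-tagged sentence; the helper names are illustrative, stray `I-` tags without a preceding `B-` are ignored for simplicity, and in practice a library such as `seqeval` would normally be used instead.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO sequence."""
    spans = set()
    start, current = None, None
    # The trailing "O" sentinel closes a span that runs to the end
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == current
        if not continues:
            if current is not None:
                spans.add((current, start, i))
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            else:
                start, current = None, None
    return spans


def entity_f1(gold_tags, pred_tags):
    """Precision, recall, and F1 over exact-match entity spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]  # PER span cut short

precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Here five of six tokens are tagged correctly, yet the truncated PER span counts as a full miss, so F1 is 0.5. For a whole evaluation set, the span counts would be summed across sentences before computing the ratios.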

## Export for Production

### ONNX Export

```python
from scripts.export_model import export_to_onnx

export_to_onnx(
    model_path="./ner_model",
    output_path="./ner_model.onnx",
    opset_version=14
)
```

### Quantization

```python
from scripts.export_model import quantize_model

quantize_model(
    model_path="./ner_model.onnx",
    output_path="./ner_model_int8.onnx",
    quantization_type="dynamic"  # or "static"
)
```
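
To show why INT8 quantization shrinks a model at a small accuracy cost, the sketch below applies a simplified symmetric scheme to a handful of weights: floats are mapped to integers in [-127, 127] with a single scale factor, then mapped back. This is an illustration of the arithmetic only, not the exact algorithm used by ONNX Runtime or by `quantize_model`, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # [50, -127, 0, 100]
print(max_error)  # at most half the scale (about 0.005 here)
```

Each weight is stored in one byte instead of four, and the rounding error is bounded by half the scale. Dynamic quantization computes activation ranges at inference time, while static quantization fixes them in advance using calibration data.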

## Additional Resources

- **Dataset formats**: See `references/dataset_formats.md`
- **Chinese NLP models catalog**: See `references/chinese_nlp_models.md`
- **Training configurations**: See `references/training_configs.md`