# huggingface-nlp-trainer

> Train and fine-tune NLP models using Hugging Face Transformers for Chinese/Traditional Chinese tasks. Use this skill when the user needs to: (1) Train or fine-tune NER (Named Entity Recognition) models, (2) Train or fine-tune embedding models for semantic search/similarity, (3) Train word segmentation models, (4) Fine-tune BERT/RoBERTa for text classification, (5) Prepare datasets in proper format for NLP training, (6) Evaluate NLP model performance, or (7) Export models to ONNX/production formats. Supports Traditional Chinese with CKIP Lab models and multilingual embeddings with BGE models.

- Author: Ryan
- Repository: YENSTDi/agent_skills
- Version: 20260203094911
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/YENSTDi/agent_skills
- Web: https://mule.run/skillshub/@@YENSTDi/agent_skills~huggingface-nlp-trainer:20260203094911

---

# Hugging Face NLP Trainer

Train and fine-tune NLP models for production use, optimized for Traditional Chinese and Taiwan-specific tasks.

## Quick Start

### Prerequisites

```bash
# Create a virtual environment (recommended)
python -m venv .venv && source .venv/bin/activate

# Or use uv for faster installs
# uv venv && source .venv/bin/activate

pip install transformers datasets accelerate evaluate seqeval sentence-transformers

# Optional: LoRA fine-tuning support
pip install peft

# Optional: ONNX export and quantization
pip install optimum onnxruntime
```

## Supported Tasks

| Task | Recommended Base Model | Script |
|------|------------------------|--------|
| NER | `ckiplab/bert-base-chinese-ner` | `scripts/train_ner.py` |
| Word Segmentation | `ckiplab/bert-base-chinese-ws` | `scripts/train_word_segmentation.py` |
| Embedding | `BAAI/bge-large-zh-v1.5` or `BAAI/bge-m3` | `scripts/train_embedding.py` |
| Text Classification | `ckiplab/bert-base-chinese` | `scripts/train_classifier.py` |

## Task-Specific Workflows

### 1. NER Training

**Dataset format**: See `references/dataset_formats.md` for BIO/BIOES tagging format.

```python
from scripts.train_ner import NERTrainer

trainer = NERTrainer(
    model_name="ckiplab/bert-base-chinese-ner",
    train_file="data/train.json",
    eval_file="data/eval.json",
    label_list=["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"],
    output_dir="./ner_model"
)

trainer.train(epochs=3, batch_size=16, learning_rate=2e-5)
trainer.evaluate()
```

**Custom entity types for HR domain**:

- `JOB_TITLE`: 職稱 (e.g., 軟體工程師、產品經理)
- `SKILL`: 技能 (e.g., Python、機器學習)
- `COMPANY`: 公司名稱
- `EDUCATION`: 學歷 (e.g., 台大資工所)
- `CERT`: 證照 (e.g., AWS SAA、PMP)

### 2. Embedding Model Training

**Two training approaches**:

1. **Contrastive Learning** (recommended for semantic search):

   ```python
   from scripts.train_embedding import EmbeddingTrainer

   trainer = EmbeddingTrainer(
       model_name="BAAI/bge-large-zh-v1.5",
       train_file="data/pairs.json",  # {"query": "...", "positive": "...", "negative": "..."}
       output_dir="./embedding_model"
   )

   trainer.train_contrastive(epochs=3, batch_size=32)
   ```

2. **Fine-tune for specific domain**:

   ```python
   trainer.train_domain_adaptation(
       corpus_file="data/job_descriptions.txt",
       epochs=1
   )
   ```

### 3. Word Segmentation

```python
from scripts.train_word_segmentation import WSTrainer

trainer = WSTrainer(
    model_name="ckiplab/bert-base-chinese-ws",
    train_file="data/ws_train.txt",  # Format: 我 愛 台灣
    output_dir="./ws_model"
)

trainer.train(epochs=5)
```

## Model Selection Guide

### For Traditional Chinese (繁體中文)

| Use Case | Model | Size | Notes |
|----------|-------|------|-------|
| General NER | `ckiplab/bert-base-chinese-ner` | 400MB | Best for Traditional Chinese |
| Fast NER | `ckiplab/bert-tiny-chinese-ner` | 50MB | 4x faster, slight accuracy drop |
| Word Seg | `ckiplab/bert-base-chinese-ws` | 400MB | CKIP standard |
| POS Tagging | `ckiplab/bert-base-chinese-pos` | 400MB | Part-of-speech |

### For Embeddings

| Use Case | Model | Dim | Notes |
|----------|-------|-----|-------|
| Chinese only | `BAAI/bge-large-zh-v1.5` | 1024 | Best Chinese performance |
| Multilingual | `BAAI/bge-m3` | 1024 | Supports 100+ languages |
| Fast/Small | `BAAI/bge-small-zh-v1.5` | 512 | 6x faster |
| Instruction-tuned | `intfloat/multilingual-e5-large-instruct` | 1024 | Instruction-following embeddings |
| Latest multilingual | `jinaai/jina-embeddings-v3` | 1024 | Task-specific LoRA heads |
| Reranking | `BAAI/bge-reranker-v2-m3` | - | Two-stage retrieval |

## Training Best Practices

### Hardware Requirements

| Task | Min GPU VRAM | Recommended |
|------|--------------|-------------|
| NER/WS (bert-base) | 8GB | 16GB |
| NER/WS (bert-tiny) | 4GB | 8GB |
| Embedding (bge-large) | 16GB | 24GB |
| Embedding (bge-small) | 8GB | 16GB |

### Hyperparameters

```python
# NER/Token Classification
default_ner_config = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 3,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_length": 512
}

# Embedding Contrastive Learning
default_embedding_config = {
    "learning_rate": 1e-5,
    "batch_size": 32,
    "epochs": 3,
    "temperature": 0.05,
    "max_length": 256
}
```

### LoRA Fine-Tuning

Use LoRA (Low-Rank Adaptation) to fine-tune large models with significantly less GPU memory:

```python
from scripts.train_ner import NERTrainer, NERConfig

config = NERConfig(
    model_name="ckiplab/bert-base-chinese-ner",
    use_lora=True,
    lora_r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)

trainer = NERTrainer(
    model_name=config.model_name,
    train_file="data/train.json",
    output_dir="./ner_lora_model",
    config=config,
)

trainer.train()
```

LoRA benefits:

- ~70% less GPU memory usage
- Faster training with fewer trainable parameters
- Supported for NER, Word Segmentation, and Text Classification tasks
- Requires: `pip install peft`

### Data Augmentation

For NER with limited data:

```python
from scripts.utils import augment_ner_data

augmented = augment_ner_data(
    train_data,
    techniques=["synonym_replace", "random_swap", "entity_replace"]
)
```

## Evaluation Metrics

| Task | Primary Metric | Secondary |
|------|----------------|-----------|
| NER | F1 (entity-level) | Precision, Recall |
| Word Seg | F1 (token-level) | Accuracy |
| Embedding | NDCG@10, MRR | Recall@K |
| Classification | Accuracy, Macro-F1 | Confusion Matrix |

## Export for Production

### ONNX Export

```python
from scripts.export_model import export_to_onnx

export_to_onnx(
    model_path="./ner_model",
    output_path="./ner_model.onnx",
    opset_version=14
)
```

### Quantization

```python
from scripts.export_model import quantize_model

quantize_model(
    model_path="./ner_model.onnx",
    output_path="./ner_model_int8.onnx",
    quantization_type="dynamic"  # or "static"
)
```

## Additional Resources

- **Dataset formats**: See `references/dataset_formats.md`
- **Chinese NLP models catalog**: See `references/chinese_nlp_models.md`
- **Training configurations**: See `references/training_configs.md`