# using-spacy-nlp

> Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.

- Author: Rick Hightower
- Repository: SpillwaveSolutions/spacy-nlp-agentic-skill
- Version: 20260107161947
- Stars: 1
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/SpillwaveSolutions/spacy-nlp-agentic-skill
- Web: https://mule.run/skillshub/@@SpillwaveSolutions/spacy-nlp-agentic-skill~using-spacy-nlp:20260107161947

---

---
name: using-spacy-nlp
description: Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
---

# spaCy NLP

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.

## Contents

- [Scope](#scope)
- [Quick Start](#quick-start)
- [Installation](#installation)
- [Text Processing](#text-processing)
- [Training Classifiers](#training-classifiers)
- [Troubleshooting](#troubleshooting)
- [Production Deployment](#production-deployment)
- [Scripts Reference](#scripts-reference)
- [Assets Reference](#assets-reference)

---

## Scope

**In Scope:**
- spaCy 3.x installation and text processing
- TextCategorizer training for document classification
- Production deployment and optimization patterns

**Out of Scope (use other tools/skills):**
- Training custom NER models (different workflow)
- spaCy 2.x (deprecated, incompatible with 3.x)
- Rule-based matching (EntityRuler, Matcher, PhraseMatcher)
- Custom tokenizers or language models

---

## Quick Start

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
```

---

## Installation

### Standard Setup

```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```

### Model Selection

| Model | Size (approx.) | Speed | Use Case |
|-------|------|-------|----------|
| `en_core_web_sm` | 12 MB | Fastest | Prototyping, speed-critical |
| `en_core_web_md` | 40 MB | Fast | General use with word vectors |
| `en_core_web_lg` | 560 MB | Fast | Semantic similarity tasks |
| `en_core_web_trf` | 438 MB | Slow | Maximum accuracy (GPU recommended) |

### Verify Installation

```python
import spacy
print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
```

**For detailed installation options** (conda, GPU, transformers): See [references/installation.md](references/installation.md)

---

## Text Processing

### Basic Pipeline

```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")
```

### Named Entity Recognition

```python
doc = nlp("Apple Inc. was co-founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple Inc. ORG, Steve Jobs PERSON
```

**For entity types, filtering, and span details**: See [references/basic-usage.md](references/basic-usage.md#named-entity-recognition)

### Batch Processing (Critical for Production)

```python
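# Slow: calling nlp() once per text skips batching
for text in texts:
    doc = nlp(text)

# Fast: nlp.pipe() streams the texts through the pipeline in batches
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)  # process() stands in for your own handling code

# With multiprocessing (on Windows/macOS, run under `if __name__ == "__main__":`)
docs = list(nlp.pipe(texts, n_process=4))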
```
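
When each text comes with metadata (an ID, a source, a timestamp), `nlp.pipe` can carry it alongside the text via `as_tuples=True`. The following is a minimal sketch: the `records` list and its `id` field are hypothetical, and a blank English pipeline (tokenizer only) is used so the example runs without a model download.

```python
import spacy

# Tokenizer-only pipeline; swap in spacy.load("en_core_web_sm") for real annotations.
nlp = spacy.blank("en")

# Hypothetical input: (text, metadata) pairs.
records = [
    ("Quarterly revenue exceeded expectations", {"id": 1}),
    ("Kubernetes deployment manifest updated", {"id": 2}),
]

results = []
# as_tuples=True makes nlp.pipe yield (doc, metadata) pairs in input order.
for doc, meta in nlp.pipe(records, as_tuples=True, batch_size=50):
    results.append({"id": meta["id"], "n_tokens": len(doc)})

print(results)
```

This keeps the batching benefit of `nlp.pipe` without a separate lookup to match each `Doc` back to its source record.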

### Disable Unused Components

```python
# Only NER needed: disable the other components (the speedup depends on the model and text)
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
```
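
Components can also be switched off temporarily instead of at load time. The sketch below uses `nlp.select_pipes` as a context manager, with a blank pipeline and a `sentencizer` standing in for a full model so that it runs without a download; with `en_core_web_sm` the names would be components such as `parser` or `lemmatizer`.

```python
import spacy

# Stand-in pipeline: tokenizer plus a rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
all_pipes = list(nlp.pipe_names)

# Inside the block the listed components are skipped; they are restored on exit.
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc = nlp("One sentence. Another one.")

active_after = list(nlp.pipe_names)
print(all_pipes, active_inside, active_after)
```

This is useful when one code path needs only part of the pipeline while another needs all of it, since the model is loaded once.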

**For Doc/Token/Span details, noun chunks, similarity**: See [references/basic-usage.md](references/basic-usage.md)

---

## Training Classifiers

Train custom text classifiers with TextCategorizer.

### Workflow Overview

1. **Prepare data** → Run `scripts/prepare_training_data.py`
2. **Generate config** → Run `scripts/generate_config.py` or use `assets/config_textcat.cfg`
3. **Validate** → `python -m spacy debug data config.cfg` (catches issues before training)
4. **Train** → `python -m spacy train config.cfg --output ./output`
5. **Evaluate** → Run `scripts/evaluate_model.py`
6. **Use** → `nlp = spacy.load("./output/model-best")`

### Data Format

Training data uses spaCy's DocBin format. Example input (JSON):

```json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
```

Convert with script:

```bash
python scripts/prepare_training_data.py \
  --input data.json \
  --output-train train.spacy \
  --output-dev dev.spacy \
  --split 0.8
```
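
For orientation, here is a minimal sketch of the kind of conversion such a script performs. It is not the actual implementation of `scripts/prepare_training_data.py`; the function name `records_to_docbins` and its `seed` parameter are assumptions for illustration. Each `Doc` gets a `cats` dictionary with a score of 1.0 for its label and 0.0 for every other label.

```python
import random

import spacy
from spacy.tokens import DocBin


def records_to_docbins(records, split=0.8, seed=0):
    """Shuffle records and return (train DocBin, dev DocBin, sorted labels)."""
    nlp = spacy.blank("en")  # tokenizer only; no trained components needed
    labels = sorted({r["label"] for r in records})
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split)

    bins = []
    for part in (shuffled[:cut], shuffled[cut:]):
        doc_bin = DocBin()
        for record in part:
            doc = nlp.make_doc(record["text"])
            # One-hot category scores for exclusive classification.
            doc.cats = {label: float(label == record["label"]) for label in labels}
            doc_bin.add(doc)
        bins.append(doc_bin)
    return bins[0], bins[1], labels


records = [
    {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
    {"text": "Fixed null pointer exception in parser", "label": "Programming"},
    {"text": "Kubernetes deployment manifest updated", "label": "DevOps"},
]
train_bin, dev_bin, labels = records_to_docbins(records, split=0.8)
print(len(train_bin), len(dev_bin), labels)
```

Calling `train_bin.to_disk("train.spacy")` and `dev_bin.to_disk("dev.spacy")` would then write the files that the training config points to.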

### Training Command

```bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train (add --paths.train train.spacy --paths.dev dev.spacy if the config leaves the paths unset)
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
```

### Using Trained Model

```python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")
predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # e.g. DevOps: 94.2%
```
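
Taking the highest score always yields a label, even when the model is unsure. One possible refinement, sketched below, is to fall back to a placeholder when the top score is below a threshold. The helper name `top_category`, the 0.5 threshold, and the `"UNKNOWN"` fallback are assumptions for illustration, not part of spaCy.

```python
def top_category(cats, threshold=0.5, fallback="UNKNOWN"):
    """Return (label, score); use the fallback label when the top score is below threshold."""
    if not cats:
        return fallback, 0.0
    label = max(cats, key=cats.get)
    score = cats[label]
    if score >= threshold:
        return label, score
    return fallback, score


# doc.cats is a plain dict of label -> score, so it can be passed in directly.
confident = top_category({"DevOps": 0.94, "Business": 0.03})
uncertain = top_category({"Business": 0.4, "Programming": 0.35})
print(confident, uncertain)
```

A suitable threshold depends on the data and is best chosen from scores on the dev set.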

**For detailed training guide**: See [references/text-classification.md](references/text-classification.md)

---

## Troubleshooting

### Model Not Found (E050)

```
OSError: [E050] Can't find model 'en_core_web_sm'
```

**Fix:**
```bash
python -m spacy download en_core_web_sm
```

**Alternative (avoids path issues):**
```python
import en_core_web_sm
nlp = en_core_web_sm.load()
```

### Memory Issues

**Symptoms:** OOM errors, slow processing

**Fixes:**
```python
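# 1. Exclude unused components so they are never loaded into memory
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks (chunk_text is a helper you supply;
#    nlp.max_length defaults to 1,000,000 characters)
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+); do not use the docs after the block exits
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)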
```
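
The `chunk_text` helper used above is not part of spaCy. A minimal sketch of one possible implementation follows; it prefers to break at whitespace so that words are not split, and falls back to a hard cut when a window contains no whitespace.

```python
def chunk_text(text, max_length=100_000):
    """Yield consecutive pieces of text, each at most max_length characters long."""
    start = 0
    total = len(text)
    while start < total:
        end = min(start + max_length, total)
        if end < total:
            # Prefer the last space or newline inside the window as the break point.
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1
        yield text[start:end]
        start = end


chunks = list(chunk_text("aaa bbb ccc ddd", max_length=8))
print(chunks)
```

Because chunks are cut at arbitrary whitespace, a sentence or entity may straddle two chunks; splitting on paragraph boundaries first reduces that risk.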

### GPU Not Working

```python
import spacy

# Must call BEFORE loading model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Runs on GPU only if prefer_gpu() returned True
```

### Version Compatibility

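spaCy 2.x models **do not work** with spaCy 3.x, and 3.x pipelines are built for a specific minor version range. Check that installed pipelines match the installed spaCy version: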

```bash
python -m spacy validate
```
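
The same information can be read from Python through a loaded pipeline's `meta` dictionary. This is a minimal sketch; the helper name `describe_pipeline` is an assumption, and a blank pipeline stands in for `spacy.load("en_core_web_sm")` so the example runs without a download.

```python
import spacy


def describe_pipeline(nlp):
    """Return the pipeline name, its version, and the spaCy version range it requires."""
    meta = nlp.meta
    return {
        "installed_spacy": spacy.__version__,
        "pipeline": f'{meta["lang"]}_{meta["name"]}',
        "pipeline_version": meta["version"],
        "requires_spacy": meta["spacy_version"],
    }


# Replace spacy.blank("en") with the pipeline you actually load.
info = describe_pipeline(spacy.blank("en"))
print(info)
```

Comparing `installed_spacy` against `requires_spacy` shows whether a pipeline was built for a different spaCy release.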

**For more troubleshooting**: See [references/troubleshooting.md](references/troubleshooting.md)

---

## Production Deployment

### Package Model

```bash
python -m spacy package ./output/model-best ./packages \
  --name my_classifier \
  --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/
```

### FastAPI Server

Use the production template:

```bash
python scripts/serve_model.py --model ./output/model-best --port 8000
```

Or customize from template:

```python
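from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

class ClassifyRequest(BaseModel):
    text: str  # sent as a JSON body rather than a query parameter

@app.post("/classify")
async def classify(request: ClassifyRequest):
    # memory_zone() requires spaCy 3.8+; copy the scores out before the zone closes
    with nlp.memory_zone():
        doc = nlp(request.text)
        scores = dict(doc.cats)
    return {"category": max(scores, key=scores.get), "scores": scores}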
```

### Performance Optimization

| Technique | Approx. Speedup | When to Use |
|-----------|---------|-------------|
| Disable components | 2-3x | Don't need all annotations |
| `nlp.pipe()` | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |

**For evaluation metrics and hyperparameter tuning**: See [references/production.md](references/production.md)

---

## Scripts Reference

| Script | Purpose | Usage |
|--------|---------|-------|
| `prepare_training_data.py` | Convert JSON to DocBin | `python scripts/prepare_training_data.py --input data.json` |
| `generate_config.py` | Create training config | `python scripts/generate_config.py --categories "A,B,C"` |
| `evaluate_model.py` | Detailed metrics | `python scripts/evaluate_model.py --model ./output/model-best` |
| `serve_model.py` | FastAPI server | `python scripts/serve_model.py --model ./model --port 8000` |

---

## Assets Reference

| Asset | Purpose | Usage |
|-------|---------|-------|
| `config_textcat.cfg` | Base training config | Copy and customize for your labels |
| `training_data_template.json` | Data format example | Reference for preparing your data |