# hf-model-inference

> Guidance for setting up HuggingFace model inference services with Flask APIs. This skill applies when downloading HuggingFace models, creating inference endpoints, or building ML model serving APIs. Use for tasks involving transformers library, model caching, and REST API creation for ML models.

- Author: nouamanecodes
- Repository: nouamanecodes/skills
- Version: 20260113202157
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-07
- Source: https://github.com/nouamanecodes/skills
- Web: https://mule.run/skillshub/@@nouamanecodes/skills~hf-model-inference:20260113202157

---

---
name: hf-model-inference
description: Guidance for setting up HuggingFace model inference services with Flask APIs. This skill applies when downloading HuggingFace models, creating inference endpoints, or building ML model serving APIs. Use for tasks involving transformers library, model caching, and REST API creation for ML models.
---

# HuggingFace Model Inference Service

## Overview

This skill provides procedural guidance for setting up HuggingFace model inference services. It covers model downloading, caching strategies, Flask API creation, and service deployment patterns.

## Workflow

### Phase 1: Environment Setup

1. **Verify package manager availability**
   - Check for `uv`, `pip`, or `conda` before installing dependencies
   - Prefer `uv` for faster dependency resolution when available

2. **Install required packages**
   - Core: `transformers`, `torch` (or `tensorflow`)
   - API: `flask` for REST endpoints
   - Set appropriate timeouts for large package installations (300+ seconds)

3. **Create model cache directory**
   - Establish a dedicated directory for model storage (e.g., `/app/model_cache/model_name`)
   - Create parent directories as needed before downloading

### Phase 2: Model Download

1. **Download the model separately from API startup**
   - Use a dedicated download script or inline download before starting the service
   - This prevents timeout issues during API initialization

2. **Specify cache directory explicitly**
   ```python
   from transformers import pipeline
   model = pipeline("task-type", model="model-name", cache_dir="/path/to/cache")
   ```

3. **Verification step** (commonly missed)
   - After download, verify model files exist in the target directory
   - List directory contents to confirm successful download

### Phase 3: API Creation

1. **Flask application structure**
   ```python
   from flask import Flask, request, jsonify
   from transformers import pipeline

   app = Flask(__name__)
   model = None  # Load at startup

   @app.route('/predict', methods=['POST'])
   def predict():
       # Handle inference
       pass
   ```

2. **Input validation requirements**
   - Check for required fields in request JSON
   - Validate field types (string, number, etc.)
   - Handle empty or whitespace-only inputs
   - Return descriptive error messages with appropriate HTTP status codes

3. **Error response format**
   - Use consistent JSON structure: `{"error": "message"}`
   - Return 400 for client errors, 500 for server errors

### Phase 4: Service Deployment

1. **Host and port configuration**
   - Bind to `0.0.0.0` for external accessibility
   - Use specified port (commonly 5000)
   - Example: `app.run(host='0.0.0.0', port=5000)`

2. **Background execution**
   - Start Flask in background mode for testing
   - Allow startup time (2-3 seconds) before sending test requests

## Verification Strategies

### Model Download Verification
- List cache directory contents after download
- Confirm expected model files exist (config.json, model weights, tokenizer files)

### API Functionality Testing
Test these scenarios in order:

1. **Positive case**: Valid input that should succeed
   ```bash
   curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "valid input text"}'
   ```

2. **Negative case**: Different valid input to verify varied responses
   ```bash
   curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "different input text"}'
   ```

3. **Error case**: Missing required field
   ```bash
   curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{}'
   ```

### Extended Edge Cases (Optional)
- Empty string input
- Very long text input
- Non-JSON content type
- Malformed JSON
- Wrong field type (number instead of string)

## Common Pitfalls

### Installation Issues
- **Insufficient timeout**: Large packages like `torch` require extended timeouts (5+ minutes)
- **Missing system dependencies**: Some models require additional system packages

### Model Loading Issues
- **Cold start timeout**: Loading models at first request causes timeouts; load at startup instead
- **Memory constraints**: Large models may exceed available RAM; check model requirements

### API Issues
- **Development server warning**: Flask development server is not suitable for production; acceptable for testing but note the limitation
- **No graceful shutdown**: Consider signal handling for clean termination
- **No health check endpoint**: Adding `/health` endpoint aids debugging

### Process Management
- **Background process verification**: After starting in background, verify the process is running
- **Port conflicts**: Check if the specified port is already in use before starting

## Task Planning Template

When approaching HuggingFace inference tasks, structure work as follows:

1. Environment verification (package manager, system requirements)
2. Dependency installation with appropriate timeouts
3. Cache directory creation
4. Model download with explicit cache path
5. Model download verification
6. API script creation with validation
7. Service startup in background
8. Functional testing (positive, negative, error cases)
9. Edge case testing (if time permits)