# deterministic-inference

> Techniques for achieving reproducible LLM inference outputs. Use when: requiring exact reproducibility, batch-invariant inference, RL training reproducibility, consistent heuristic evaluation. Supports: SGLang deterministic mode, seed configuration, validation.

- Author: Monxun
- Repository: Monxun/monxun-plugin-marketplace
- Version: 20260131184524
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/Monxun/monxun-plugin-marketplace
- Web: https://mule.run/skillshub/@@Monxun/monxun-plugin-marketplace~deterministic-inference:20260131184524

---

# Deterministic Inference Skill

## Quick Start

Achieve bitwise-reproducible LLM outputs for consistent heuristic evaluation and validation.

### The Challenge

Even with `temperature=0`, LLM inference is non-deterministic due to:

- Batch-sensitive kernel operations (primary cause)
- Dynamic batching in servers
- Radix cache behavior

### Solution

Use batch-invariant kernels (SGLang), which produce the same result for a request regardless of what else is in the batch.

## Configuration

### SGLang (Recommended)

```bash
python -m sglang.launch_server \
  --model your-model \
  --enable-deterministic-inference \
  --attention-backend flashinfer
```

### Multi-Provider Settings

| Provider | Method | Determinism Level |
|----------|--------|-------------------|
| SGLang | `--enable-deterministic-inference` | Perfect |
| OpenAI | `seed` parameter | Best-effort |
| Anthropic | `temperature=0` | Near-deterministic |
| vLLM | `seed` + env flag | High |
| llama.cpp | `seed=42, temp=0, top_p=1, top_k=1` | High |

## Core Components

### Batch-Invariant Operations

Three operations require special handling:

1. **RMSNorm** - Normalization layer
2. **Matrix Multiplication** - Core computation
3. **Attention** - Self-attention mechanism

### Validation Tests

```bash
# Single prompt, varying batch sizes
python -m sglang.test.test_deterministic --test-mode single

# Mixed prompts in same batch
python -m sglang.test.test_deterministic --test-mode mixed

# Prefix cache consistency
python -m sglang.test.test_deterministic --test-mode prefix
```

## Performance Trade-offs

| Configuration | Slowdown | Reproducibility |
|---------------|----------|-----------------|
| Default | 0% | ~80% |
| Deterministic | 34.35% | 100% |
| + CUDA graphs | 12% | 100% |

## Validation Pattern

```python
class DeterministicValidator:
    def __init__(self, model):
        # `model` is any object exposing generate(prompt, batch_size=...)
        # and returning a hashable result (e.g. a string). This interface
        # is illustrative, not a specific library API.
        self.model = model

    def test_single(self, prompt: str, n_runs: int = 50) -> bool:
        """Same prompt across varying batch sizes."""
        outputs = set()
        for batch_size in range(1, n_runs + 1):
            result = self.model.generate(prompt, batch_size=batch_size)
            outputs.add(result)
        return len(outputs) == 1  # Must be 1 for determinism
```

## When to Use

| Scenario | Determinism Needed |
|----------|-------------------|
| Heuristic evaluation | Yes (critical) |
| RL training | Yes (critical) |
| Production inference | Usually no |
| Debugging | Yes (helpful) |

## Additional Resources

- For SGLang setup: [sglang-setup.md](references/sglang-setup.md)
- For validation suite: [validation-suite.md](references/validation-suite.md)
- For performance tuning: [performance-tuning.md](references/performance-tuning.md)

## Research Foundation

Based on: "Towards Deterministic Inference in SGLang"

- Source: Thinking Machines Lab
- Blog: lmsys.org/blog/2025-09-22-sglang-deterministic/
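
## Worked Example: Batch-Sensitive Reductions

The batch-sensitive kernel behavior named under The Challenge comes down to floating-point addition not being associative: if a kernel groups the same numbers differently depending on batch size, the rounded result can change. The pure-Python sketch below illustrates the idea only. The split heuristic in `batch_sensitive_splits` is hypothetical and does not reflect how SGLang or any real GPU kernel chooses its reduction strategy.

```python
def sequential_sum(values):
    """Add floats strictly left to right (avoids the built-in sum(),
    whose float summation strategy varies across Python versions)."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_reduce(values, num_splits):
    """Reduce contiguous chunks first, then combine the partial sums in order."""
    chunk = -(-len(values) // num_splits)  # ceiling division
    partials = [
        sequential_sum(values[i:i + chunk])
        for i in range(0, len(values), chunk)
    ]
    return sequential_sum(partials)


def batch_sensitive_splits(batch_size):
    """Hypothetical heuristic: split the reduction when the batch is small."""
    return 2 if batch_size < 4 else 1


def reduce_row(row, batch_size, batch_invariant):
    """Reduce one row; the batch-invariant path ignores batch_size entirely."""
    num_splits = 1 if batch_invariant else batch_sensitive_splits(batch_size)
    return split_reduce(row, num_splits)


row = [1e16, 1.0, -1e16, 1.0]

# Same row, two batch sizes: the grouping of additions changes the result.
sensitive = {reduce_row(row, b, batch_invariant=False) for b in (1, 8)}

# Fixed reduction strategy: one result regardless of batch size.
invariant = {reduce_row(row, b, batch_invariant=True) for b in (1, 8)}

print(len(sensitive), len(invariant))
```

With the batch-sensitive path the same row yields two distinct sums across batch sizes, while the fixed strategy yields one. Batch-invariant kernels apply the same principle to RMSNorm, matrix multiplication, and attention, usually giving up some of the speed that adaptive strategies provide.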