# aiperf-benchmark > Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark. - Author: Nguyen Tien Hoang - Repository: hnt2601/claude-skills - Version: 20260202221033 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/hnt2601/claude-skills - Web: https://mule.run/skillshub/@@hnt2601/claude-skills~aiperf-benchmark:20260202221033 --- --- name: aiperf-benchmark description: "Benchmarking AI models with Nvidia Aiperf and analyzing benchmark results. Use when the user wants to run performance benchmarks on LLM inference endpoints, analyze Aiperf CSV/JSON benchmark output files, generate performance reports from benchmark data, compare model performance metrics like TTFT, ITL, throughput, or set up benchmark configurations for vLLM, TGI, or other inference servers. Triggers on keywords like aiperf, benchmark, TTFT, ITL, throughput, inference performance, model benchmark." --- # Aiperf Benchmark Skill Aiperf (AI Performance) is a comprehensive benchmarking tool from NVIDIA's ai-dynamo project that measures performance of generative AI models served by inference solutions. ## Installation ```bash pip install aiperf --break-system-packages ``` ## Quick Start Commands ### Basic Chat Benchmarking ```bash aiperf profile --model --url --endpoint-type chat --streaming ``` ### Concurrency-Based Benchmarking ```bash aiperf profile --model --url --concurrency 10 --request-count 100 ``` ### Request Rate Benchmarking (Poisson Distribution) ```bash aiperf profile --model --url --request-rate 5.0 --benchmark-duration 60 ``` ### Multi-Turn Conversations with ShareGPT ```bash aiperf profile --model --url --public-dataset sharegpt --num-sessions 50 ``` ## Key CLI Options See [references/cli_options.md](references/cli_options.md) for the complete CLI reference. ### Essential Parameters | Parameter | Description | |-----------|-------------| | `-m, --model` | Model name(s) to benchmark (required) | | `-u, --url` | Server URL (default: localhost:8000) | | `--endpoint-type` | API type: chat, completions, embeddings, etc. | | `--streaming` | Enable streaming responses for TTFT/ITL metrics | ### Load Configuration | Parameter | Description | |-----------|-------------| | `--concurrency` | Number of concurrent requests to maintain | | `--request-rate` | Target requests per second | | `--request-count` | Maximum number of requests to send | | `--benchmark-duration` | Maximum benchmark runtime in seconds | | `--arrival-pattern` | constant, poisson (default), gamma, concurrency_burst | ### Input Configuration | Parameter | Description | |-----------|-------------| | `--isl` | Mean input sequence length (tokens) | | `--isl-stddev` | Standard deviation for input length | | `--osl` | Mean output sequence length (tokens) | | `--osl-stddev` | Standard deviation for output length | | `--input-file` | Custom dataset path (JSONL) | | `--public-dataset` | Use public dataset (e.g., sharegpt) | ### Output Configuration | Parameter | Description | |-----------|-------------| | `--artifact-dir` | Output directory (default: artifacts) | | `--export-level` | summary, records (default), or raw | | `--slice-duration` | Duration for time-sliced analysis | ## Output Files Aiperf generates several output files in the artifact directory: - `profile_export_aiperf.csv` - Summary metrics in CSV - `profile_export_aiperf.json` - Summary with metadata - `profile_export.jsonl` - Per-request metrics - `profile_export_raw.jsonl` - Raw request/response data (if --export-level raw) - `*_timeslices.csv` - Time-windowed metrics (if --slice-duration set) - `*_gpu_telemetry.jsonl` - GPU metrics (if --gpu-telemetry enabled) - `*_server_metrics.*` - Server-side Prometheus metrics ## Analyzing Benchmark Results Use `scripts/analyze_benchmark.py` to analyze CSV output: ```bash python scripts/analyze_benchmark.py /path/to/profile_export_aiperf.csv ``` ### Key Metrics in Output | Metric | Description | |--------|-------------| | `time_to_first_token_s` | Time to first token (TTFT) | | `inter_token_latency_s` | Inter-token latency (ITL) | | `request_latency_s` | End-to-end request latency | | `output_token_throughput_per_request` | Tokens/second per request | | `input_tokens`, `output_tokens` | Token counts | | `successful_requests`, `failed_requests` | Request status | ### CSV Analysis Workflow 1. Load the CSV with pandas 2. Filter by successful requests 3. Calculate percentiles (p50, p90, p95, p99) for latency metrics 4. Compute aggregate throughput 5. Generate comparison charts if multiple runs ```python import pandas as pd df = pd.read_csv('profile_export_aiperf.csv') # Filter successful requests df_success = df[df['request_output_error'].isna()] # Key metrics print(f"TTFT p50: {df_success['time_to_first_token_s'].quantile(0.5):.3f}s") print(f"TTFT p99: {df_success['time_to_first_token_s'].quantile(0.99):.3f}s") print(f"ITL p50: {df_success['inter_token_latency_s'].quantile(0.5)*1000:.2f}ms") print(f"Throughput: {df_success['output_token_throughput_per_request'].mean():.1f} tok/s") ``` ## Visualization Use `aiperf plot` to generate visualizations: ```bash aiperf plot --paths ./artifacts --output ./plots ``` Or launch interactive dashboard: ```bash aiperf plot --dashboard --port 8050 ``` ## Common Benchmark Scenarios ### Latency-Focused (Interactive Use) ```bash aiperf profile --model --url --streaming \ --concurrency 1 --request-count 100 --isl 512 --osl 256 ``` ### Throughput-Focused (Batch Processing) ```bash aiperf profile --model --url \ --concurrency 32 --request-rate 10 --benchmark-duration 300 ``` ### Goodput with SLOs ```bash aiperf profile --model --url --streaming \ --concurrency 16 --goodput "request_latency:250 inter_token_latency:10" ``` ### KV Cache Testing ```bash aiperf profile --model --url --streaming \ --num-prefix-prompts 10 --prefix-prompt-length 2048 \ --isl 512 --osl 128 --concurrency 8 ``` ## Endpoint Types | Type | Description | |------|-------------| | `chat` | OpenAI Chat Completions (default) | | `completions` | OpenAI Completions (legacy) | | `embeddings` | Vector embeddings generation | | `rankings` | Passage reranking | | `image_generation` | Image generation (FLUX.1, etc.) | | `huggingface_generate` | HuggingFace TGI API |