Test and evaluate LangGraph agents systematically. Covers dataset creation, custom evaluators, LLM-as-judge patterns with Gemini, and automated benchmarking. Use when building evaluation pipelines, comparing model versions, or measuring agent quality.
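
As a rough illustration of how dataset creation, a custom evaluator, and automated benchmarking fit together, here is a minimal sketch in plain Python. It makes no LangGraph or Gemini calls. `Example`, `exact_match`, `run_benchmark`, `stub_agent`, and `DATASET` are hypothetical names introduced only for this example, and the agent is assumed to be any callable that maps a question string to an answer string (for instance, a thin wrapper around a compiled graph).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One dataset row: an input question and its reference answer."""
    question: str
    expected: str


# An evaluator takes (agent output, reference answer) and returns a score in [0, 1].
Evaluator = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """Custom evaluator: case- and whitespace-insensitive exact match."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_benchmark(
    agent: Callable[[str], str],
    dataset: Sequence[Example],
    evaluator: Evaluator,
) -> float:
    """Run the agent on every example and return the mean evaluator score."""
    if not dataset:
        raise ValueError("dataset must contain at least one example")
    scores = [evaluator(agent(ex.question), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


def stub_agent(question: str) -> str:
    """Hypothetical stand-in for a real agent; answers from a fixed lookup table."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
    return answers.get(question, "unknown")


# Hypothetical three-row dataset used only for this illustration.
DATASET = [
    Example("What is 2 + 2?", "4"),
    Example("Capital of France?", "Paris"),
    Example("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    print(f"mean score: {run_benchmark(stub_agent, DATASET, exact_match):.2f}")
```

Because the evaluator is just a function with the signature `(output, expected) -> float`, an LLM-as-judge scorer could be swapped in for `exact_match` without changing `run_benchmark`. Running the same dataset against two agent callables gives a simple way to compare model versions.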