# agentic-optimization-craft > Use this skill when the user invokes /optimize, /optimize-bootstrap, asks to "improve my agent", "run optimization loop", "iterate on agent quality", "systematic agent improvement", "bootstrap optimization", "build evaluation infrastructure", or needs guidance on hypothesis-driven optimization cycles. Provides the full methodology for iterative AI agent improvement in Codex, including bootstrap workflows for users without existing datasets or graders. - Author: MB9012 - Repository: mberto10/claude-marketplace - Version: 20260206230906 - Stars: 0 - Forks: 0 - Last Updated: 2026-02-07 - Source: https://github.com/mberto10/claude-marketplace - Web: https://mule.run/skillshub/@@mberto10/claude-marketplace~agentic-optimization-craft:20260206230906 --- --- name: agentic-optimization-craft description: Use this skill when the user invokes /optimize, /optimize-bootstrap, asks to "improve my agent", "run optimization loop", "iterate on agent quality", "systematic agent improvement", "bootstrap optimization", "build evaluation infrastructure", or needs guidance on hypothesis-driven optimization cycles. Provides the full methodology for iterative AI agent improvement in Codex, including bootstrap workflows for users without existing datasets or graders. --- # Agentic Optimization Craft A systematic methodology for iterative AI agent improvement through hypothesis-driven experimentation. This skill guides you through the complete optimization loop with persistent state. ## The Core Philosophy **Evaluation-first development:** Create evaluations BEFORE making changes. This prevents solving imaginary problems and provides clear signal on whether changes helped. **Hypothesis-driven iteration:** Every change starts with a specific, testable hypothesis. No shotgun debugging or random prompt tweaks. **Compounding returns:** Each iteration makes both the agent AND the evaluation system better. Failures become test cases, learnings accumulate. Reference: `references/compounding-methodology.md` --- ## The Optimization Loop ``` ┌─────────────────────────────────────────────────────────────────┐ │ THE OPTIMIZATION LOOP │ │ │ │ ┌──────────────┐ │ │ │ │ │ │ ┌───────────▶│ HYPOTHESIZE │◀─── What will improve it? │ │ │ │ │ │ │ │ └──────┬───────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌──────────────┐ │ │ │ │ │ │ │ │ │ EXPERIMENT │◀─── Test the hypothesis │ │ │ │ │ │ │ │ └──────┬───────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌──────────────┐ │ │ │ │ │ │ │ │ │ ANALYZE │◀─── Did it work? Why? │ │ │ │ │ │ │ │ └──────┬───────┘ │ │ │ │ │ │ │ ▼ │ │ │ ┌──────────────┐ │ │ │ │ │ │ │ └────────────│ COMPOUND │◀─── Capture learnings │ │ │ │ │ │ └──────────────┘ │ │ │ └─────────────────────────────────────────────────────────────────┘ ``` | Phase | Goal | Output | |-------|------|--------| | BOOTSTRAP | Build missing infrastructure | Dataset + graders ready | | INITIALIZE | Establish baseline | Journal with baseline metrics | | HYPOTHESIZE | Formulate testable improvement | Documented hypothesis + change plan | | EXPERIMENT | Test the hypothesis | Evaluation results | | ANALYZE | Understand results | Validated/invalidated + findings | | COMPOUND | Capture value | Grown dataset + learnings | --- ## Phase -1: BOOTSTRAP (Optional) **Goal:** Build evaluation infrastructure from human feedback or production traces. **When to use:** Start here if you have: - Human annotations/feedback but no automated graders - Production traces but no curated dataset - Freeform feedback that needs to be structured **Skip to Phase 0** if you already have a dataset and graders ready. ### Step 1: Assess Current State Gather information about what exists: ``` Assessment checklist: 1. **Traces**: Do you have Langfuse traces from your agent? [ ] Yes, production traces exist [ ] No, need to set up tracing first 2. **Human Feedback**: What form does your feedback take? [ ] Langfuse annotations (scores on traces) [ ] Annotation queue (traces flagged for review) [ ] Freeform feedback (Slack, email, tickets) [ ] No feedback yet 3. **Dataset**: Do you have an evaluation dataset? [ ] Yes, in Langfuse [ ] Yes, local files (JSON/CSV) [ ] No dataset yet 4. **Graders**: Do you have automated evaluation? [ ] Yes, LLM judges configured [ ] Yes, code-based checks [ ] No automated grading ``` ### Step 2: Route to Appropriate Workflow Based on assessment, follow the appropriate path: #### Path A: No Traces Yet If tracing isn't set up, you need it before bootstrapping: ``` You need tracing before we can bootstrap optimization. Use the langfuse-instrumentation-setup skill to add tracing: 1. Add Langfuse SDK to your agent 2. Instrument key operations 3. Run a few test cases to generate traces 4. Return here once you have traces ``` #### Path B: Traces + Human Annotations → Build Graders If you have traces with human scores but no automated graders: **Step B.1: Analyze annotation patterns** ```bash # Export existing human annotations python3 ~/.codex/skills/langfuse-annotation-manager/scripts/annotation_manager.py \ export --score-name "" --days 30 --format json ``` Review the exported annotations to understand: - Score distribution (what's the range?) - Common patterns in high-scoring vs low-scoring traces - Comments/reasoning from annotators **Step B.2: Sample traces for grader development** ```bash # Get high-quality examples python3 ~/.codex/skills/langfuse-data-retrieval/scripts/trace_retriever.py \ --last 10 --min-score 9.0 --mode io # Get low-quality examples python3 ~/.codex/skills/langfuse-data-retrieval/scripts/trace_retriever.py \ --last 10 --max-score 5.0 --mode io ``` **Step B.3: Draft grader prompt** Based on the patterns observed, create a grader: ``` Based on your human annotations, here's a draft grader prompt: --- You are evaluating an AI agent's response quality. Score from 1-10 based on: - [Criteria derived from annotation patterns] - [Criteria derived from annotation patterns] - [Criteria derived from annotation patterns] Examples of good responses (score 9+): [Insert examples from high-scoring traces] Examples of poor responses (score 5 or below): [Insert examples from low-scoring traces] --- Does this capture what your human annotators were evaluating? ``` **Step B.4: Create grader in Langfuse** Use `langfuse-prompt-management` skill to version-control the grader prompt. #### Path C: Traces but No Dataset → Curate Dataset If you have traces but no dataset: **Step C.1: Identify good candidates** ```bash # Find diverse, representative traces python3 ~/.codex/skills/langfuse-data-retrieval/scripts/trace_retriever.py \ --last 50 --mode minimal ``` Identify: - Different input types/categories - Edge cases - Known failure modes - Representative production traffic **Step C.2: Create dataset** ```python from langfuse import Langfuse lf = Langfuse() # Create dataset dataset = lf.create_dataset( name="-eval-v1", description="Initial evaluation dataset for ", metadata={"source": "bootstrap", "created": ""} ) # Add items from selected traces for trace_id in selected_traces: trace = lf.get_trace(trace_id) dataset.create_item( input=trace.input, expected_output=trace.output, # Or manually corrected output metadata={"source_trace": trace_id} ) ``` Target: 20-30 diverse items for initial dataset. #### Path D: Freeform Feedback → Structure First If you have freeform feedback (Slack, tickets, etc.): **Step D.1: Categorize feedback** ``` Common categories: | Category | Description | Example | |----------|-------------|---------| | Accuracy | Wrong information | "It gave me the wrong date" | | Completeness | Missing information | "It didn't mention X" | | Tone | Style/voice issues | "Too formal for our brand" | | Format | Output structure | "Should be bullet points" | | Hallucination | Made-up content | "That feature doesn't exist" | Review your feedback and tag each item with a category. ``` **Step D.2: Find corresponding traces** For each feedback item, locate the trace: ```bash # Search by time range or user python3 ~/.codex/skills/langfuse-data-retrieval/scripts/trace_retriever.py \ --last 20 --filter-field user_id --filter-value "" --mode minimal ``` **Step D.3: Create annotations** Convert freeform feedback to structured annotations: ```bash python3 ~/.codex/skills/langfuse-annotation-manager/scripts/annotation_manager.py \ create-score \ --trace-id "" \ --name "quality" \ --value \ --comment "" ``` Then proceed to Path B (build graders from annotations). ### Step 3: Validate Setup Once graders and dataset are ready, validate: **Run a test evaluation:** ```python from langfuse import Langfuse lf = Langfuse() dataset = lf.get_dataset("") # Run agent on 5 items for item in dataset.items[:5]: output = run_agent(item.input) # Score with new grader # Compare to expected_output ``` **Check grader alignment:** Compare automated scores to human annotations: - Do they correlate? - Are there systematic disagreements? - Adjust grader if needed ### Step 4: Transition to Full Loop Once validated, update or create the optimization journal: ```yaml meta: agent_name: "" bootstrap_completed: "" bootstrap_source: "human_annotations|traces|freeform" dataset: name: "" items: graders: - name: "" type: "llm_judge" derived_from: "human_annotations" current_phase: "init" ``` Report: ``` Bootstrap complete! You now have: - Dataset: ( items) - Grader: - Baseline ready to establish You're ready for the full optimization loop. Proceed to Phase 0: INITIALIZE to establish baseline. ``` ### Bootstrap Output Format Track progress clearly: ``` ## Bootstrap Progress: ### Assessment - Traces: ✓ Available - Human feedback: ✓ Annotations in Langfuse - Dataset: ✗ Needs creation - Graders: ✗ Needs creation ### Completed 1. ✓ Exported 45 human annotations 2. ✓ Analyzed score patterns 3. ✓ Created grader prompt 4. → Creating dataset... ### Next Step Select 20-30 traces for initial dataset. [Continue] [Pause for now] ``` ### Bootstrap Error Handling - **No traces found:** Guide to `langfuse-instrumentation-setup` skill - **Insufficient annotations:** Suggest annotation workflow first - **Grader doesn't align:** Iterate on grader prompt with more examples - **Dataset too small:** Help identify more traces to add --- ## Phase 0: INITIALIZE **Goal:** Establish baseline and create optimization infrastructure. **Reference:** `references/journal-schema.md` ### Prerequisites Check Before starting, verify: 1. **Langfuse tracing is active** - Agent produces traces in Langfuse - Key steps are instrumented (LLM calls, tools, decisions) 2. **Evaluation dataset exists** (or can be created) - Existing dataset with 20+ items, OR - Production traces to curate from, OR - Ability to create synthetic test cases 3. **Success criteria are clear** - Primary metric to optimize (accuracy, task completion, etc.) - Target value for that metric - Constraints that must not regress (latency, cost, etc.) ### Step 1: Discover the Agent Gather information about the agent: ``` Questions to answer: - What is the agent's purpose? - Where is the code? (path to entry point) - How do you run it? (command, API call, etc.) - What prompts does it use? (file paths or Langfuse prompt names) - What tools does it have? - What are known failure modes? ``` If user doesn't know, help them explore the codebase to find this information. ### Step 2: Confirm Target and Constraints Ask user to confirm: ``` Target: - Metric: - Current estimate: - Goal: Constraints (must not regress): - - ``` ### Step 3: Establish Baseline Run initial evaluation to measure current state: **If dataset exists:** ```python # Run baseline experiment from langfuse import Langfuse lf = Langfuse() dataset = lf.get_dataset("") # Run agent on each item, record scores # Name the run "baseline" ``` **If no dataset:** 1. Fetch recent production traces 2. Curate 20-30 diverse cases 3. Add expected outputs where determinable 4. Create dataset in Langfuse Record baseline metrics: - Primary metric value - Constraint metric values - Pass rate distribution - Notable failure patterns (for first hypothesis) ### Step 4: Create Journal Create the optimization journal at `.claude/optimization-loops//journal.yaml`: ```yaml meta: agent_name: "" agent_path: "" entry_point: "" started: "" target: metric: "" current: goal: constraints: - metric: "" limit: "" - metric: "" limit: "" baseline: : : dataset: "" run_name: "baseline" date: "" current_phase: "hypothesize" current_iteration: 0 iterations: [] learnings: what_works: [] what_fails: [] patterns_discovered: [] dataset_history: - iteration: 0 items_count: source: "initial" ``` ### Step 5: Transition Update journal: `current_phase: "hypothesize"` Report to user: ``` Baseline established: - : (target: ) - Dataset: ( items) Ready to begin optimization. First, we'll analyze failures and formulate a hypothesis for improvement. ``` --- ## Phase 1: HYPOTHESIZE **Goal:** Formulate a specific, testable improvement hypothesis. **Reference:** `references/hypothesis-patterns.md` ### Step 1: Review Current State From journal, understand: - Current metrics vs target (gap to close) - Previous iteration results (if any) - Accumulated learnings (what's worked/failed before) ``` Gap Analysis: - accuracy: 72% → 90% (need +18%) - Previous iteration: +6% from reasoning step - Remaining gap: 12% ``` ### Step 2: Identify Improvement Opportunity Analyze failures to find highest-impact opportunity: **Get failure data:** ```python # Fetch low-scoring traces from latest run from langfuse import Langfuse lf = Langfuse() traces = lf.fetch_traces( filter={ "scores": [{"name": "", "operator": "<", "value": }] }, limit=20 ) ``` **Categorize failures:** - Group by failure type/pattern - Count frequency of each pattern - Estimate impact if fixed **Prioritize by:** 1. **Frequency** — How often does this fail? 2. **Impact** — How much would fixing it improve target metric? 3. **Tractability** — Can we realistically fix it this iteration? ### Step 3: Formulate Hypothesis Structure your hypothesis: ``` HYPOTHESIS TEMPLATE: IF we [specific change] THEN [target metric] will improve by [expected amount] BECAUSE [reasoning based on failure analysis] RISK: [what might get worse] EVIDENCE: [failure cases that support this hypothesis] ``` **Example:** ``` IF we add explicit tool invocation guidance for math queries THEN accuracy will improve by ~10% BECAUSE 18/50 failures (36%) are math queries where the calculator tool wasn't used despite being available RISK: May increase prompt length and latency slightly EVIDENCE: Items 12, 23, 34, 41 all show reasoning attempts instead of calculator usage for arithmetic ``` **Hypothesis quality checklist:** - [ ] Specific change identified (not vague "make it better") - [ ] Expected impact quantified - [ ] Based on actual failure data - [ ] Single change (not multiple changes bundled) - [ ] Testable with current evaluation setup ### Step 4: Design the Change Specify exactly what will change: ```yaml change: type: prompt | tool | architecture | retrieval location: > description: # For prompt changes: section: addition: # For code changes: file: function: modification: ``` **Rollback plan:** - How to undo if results are negative - For Langfuse prompts: revert to previous version - For code: git revert or manual undo ### Step 5: Update Journal Add new iteration to journal: ```yaml iterations: - id: started: "" hypothesis: statement: "" expected_impact: "" risk: "" rationale: "" evidence: - "" - "" change: type: "" location: "" description: "" experiment: null results: null analysis: null compounded: null ``` Update: `current_phase: "experiment"`, `current_iteration: ` ### Step 6: Confirm with User Before proceeding: ``` Proposed Hypothesis (Iteration ): Change: - Type: - Location: - Description: Expected: Risk: Ready to implement and test this hypothesis? [Yes, proceed] [Let me refine it] [Different hypothesis] ``` --- ## Phase 2: EXPERIMENT **Goal:** Implement the change and run controlled evaluation. **Reference:** `references/experiment-design.md` ### Step 1: Implement the Change Based on change type: **Prompt change (Langfuse):** ```python from langfuse import Langfuse lf = Langfuse() # Get current prompt current = lf.get_prompt("", label="production") # Create new version new_content = """""" lf.create_prompt( name="", prompt=new_content, config=current.config, labels=[f"experiment-v{iteration}"] ) ``` **Prompt change (local file):** - Edit the file directly - Keep backup of original - Document exact changes **Code change:** - Make minimal, isolated edit - Ensure no side effects - Test that agent still runs ### Step 2: Smoke Test Before full experiment, verify change is active: ``` Smoke test checklist: [ ] Agent runs without errors [ ] Change is visible in behavior [ ] Single test case shows expected difference [ ] No obvious regressions on simple case ``` Run one or two items manually and inspect traces. ### Step 3: Run Experiment Execute full evaluation: ```python from langfuse import Langfuse lf = Langfuse() dataset = lf.get_dataset("") run_name = f"v{iteration}-{hypothesis_slug}" # Create dataset run run = dataset.create_run( name=run_name, metadata={ "iteration": iteration, "hypothesis": "", "change": "" } ) # Execute on each item for item in dataset.items: # Run your agent output = run_agent(item.input) # Log to run run.log( input=item.input, output=output, expected_output=item.expected_output ) ``` **Scoring:** Ensure judges/evaluators run on outputs (via Langfuse or custom). ### Step 4: Wait and Collect - Monitor experiment progress - Ensure all items complete successfully - Note any errors or issues ### Step 5: Collect Results Aggregate metrics: ```python # Get run results run = lf.get_dataset_run("", "") results = { "accuracy": run.aggregate_score("accuracy"), "latency_p95": run.aggregate_score("latency", percentile=95), # ... other metrics } # Calculate deltas previous = journal.iterations[-1].results if journal.iterations else journal.meta.baseline deltas = {k: results[k] - previous[k] for k in results} ``` ### Step 6: Update Journal ```yaml experiment: run_name: "v-" dataset: "" date: "" items_run: results: accuracy: latency_p95: cost_avg: delta: accuracy: <+/- value> latency_p95: <+/- value> ``` Update: `current_phase: "analyze"` ### Step 7: Report ``` Experiment Complete: v- Results: | Metric | Baseline | Previous | Current | Delta | |--------|----------|----------|---------|-------| | accuracy | 72% | 78% | 81% | +3% | | latency | 2.1s | 2.4s | 2.3s | -0.1s | items evaluated. Ready to analyze results. ``` --- ## Phase 3: ANALYZE **Goal:** Determine if hypothesis was validated and extract learnings. **Reference:** `references/analysis-framework.md` ### Step 1: Compare Results Create comparison table: | Metric | Baseline | Previous | Current | Delta | Target | Status | |--------|----------|----------|---------|-------|--------|--------| | accuracy | 72% | 78% | 81% | +3% | 90% | Gap: 9% | | latency | 2.1s | 2.4s | 2.3s | -0.1s | <3s | ✓ | | cost | $0.015 | $0.018 | $0.017 | -$0.001 | <$0.02 | ✓ | ### Step 2: Validate Hypothesis Answer these questions: 1. **Did target metric improve?** - Yes: How much? As expected? - No: Stayed flat or regressed? 2. **Were constraints maintained?** - Any constraint violations? - Close to any limit? 3. **Was improvement as expected?** - Met expectations: Hypothesis validated - Some improvement but less: Partially validated - No improvement: Hypothesis invalidated - Regression: Hypothesis backfired **Verdict:** `validated | partially_validated | invalidated | backfired` ### Step 3: Investigate Failures Even if metrics improved, failures remain. Investigate them: **Retrieve failures:** ```python # Get items that still failed failures = [item for item in run.items if item.scores["accuracy"] < threshold] ``` **For deep analysis, use the optimization-analyst agent:** ``` Launch optimization-analyst agent to investigate: - Run name: - Failure items: - Question: What patterns exist in remaining failures? ``` **Or manual investigation:** For top 3-5 failures: 1. Retrieve full trace 2. Walk through execution step by step 3. Identify where it went wrong 4. Categorize the failure type ### Step 4: Pattern Extraction Group failures by pattern: ```yaml failure_patterns: - pattern: "Context truncation" count: 4 description: "Long inputs cause important context to be truncated" root_cause: "System prompt + input exceeds context window" affected_items: ["item_12", "item_23", "item_45", "item_67"] potential_fix: "Implement chunking or summarization" - pattern: "Tool output parsing" count: 2 description: "Agent misinterprets tool response format" root_cause: "Tool returns nested JSON, agent expects flat" affected_items: ["item_34", "item_56"] potential_fix: "Add output format example to tool description" ``` ### Step 5: Good vs Bad Comparison For each failure pattern, find contrast: 1. A failed trace 2. A successful trace with similar input Compare: - What steps were taken? - Where did paths diverge? - What was the key difference? ``` COMPARISON: Context Truncation Pattern Failed (item_12): Successful (item_08): ───────────────────────────────── ───────────────────────────────── Input: 2000 words Input: 500 words Prompt: 1500 tokens Prompt: 1500 tokens Total: ~3500 tokens Total: ~2000 tokens Step 1: Parse input (truncated!) Step 1: Parse input (complete) Step 2: Missing key details Step 2: All details available Step 3: Wrong answer Step 3: Correct answer DIVERGENCE: Input truncation at context limit KEY INSIGHT: Long inputs need preprocessing ``` ### Step 6: Synthesize Findings Document key findings: ```yaml analysis: hypothesis_validated: true verdict: "Partially validated - accuracy improved but less than expected" metrics_summary: accuracy: "+3% (expected +10%)" latency: "-0.1s (improved)" key_findings: - "Math query accuracy improved from 64% to 85%" - "Improvement less than expected due to context truncation issues" - "4 failures due to truncation, unrelated to math guidance" failure_patterns: - pattern: "Context truncation" count: 4 priority: "high" next_action: "Address in next iteration" unexpected_observations: - "Latency improved despite longer prompt (fewer retries?)" recommendations: - priority: 1 action: "Implement input chunking for long queries" expected_impact: "+5% accuracy" - priority: 2 action: "Add output format examples to tool descriptions" expected_impact: "+2% accuracy" ``` ### Step 7: Update Journal Update iteration record with full analysis, then: `current_phase: "compound"` --- ## Phase 4: COMPOUND **Goal:** Capture learnings and grow the evaluation system. **Reference:** `references/compounding-strategies.md` ### Step 1: Dataset Growth Add failure cases as new test items: ```python from langfuse import Langfuse lf = Langfuse() dataset = lf.get_dataset("") for failure in new_failure_cases: dataset.create_item( input=failure.input, expected_output=failure.expected_output, metadata={ "source": f"iteration-{iteration}", "failure_pattern": failure.pattern, "added_date": today } ) ``` **What to add:** - Failure cases from this iteration (ensure they have clear expected outputs) - Edge cases discovered during analysis - Variations that might stress-test the fix **Track growth:** ```yaml dataset_history: - iteration: items_added: source: "failure_cases" patterns_covered: ["context_truncation", "tool_parsing"] ``` ### Step 2: Judge Calibration Review judge accuracy this iteration: **Check for false positives:** - Items scored high but were actually bad - Judge criteria too lenient? **Check for false negatives:** - Items scored low but were actually good - Judge criteria too strict or wrong? **If calibration issues found:** ```python # Update judge prompt lf.create_prompt( name="judge-accuracy", prompt="", labels=[f"calibrated-v{iteration}"] ) ``` ### Step 3: Prompt Management Based on results: **If improvement confirmed:** ```python # Promote experiment prompt to production lf.update_prompt_labels( name="", version=experiment_version, labels=["production"] # Add production label ) ``` **If regression or no improvement:** ```python # Keep previous production version # Archive experiment version for reference lf.update_prompt_labels( name="", version=experiment_version, labels=["archived-v{iteration}"] ) ``` ### Step 4: Capture Learnings Update journal learnings section: ```yaml learnings: what_works: - "Explicit tool guidance improves tool selection by ~20%" - "Step-by-step reasoning helps complex multi-part queries" # ... accumulated from all iterations what_fails: - "Generic 'be thorough' instructions increase latency without quality gain" - "More than 3 few-shot examples causes confusion" # ... accumulated from all iterations patterns_discovered: - "Math queries need calculator tool, not LLM reasoning" - "Context truncation starts around 6k tokens" # ... patterns found during analysis architectural_insights: - "Consider input preprocessing for production" - "Tool descriptions need output format examples" ``` ### Step 5: Decide Next Action Based on current state: | Situation | Decision | Next Phase | |-----------|----------|------------| | Target met | **Graduate** | Exit loop | | Good progress + clear next hypothesis | **Continue** | HYPOTHESIZE | | 3+ iterations with no progress | **Pivot** | Rethink approach | | Regression | **Rollback** | Revert + analyze | **Continue criteria:** - Gap to target still exists - Clear hypothesis for next iteration - Progress trend is positive **Graduate criteria:** - Target metric achieved - Constraints satisfied - Results stable (pass^k > threshold) **Pivot criteria:** - Stuck despite multiple attempts - Fundamental approach may be wrong - Need different strategy (model, architecture, etc.) ### Step 6: Prepare Next Iteration (if continuing) If decision is "continue": ```yaml compounded: dataset_items_added: 4 judge_updated: false prompt_promoted: true learnings_captured: true decision: "continue" next_hypothesis_direction: "Address context truncation for long inputs" ``` Update: `current_phase: "hypothesize"` ### Step 7: Or Graduate (if target met) If decision is "graduate": ```yaml compounded: dataset_items_added: 2 final_state: accuracy: 0.91 latency_p95: 2.3 cost_avg: 0.016 decision: "graduate" summary: "Target achieved after 4 iterations. Key improvements: tool guidance, reasoning steps, input chunking." ``` Update: `current_phase: "graduated"` Create graduation report: ``` ## Optimization Complete: **Target:** accuracy 90% ✓ Achieved: 91% **Iterations:** 4 **Duration:** 2 weeks ### Journey 1. Baseline: 72% 2. +Reasoning steps: 78% (+6%) 3. +Tool guidance: 81% (+3%) 4. +Input chunking: 91% (+10%) ### Key Learnings ### Dataset Growth 50 → 68 items (+36%) ### Recommendations for Maintenance - Monitor accuracy weekly - Re-run regression suite before prompt changes - Watch for new failure patterns in production ``` --- ## State Recovery If optimization is interrupted, state recovery is automatic: 1. Read journal 2. Check `current_phase` and `current_iteration` 3. Look at what's populated in current iteration 4. Continue from first null/incomplete field **Recovery scenarios:** | Journal State | Recovery Action | |---------------|-----------------| | Phase: hypothesize, hypothesis: null | Start hypothesis formulation | | Phase: hypothesize, hypothesis: filled | Confirm and move to experiment | | Phase: experiment, experiment: null | Implement change and run | | Phase: experiment, results: null | Collect results | | Phase: analyze, analysis: null | Perform analysis | | Phase: compound, compounded: null | Complete compounding steps | --- ## Integration Notes ### Codex Integrations Use these Codex skills to execute operational steps: - `langfuse-instrumentation-setup` for tracing prerequisites - `evaluation-design` for metrics, datasets, and grading plans - `langfuse-dataset-setup` for dataset + judge configuration - `langfuse-dataset-management` for dataset curation and growth - `langfuse-experiment-runner` for controlled eval runs - `langfuse-score-analytics` for metric trends and regressions - `langfuse-trace-analysis` for root-cause analysis - `langfuse-prompt-management` for prompt versioning and promotion - `langfuse-annotation-manager` for human review workflows ### Langfuse Operations This skill uses Langfuse for: - **Traces:** Debugging and failure analysis - **Datasets:** Test cases and regression suite - **Prompts:** Version-controlled agent prompts - **Experiments:** Controlled evaluation runs - **Scores:** Metrics and judge outputs If Langfuse MCP server is available, use it. Otherwise, use Python SDK directly. ### File Locations All state in `.claude/optimization-loops//`: ``` journal.yaml # Central state machine iterations/ # Detailed iteration records (optional) 001-reasoning.md # Narrative record of iteration 1 002-tools.md # Narrative record of iteration 2 ``` ### Asking Questions Use AskUserQuestion at key decision points: - Confirming target and constraints (INITIALIZE) - Approving hypothesis before experiment (HYPOTHESIZE) - Confirming compounding decisions (COMPOUND) - Deciding continue vs graduate (COMPOUND)