# observability-triage

> Triage production/local issues using a log → trace → metrics workflow (HTTP/gRPC/async consumers). Use when debugging incidents, regressions, or SLO violations in an already-instrumented enterprise web app; not for adding new instrumentation.

- Author: Brice Rising
- Repository: bricerising/enterprise-software-playbook
- Version: 20260201094544
- Last Updated: 2026-02-06
- Source: https://github.com/bricerising/enterprise-software-playbook
- Web: https://mule.run/skillshub/@@bricerising/enterprise-software-playbook~observability-triage:20260201094544

---

# Observability Triage (Log → Trace → Metrics)

## Overview

This skill is for **debugging** with existing telemetry. It does **not** focus on adding instrumentation (use `apply-observability-patterns` when telemetry gaps block triage).

Goal: turn “something is broken/slow” into:

- a concrete **symptom + impact** statement,
- an **evidence-backed hypothesis** (or a small set of competing ones),
- a **mitigation** (rollback/flag/scale) when needed,
- a short list of **fix + follow-up** tasks.

## Workflow

### 0) Establish ground truth (2–5 minutes)

Capture:

- Environment (`local`/`dev`/`staging`/`prod`) and time window (start/end).
- Symptom (what’s failing/slow) and impact (SLO/user-visible blast radius).
- One **exemplar**: request/trace ID, job run ID, message ID, or timestamped log line.

### 1) Logs (find the exemplar and its correlation IDs)

1. Find the first error/timeout log line closest to the symptom window.
2. Identify correlation keys (prefer stable IDs):
   - `traceId`, `requestId`, `spanId`
   - `op` (route template / RPC method / job name / message type)
   - error code/type (typed error envelope, gRPC status, HTTP status)
3. Pull the **full log story** for the exemplar (start → downstream call(s) → failure).

Copy/paste helpers live in `references/commands.md`.
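Independently of those helpers, the three steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming one JSON object per log line; `traceId` and `op` come from the correlation keys listed above, while `ts`, `level`, `msg`, and `code` are hypothetical field names that need to be mapped to your actual log schema.

```python
import json
from typing import Iterable, Optional


def log_story(lines: Iterable[str], trace_id: str) -> list[dict]:
    """Return every log event for one exemplar, ordered by timestamp."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (stack traces, startup banners)
        if isinstance(event, dict) and event.get("traceId") == trace_id:
            events.append(event)
    # Assumption: "ts" is an ISO-8601 UTC string, so string order == time order.
    return sorted(events, key=lambda e: e.get("ts", ""))


def first_error(events: list[dict]) -> Optional[dict]:
    """Return the earliest error-level event in an ordered log story."""
    return next((e for e in events if e.get("level") in ("error", "fatal")), None)


# Hypothetical sample lines, deliberately out of order and with noise.
SAMPLE = [
    '{"ts": "2026-02-01T09:45:02Z", "level": "error", "traceId": "t-1", '
    '"op": "POST /orders", "msg": "upstream timeout", "code": "DEADLINE_EXCEEDED"}',
    '{"ts": "2026-02-01T09:45:00Z", "level": "info", "traceId": "t-1", '
    '"op": "POST /orders", "msg": "request started"}',
    "not json at all",
    '{"ts": "2026-02-01T09:45:01Z", "level": "info", "traceId": "t-2", '
    '"op": "GET /health", "msg": "ok"}',
]

if __name__ == "__main__":
    for event in log_story(SAMPLE, "t-1"):
        print(event["ts"], event["level"], event["msg"])
```

Filtering on a stable ID rather than on message text keeps the story complete even when different services word their log lines differently.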

### 2) Trace (turn the exemplar into a dependency hypothesis)

If the exemplar has a `traceId`, look the trace up by that ID instead of searching by time window.

1. Open the trace and confirm the root span matches the suspected operation (`op`).
2. Identify:
   - the slowest span(s),
   - the first error span(s),
   - retries (multiple similar child spans),
   - deadline/time budget signals (deadline exceeded, timeout errors).
3. Convert that to a dependency statement:
   - “`service A` is timing out calling `service B` method `X`”
   - “DB query `Y` is slow / missing index / deadlocked”
   - “Queue consumer is failing on message type `T` (poison message)”
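The identification step can be illustrated on an exported trace. The following is a rough sketch, not a tracing-backend API: the `Span` shape, its field names, and the sample span names are all assumptions made for illustration. It picks the slowest leaf span, the error that completed first, and operations repeated under the same parent (a likely retry signal).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical, simplified span shape; real exports carry more fields.
    span_id: str
    name: str
    start_ms: int
    duration_ms: int
    parent_id: Optional[str] = None
    error: Optional[str] = None


def summarize(spans: list[Span]) -> dict:
    """Extract the signals used to form a dependency hypothesis."""
    parent_ids = {s.parent_id for s in spans if s.parent_id}
    # Leaf spans point at the actual dependency call; parent spans only
    # aggregate the time spent in their children.
    leaves = [s for s in spans if s.span_id not in parent_ids]
    slowest = max(leaves, key=lambda s: s.duration_ms)
    # Order errors by end time: the root span starts first but fails last.
    first_err = min(
        (s for s in spans if s.error),
        key=lambda s: s.start_ms + s.duration_ms,
        default=None,
    )
    # The same operation repeated under one parent suggests retries.
    counts = Counter((s.parent_id, s.name) for s in spans)
    retried = sorted(name for (_, name), n in counts.items() if n > 1)
    return {
        "slowest": slowest.name,
        "first_error": first_err.name if first_err else None,
        "retried": retried,
    }


# Hypothetical trace: one request that retries a downstream call three times.
TRACE = [
    Span("a", "POST /orders", 0, 3100, None, "DEADLINE_EXCEEDED"),
    Span("b", "db.select_cart", 5, 40, "a"),
    Span("c", "inventory.Reserve", 50, 1000, "a", "DEADLINE_EXCEEDED"),
    Span("d", "inventory.Reserve", 1060, 1000, "a", "DEADLINE_EXCEEDED"),
    Span("e", "inventory.Reserve", 2070, 1000, "a", "DEADLINE_EXCEEDED"),
]

if __name__ == "__main__":
    print(summarize(TRACE))
```

For this sample all three signals point at the same child call, which supports a statement of the form "`service A` is timing out calling `service B` method `X`".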

If you cannot find/interpret traces, fall back to logs + metrics and consider adding missing telemetry via `apply-observability-patterns`.

### 3) Metrics (confirm blast radius + regression)

Use metrics to answer:

- Is this widespread or isolated to one tenant/route/method?
- Is it a new regression (deploy-correlated) or a gradual degradation (resource/saturation)?
- Is it primarily errors or latency?

Start with RED (rate, errors, duration) for the boundary (HTTP route / gRPC method / consumer group).
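To make the three RED signals concrete, here is a small sketch that computes them per `op` from raw request records. In practice these numbers usually come from your metrics backend; the `Request` record, its field names, and the nearest-rank p95 method are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical record of one request observed at the boundary.
    op: str  # route template / RPC method / consumer group
    duration_ms: float
    ok: bool


def red(requests: list[Request], window_s: float) -> dict[str, dict[str, float]]:
    """Compute rate, error ratio, and p95 duration for each op."""
    by_op: dict[str, list[Request]] = {}
    for r in requests:
        by_op.setdefault(r.op, []).append(r)

    summary = {}
    for op, rs in by_op.items():
        durations = sorted(r.duration_ms for r in rs)
        # Nearest-rank p95 using integer math: rank = ceil(0.95 * n).
        rank = (95 * len(durations) + 99) // 100
        summary[op] = {
            "rate_per_s": len(rs) / window_s,
            "error_ratio": sum(not r.ok for r in rs) / len(rs),
            "p95_ms": durations[rank - 1],
        }
    return summary


if __name__ == "__main__":
    # Hypothetical 10-second window: one healthy-ish op, one fully failing op.
    sample = [Request("GET /cart", 10.0 * i, ok=i <= 18) for i in range(1, 21)]
    sample += [Request("POST /orders", 3000.0, ok=False) for _ in range(2)]
    for op, signals in red(sample, window_s=10).items():
        print(op, signals)
```

Comparing the per-op rows answers the first and third questions above: a single op with a high error ratio points to an isolated failure, while elevated p95 across all ops points to shared saturation.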

### 4) Decide: mitigate vs investigate

If impact is high and evidence points to a recent change:

- rollback / disable flag / reduce load / scale critical dependency

If impact is moderate or unclear:

- tighten the hypothesis with 1–2 targeted checks (another exemplar trace, compare two instances, check downstream health)

### 5) Capture learnings (don’t lose the fix)

If you found a systemic gap, capture it:

- missing telemetry field contracts → [`apply-observability-patterns`](../apply-observability-patterns/SKILL.md)
- retries without idempotency / missing time budgets → [`apply-resilience-patterns`](../apply-resilience-patterns/SKILL.md)
- repeated boundary logic across services → [`shared-platform-library`](../shared-platform-library/SKILL.md)
- cross-service pattern confusion → [`select-architecture-pattern`](../select-architecture-pattern/SKILL.md)

## Guardrails

- Don’t log secrets/PII while triaging (even “temporarily”).
- Don’t use unbounded IDs (user, request, or trace IDs) as metric labels; use logs/traces for per-entity investigation.
- Don’t add retries as a debugging “fix” without idempotency/dedupe.
- Prefer a small number of exemplars (2–3) over “grep everything forever”.
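The metric-label guardrail can be illustrated with a short sketch. It assumes ID-like path segments are numeric or UUID-shaped; the `UNBOUNDED_KEYS` set and both helper names are hypothetical and would need to match your own field contracts.

```python
import re

# Assumption: IDs in paths are either all digits or canonical UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)

# Hypothetical list of fields that belong in logs/traces, never in labels.
UNBOUNDED_KEYS = {"traceId", "requestId", "spanId", "userId", "messageId"}


def route_template(path: str) -> str:
    """Collapse ID-like path segments so the label value set stays bounded."""
    return "/".join(
        ":id" if _ID_SEGMENT.match(segment) else segment
        for segment in path.split("/")
    )


def metric_labels(fields: dict[str, str]) -> dict[str, str]:
    """Drop per-entity IDs before the fields are used as metric labels."""
    return {k: v for k, v in fields.items() if k not in UNBOUNDED_KEYS}


if __name__ == "__main__":
    raw = "/orders/12345/items/9f1c2b3a-0000-4000-8000-123456789abc"
    print(route_template(raw))
    print(metric_labels({"op": route_template(raw), "traceId": "t-1", "status": "504"}))
```

The dropped IDs are not lost: they stay on the log line and the trace, which is where per-entity investigation belongs.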

## References

- Copy/paste commands: [`references/commands.md`](references/commands.md)
- Scenario checklists (HTTP/gRPC/consumers): [`references/scenarios.md`](references/scenarios.md)
- If telemetry is missing: [`apply-observability-patterns`](../apply-observability-patterns/SKILL.md)

## Output Template

When using this skill, return:

- **Symptom**: what is failing/slow (include concrete ops: route/method/job/message type).
- **Impact**: who/what is affected and how badly (error rate %, latency p95, backlog size).
- **Time window**: start/end and whether it correlates with deploy/config change.
- **Evidence**: exemplar IDs + the key log/trace/metric observations.
- **Hypothesis**: most likely cause + 1 alternative (if applicable).
- **Mitigation**: what you did / recommend doing now (rollback/flag/scale).
- **Fix plan**: code/config changes to make it correct and durable.
- **Follow-ups**: telemetry gaps, runbook updates, tests, new invariants.
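If triage results are produced by a script, the template above could be represented as a small record type and rendered back into the same bullet format. This is only a sketch: `TriageReport` and `LABELS` are hypothetical names, and every value in the example is a placeholder rather than real incident data.

```python
from dataclasses import dataclass, fields

# Maps field names to the bullet labels used in the template.
LABELS = {
    "symptom": "Symptom",
    "impact": "Impact",
    "time_window": "Time window",
    "evidence": "Evidence",
    "hypothesis": "Hypothesis",
    "mitigation": "Mitigation",
    "fix_plan": "Fix plan",
    "follow_ups": "Follow-ups",
}


@dataclass(frozen=True)
class TriageReport:
    symptom: str
    impact: str
    time_window: str
    evidence: str
    hypothesis: str
    mitigation: str
    fix_plan: str
    follow_ups: str

    def render(self) -> str:
        """Render the report as markdown bullets in template order."""
        return "\n".join(
            f"- **{LABELS[f.name]}**: {getattr(self, f.name)}" for f in fields(self)
        )


if __name__ == "__main__":
    # Placeholder values only; replace with findings from the actual triage.
    report = TriageReport(
        symptom="POST /orders returns 504",
        impact="placeholder: error rate and affected tenants",
        time_window="placeholder: start/end, deploy correlation",
        evidence="placeholder: exemplar traceId and key observations",
        hypothesis="placeholder: most likely cause + 1 alternative",
        mitigation="placeholder: rollback/flag/scale",
        fix_plan="placeholder: durable code/config change",
        follow_ups="placeholder: telemetry gaps, runbook updates",
    )
    print(report.render())
```

Making every field required means an incomplete report fails at construction time instead of silently omitting a section.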