# observability-triage > Triage production/local issues using a log → trace → metrics workflow (HTTP/gRPC/async consumers). Use when debugging incidents, regressions, or SLO violations in an already-instrumented enterprise web app; not for adding new instrumentation. - Author: Brice Rising - Repository: bricerising/enterprise-software-playbook - Version: 20260201094544 - Stars: 1 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/bricerising/enterprise-software-playbook - Web: https://mule.run/skillshub/@@bricerising/enterprise-software-playbook~observability-triage:20260201094544 --- --- name: observability-triage description: Triage production/local issues using a log → trace → metrics workflow (HTTP/gRPC/async consumers). Use when debugging incidents, regressions, or SLO violations in an already-instrumented enterprise web app; not for adding new instrumentation. --- # Observability Triage (Log → Trace → Metrics) ## Overview This skill is for **debugging** with existing telemetry. It does **not** focus on adding instrumentation (use `apply-observability-patterns` when telemetry gaps block triage). Goal: turn “something is broken/slow” into: - a concrete **symptom + impact** statement, - an **evidence-backed hypothesis** (or a small set of competing ones), - a **mitigation** (rollback/flag/scale) when needed, - a short list of **fix + follow-up** tasks. ## Workflow ### 0) Establish ground truth (2–5 minutes) Capture: - Environment (`local`/`dev`/`staging`/`prod`) and time window (start/end). - Symptom (what’s failing/slow) and impact (SLO/user-visible blast radius). - One **exemplar**: request/trace ID, job run ID, message ID, or timestamped log line. ### 1) Logs (find the exemplar and its correlation IDs) 1. Find the first error/timeout log line closest to the symptom window. 2. Identify correlation keys (prefer stable IDs): - `traceId`, `requestId`, `spanId` - `op` (route template / RPC method / job name / message type) - error code/type (typed error envelope, gRPC status, HTTP status) 3. Pull the **full log story** for the exemplar (start → downstream call(s) → failure). Copy/paste helpers live in `references/commands.md`. ### 2) Trace (turn the exemplar into a dependency hypothesis) If you have a `traceId`, use it. 1. Open the trace and confirm the root span matches the suspected operation (`op`). 2. Identify: - the slowest span(s), - the first error span(s), - retries (multiple similar child spans), - deadline/time budget signals (deadline exceeded, timeout errors). 3. Convert that to a dependency statement: - “`service A` is timing out calling `service B` method `X`” - “DB query `Y` is slow / missing index / deadlocked” - “Queue consumer is failing on message type `T` (poison message)” If you cannot find/interpret traces, fall back to logs + metrics and consider adding missing telemetry via `apply-observability-patterns`. ### 3) Metrics (confirm blast radius + regression) Use metrics to answer: - Is this widespread or isolated to one tenant/route/method? - Is it a new regression (deploy-correlated) or a gradual degradation (resource/saturation)? - Is it primarily errors or latency? Start with RED for the boundary (HTTP route / gRPC method / consumer group). ### 4) Decide: mitigate vs investigate If impact is high and evidence points to a recent change: - rollback / disable flag / reduce load / scale critical dependency If impact is moderate or unclear: - tighten the hypothesis with 1–2 targeted checks (another exemplar trace, compare two instances, check downstream health) ### 5) Capture learnings (don’t lose the fix) If you found a systemic gap, capture it: - missing telemetry field contracts → [`apply-observability-patterns`](../apply-observability-patterns/SKILL.md) - retries without idempotency / missing time budgets → [`apply-resilience-patterns`](../apply-resilience-patterns/SKILL.md) - repeated boundary logic across services → [`shared-platform-library`](../shared-platform-library/SKILL.md) - cross-service pattern confusion → [`select-architecture-pattern`](../select-architecture-pattern/SKILL.md) ## Guardrails - Don’t log secrets/PII while triaging (even “temporarily”). - Don’t use unbounded IDs as metric labels; use logs/traces for per-entity investigation. - Don’t add retries as a debugging “fix” without idempotency/dedupe. - Prefer a small number of exemplars (2–3) over “grep everything forever”. ## References - Copy/paste commands: [`references/commands.md`](references/commands.md) - Scenario checklists (HTTP/gRPC/consumers): [`references/scenarios.md`](references/scenarios.md) - If telemetry is missing: [`apply-observability-patterns`](../apply-observability-patterns/SKILL.md) ## Output Template When using this skill, return: - **Symptom**: what is failing/slow (include concrete ops: route/method/job/message type). - **Impact**: who/what is affected and how badly (errors %, latency p95, backlog size). - **Time window**: start/end and whether it correlates with deploy/config change. - **Evidence**: exemplar IDs + the key log/trace/metric observations. - **Hypothesis**: most likely cause + 1 alternative (if applicable). - **Mitigation**: what you did / recommend doing now (rollback/flag/scale). - **Fix plan**: code/config changes to make it correct and durable. - **Follow-ups**: telemetry gaps, runbook updates, tests, new invariants.