# apply-observability-patterns > Apply modern observability patterns (structured logs with trace correlation, RED metrics, OpenTelemetry spans, dashboards/alerts). Use when adding or changing logs/metrics/traces instrumentation, defining telemetry field contracts, or creating verification steps/runbooks for production debugging in enterprise web apps. - Author: Brice Rising - Repository: bricerising/enterprise-software-playbook - Version: 20260201094544 - Stars: 1 - Forks: 0 - Last Updated: 2026-02-06 - Source: https://github.com/bricerising/enterprise-software-playbook - Web: https://mule.run/skillshub/@@bricerising/enterprise-software-playbook~apply-observability-patterns:20260201094544 --- --- name: apply-observability-patterns description: Apply modern observability patterns (structured logs with trace correlation, RED metrics, OpenTelemetry spans, dashboards/alerts). Use when adding or changing logs/metrics/traces instrumentation, defining telemetry field contracts, or creating verification steps/runbooks for production debugging in enterprise web apps. --- # Apply Observability Patterns ## Overview Make observability consistent and actionable: every boundary emits traces, metrics, and structured logs that correlate via IDs and stable fields. This is intentionally opinionated: you should be able to answer “what happened?” with **log → trace → metrics** within a minute. ## Workflow 1. Define the **unit of work** (one trace): HTTP request, gRPC call, job run, queue message, WebSocket action. 2. Instrument end-to-end: - traces: spans around the unit of work + key downstream calls - metrics: RED for the boundary + a few domain metrics - logs: structured JSON that includes correlation IDs 3. Declare the field contract (stable keys). 4. Add guardrails (PII rules, label cardinality rules, sampling/log levels). 5. Verify correlation in a failure case (error log includes `traceId`; trace contains downstream spans; metrics show error rate). ## Chooser (What To Instrument) Start with the user-impact boundaries: - **HTTP handlers**: one root span per request + RED metrics per route template. - **gRPC methods**: one root span per RPC + RED metrics per service/method. - **DB/cache clients**: child spans per query/command; include target system and operation. - **Async jobs / schedulers**: one root span per run; metrics for runs/success/failure/duration. - **Event consumers**: one root span per message (or per batch); include message type and dedupe/idempotency metadata. - **WebSockets**: session context + per-action spans; metrics for connections, messages, disconnect reasons. ## Field Contract (Opinionated Defaults) ### Logs (structured JSON) Include these keys where applicable: - `service`: stable service/app identifier - `env`: environment (local/dev/staging/prod) - `traceId`, `spanId`: correlation IDs (when tracing exists) - `requestId`: if you use a separate request ID (often equals `traceId`) - `op`: operation name (route template, RPC method, job name) - `userId` / `actorId`: only if policy allows; never as a metric label - `durationMs`: for timing logs (prefer metrics for aggregates) - `err`: structured error (`type`/`code`, message, stack for unknown failures) ### Spans (traces) - Name spans by operation (`HTTP GET /api/foo`, `grpc PlayerService/GetProfile`, `redis GET gateway:...`). - Set attributes for routing and outcome (status code, error code, retry count). - Prefer stable, low-cardinality attributes; avoid raw request bodies. ### Metrics (RED + domain) - **RED** for each boundary (per route/RPC): request count, error count, duration histogram. - Add a few **domain metrics** that align with product intent (tables created, orders completed, etc.). - Avoid high-cardinality labels (no `userId`, no unbounded IDs); use logs/traces for per-entity detail. ## Guardrails (Prevent “Telemetry Debt”) - **Cardinality discipline**: metric label values must be bounded sets; default to route templates, not raw URLs. - **PII discipline**: never log secrets; be explicit about what IDs are safe to log. - **Log once**: avoid logging the same error in every layer; log at the boundary with enough context. - **Sample intentionally**: if you sample traces, keep error traces at higher priority. - **Always end spans**: long-running work should have explicit shutdown and cancellation semantics. ## Minimal TypeScript Snippet (Trace IDs in Logs) If you use OpenTelemetry, you can enrich logs with the active span context: ```ts import { context, trace } from '@opentelemetry/api'; export function getTraceLogFields(): { traceId?: string; spanId?: string } { const span = trace.getSpan(context.active()); if (!span) return {}; const { traceId, spanId } = span.spanContext(); return { traceId, spanId }; } ``` ## Testing / Verification - Exercise a failing request and verify: - the error log includes `traceId` - the trace contains downstream span(s) - boundary RED metrics reflect the error - Prefer consumer-visible tests for behavior; treat telemetry verification as a local/dev smoke check unless the project already has telemetry assertions. ## References - Deeper checklists: [`references/checklists.md`](references/checklists.md) - Boundary tests: [`consumer-test-coverage`](../consumer-test-coverage/SKILL.md) - Typed errors + explicit lifetimes: [`typescript-style-guide`](../typescript-style-guide/SKILL.md) ## Output Template When applying this skill, return: - The instrumentation plan (which boundaries, what telemetry, what fields). - The minimal code changes (where to start spans, where to log, what metrics to add). - The verification steps (how to reproduce and correlate log → trace → metrics).