# observability

> Structured logging, error tracking, and request tracing principles for production systems.

- Author: jonathan
- Repository: jcvargasGit/claude-config-software
- Version: 20260203211818
- Stars: 0
- Forks: 0
- Last Updated: 2026-02-06
- Source: https://github.com/jcvargasGit/claude-config-software
- Web: https://mule.run/skillshub/@@jcvargasGit/claude-config-software~observability:20260203211818

---

---
name: observability
description: Structured logging, error tracking, and request tracing principles for production systems.
---

# Observability Skill

Apply these logging and observability principles for production-ready applications. This skill covers concepts; see the `lang-*` skills for language-specific implementations.

## Core Principles

### Structured Logging

Use structured (JSON) logging instead of plain text.

**Why:**
- Machine-parseable (CloudWatch Insights, Datadog, etc.)
- Consistent format across services
- Enables filtering and aggregation
- Better for distributed tracing

**What to include:**

| Field | Purpose | Example |
|-------|---------|---------|
| timestamp | When it happened | ISO 8601 format |
| level | Severity | info, warn, error |
| message | What happened | "request handled" |
| service | Which service | "auth", "users" |
| stage | Environment | "dev", "prod" |
| requestId | Correlation | UUID from request |

### Log Levels

Use appropriate log levels:

| Level | When to Use | Examples |
|-------|-------------|----------|
| **debug** | Development troubleshooting | Variable values, flow tracing |
| **info** | Normal operations | Request completed, job started |
| **warn** | Recoverable issues, client errors | 4xx responses, validation failures |
| **error** | System failures, unexpected errors | 5xx responses, external service down |
| **fatal** | Unrecoverable, app must stop | Startup failure, missing config |

### HTTP Response to Log Level Mapping

| Status Code | Log Level | Rationale |
|-------------|-----------|-----------|
| 2xx | info | Normal operation |
| 4xx | warn | Client error, not our fault |
| 5xx | error | Server error, needs attention |

## Error Logging

### What to Log on Errors

Always include context when logging errors:

| Field | Purpose |
|-------|---------|
| error | The error message/type |
| operation | What was being attempted |
| status | HTTP status code |
| userId | Who was affected (if applicable) |
| input | Sanitized input that caused the error |

### Error Logging Pattern

```
On error:
1. Determine severity (4xx = warn, 5xx = error)
2. Log with context (operation, status, error)
3. Return appropriate response to client
```

### What NOT to Log

- Passwords or tokens
- Full credit card numbers
- Personally identifiable information (PII)
- Session secrets
- API keys

## Request Tracing

### Correlation IDs

Every request should have a unique identifier that flows through all services:

1. **Generate or extract** ID at entry point (API Gateway, load balancer)
2. **Propagate** ID to all downstream calls
3. **Include** ID in all log entries
4. **Return** ID in error responses (helps debugging)

### Common Header Names

| Header | Description |
|--------|-------------|
| X-Request-ID | Generic request identifier |
| X-Correlation-ID | Cross-service trace identifier |
| X-Amzn-Trace-Id | AWS X-Ray trace ID |

### Trace Context

When calling downstream services, propagate:

- Request ID
- Parent span ID (for distributed tracing)
- User context (if applicable)

## Service Context

### What to Include

Add consistent context to all logs:

| Field | Source | Purpose |
|-------|--------|---------|
| service | Configuration | Identify which service |
| stage | Environment variable | Identify environment |
| version | Build info | Identify deployment |
| region | Environment | Geographic location |

### Initialize Once

Set service context at startup, not per-request. This reduces overhead and ensures consistency.
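As a minimal sketch of the principles above (structured JSON output, status-to-level mapping, and service context set once at startup), the following uses only Python's standard `logging` module. It is not the `structlog` pattern referenced by the `lang-python` skill. The environment variable names `SERVICE_NAME`, `STAGE`, and `APP_VERSION`, the `fields` attribute, and all helper names are assumptions made for illustration.

```python
import json
import logging
import os
import sys
from datetime import datetime, timezone

# Service context is resolved once at startup, not per request.
# The environment variable names here are assumptions, not part of the skill.
SERVICE_CONTEXT = {
    "service": os.environ.get("SERVICE_NAME", "auth"),
    "stage": os.environ.get("STAGE", "dev"),
    "version": os.environ.get("APP_VERSION", "unknown"),
}

# Translate stdlib level numbers to the level names used in this document.
LEVEL_NAMES = {
    logging.DEBUG: "debug",
    logging.INFO: "info",
    logging.WARNING: "warn",
    logging.ERROR: "error",
    logging.CRITICAL: "fatal",
}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the service context."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVEL_NAMES.get(record.levelno, record.levelname.lower()),
            "message": record.getMessage(),
            **SERVICE_CONTEXT,
        }
        # Per-call fields (requestId, operation, status, ...) arrive via `extra`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def level_for_status(status: int) -> int:
    """Map an HTTP status code to a log level: 5xx error, 4xx warn, otherwise info."""
    if status >= 500:
        return logging.ERROR
    if status >= 400:
        return logging.WARNING
    return logging.INFO


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Create the application logger with a single JSON handler."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_response(logger, request_id, operation, status, error=None):
    """Log one handled request with the context fields listed under Error Logging."""
    fields = {"requestId": request_id, "operation": operation, "status": status}
    if error is not None:
        fields["error"] = error
    logger.log(level_for_status(status), "request handled", extra={"fields": fields})
```

A handler could call `log_response(build_logger(), request_id, "login", 404)` to emit a `warn` entry; building the logger once at startup and reusing it keeps the service context consistent across entries.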
## Logging Patterns

### Request Logging

Log at the start and end of request processing:

**Entry log (optional, can be noisy):**
- method, path
- requestId

**Exit log (always):**
- method, path, status
- duration
- requestId

### Error Response Logging

When returning an error response:

```
1. Map error to status code
2. Determine log level (warn for 4xx, error for 5xx)
3. Log with: error, operation, status
4. Return sanitized error to client
```

### Background Job Logging

For async processes:

- Log job start with job ID
- Log significant milestones
- Log completion with duration
- Log failures with full context

## Observability Checklist

- [ ] Using structured (JSON) logging
- [ ] Log levels used correctly (warn for 4xx, error for 5xx)
- [ ] Errors logged with context (operation, status)
- [ ] Request ID propagated and logged
- [ ] Service context included in all logs
- [ ] Sensitive data excluded from logs
- [ ] Request duration tracked
- [ ] Logs parseable by monitoring tools

## Anti-Patterns

### Don't Do This

- **Logging sensitive data**: Passwords, tokens, PII
- **Inconsistent formats**: Mixing JSON and plain text
- **Missing context**: Logging errors without operation info
- **Wrong levels**: Using error for validation failures
- **Silent failures**: Catching errors without logging
- **Over-logging**: Debug level in production
- **Under-logging**: No visibility into failures

### Do This Instead

- **Structured JSON logs**: Consistent, parseable format
- **Context on errors**: Always include operation, status
- **Appropriate levels**: Warn for client errors, error for server errors
- **Correlation IDs**: Trace requests across services
- **Sanitized output**: Remove sensitive data before logging

## Language-Specific Implementation

See language-specific skills for implementation details:

- `lang-golang` - zerolog patterns
- `lang-typescript` - pino/winston patterns
- `lang-python` - structlog patterns

## Integration with Cloud Services

### AWS CloudWatch

- Logs automatically captured from Lambda
- Use CloudWatch Insights for querying JSON logs
- Set up metric filters for error rates
- Create dashboards for key metrics

### Alerting

Set up alerts for:

- Error rate spikes
- High latency percentiles (p95, p99)
- Failed health checks
- Resource exhaustion (memory, connections)
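To make the first two alert conditions concrete, here is a rough sketch of how error rate and latency percentiles might be derived from exit-log entries. In practice a monitoring tool would normally compute these. The sketch assumes each entry carries `status` and a `duration` in milliseconds; the function names, the nearest-rank percentile method, and the thresholds are illustrative assumptions, not recommendations.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(exit_logs: list[dict]) -> dict:
    """Reduce exit-log entries to the metrics the alert conditions refer to."""
    durations = [entry["duration"] for entry in exit_logs]
    server_errors = sum(1 for entry in exit_logs if entry["status"] >= 500)
    return {
        "error_rate": server_errors / len(exit_logs),
        "p95": percentile(durations, 95),
        "p99": percentile(durations, 99),
    }


def alerts(summary: dict, max_error_rate: float = 0.05, max_p99_ms: float = 1000.0) -> list[str]:
    """Return the names of alert conditions that the summary exceeds.

    The default thresholds are placeholders chosen for this example only.
    """
    fired = []
    if summary["error_rate"] > max_error_rate:
        fired.append("error_rate")
    if summary["p99"] > max_p99_ms:
        fired.append("p99_latency")
    return fired
```

Only 5xx responses count toward the error rate here, which matches the level mapping in this document, where 4xx responses are logged as `warn` rather than `error`.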