
Observability Done Right: Metrics, Logs, and Traces

How to build an observability stack that helps you debug production issues in minutes instead of hours.

Marcus Rodriguez Dec 5, 2025 10 min read
Observability Monitoring Grafana OpenTelemetry

Monitoring tells you when something is broken. Observability tells you why. The distinction matters: monitoring checks predefined conditions (CPU > 80%, error rate > 1%), while observability lets you ask arbitrary questions about your system's behavior. When a new, unexpected failure mode appears — and it will — observability is what lets you diagnose it without deploying new instrumentation.

The three pillars of observability: metrics for the what, logs for the context, traces for the path

The Three Pillars and How They Connect

Metrics, logs, and traces aren't three independent systems — they're three views of the same underlying reality. The power of observability comes from connecting them: a metric alert leads you to relevant logs, which contain trace IDs that show you the exact request path through your distributed system.

  • Metrics: Aggregated numeric data over time. Low cardinality, cheap to store, good for alerting and dashboards. Example: request_duration_seconds, error_count_total.
  • Logs: Discrete events with context. High cardinality, expensive to store, good for debugging specific incidents. Example: structured JSON log with request_id, user_id, error details.
  • Traces: Request-scoped timelines across services. Shows the full path of a request, including latency at each hop. Essential for debugging distributed systems.
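The connection between the pillars lives in shared identifiers. As a minimal sketch (field names like request_id and trace_id are illustrative conventions, not a required schema), a structured log line that carries the active trace ID lets you pivot from a log search straight to the distributed trace:

```typescript
// Minimal structured logger. The field names (request_id, trace_id,
// user_id) are illustrative conventions; in a real service the trace ID
// would come from your tracing context rather than being passed by hand.
interface LogFields {
  request_id?: string;
  trace_id?: string;
  [key: string]: unknown;
}

function logEvent(level: string, message: string, fields: LogFields = {}): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  };
  const line = JSON.stringify(entry);
  console.log(line);
  return line;
}

// One line now links all three pillars: the error feeds a metric,
// the fields give debugging context, and trace_id points at the trace.
logEvent('error', 'payment failed', {
  request_id: 'req-9f2c',
  trace_id: '4bf92f3577b34da6a3ce929d0e0e4736',
  user_id: 'u-1042',
  error: 'card_declined',
});
```

Because the output is JSON, any log backend can index `trace_id` as a first-class field instead of regex-matching free text.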

OpenTelemetry: The Universal Standard

Stop building vendor-specific instrumentation. OpenTelemetry (OTel) is the CNCF standard for generating and collecting telemetry data, and it's supported by every major observability platform. Instrument once with OTel, export to any backend (Grafana, Datadog, New Relic, Jaeger, etc.).

tracing/init.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME,
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://otel-collector:4318/v1/metrics' }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Alerting Philosophy: Alert on Symptoms, Not Causes

Alert on what users experience (error rate, latency, availability), not on internal metrics (CPU, memory, disk). Users don't care if your CPU is at 90% — they care if the page loads slowly. Symptom-based alerts reduce noise (fewer false positives) and catch issues regardless of the underlying cause.

For every alert, document three things: what it means, what to check first, and what to do if it fires. Alert fatigue kills on-call engineers — every alert should be actionable.
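As a sketch of what "alert on symptoms" means in practice, the check below fires on the user-visible error ratio rather than on any internal cause. The threshold, window shape, and names are illustrative, not a prescription:

```typescript
// Symptom-based alert check: evaluates what users experience (the error
// ratio) instead of internal causes like CPU or memory.
interface WindowCounts {
  errors: number; // failed requests observed in the window
  total: number;  // all requests observed in the window
}

// Runbook for this alert (the three things every alert should document):
//   What it means:      more than `threshold` of requests are failing.
//   What to check first: recent deploys, then error logs for the window.
//   What to do:          roll back or page the owning team.
function errorRateAlert(w: WindowCounts, threshold = 0.01): boolean {
  if (w.total === 0) return false;       // no traffic: nothing users can see
  return w.errors / w.total > threshold; // e.g. > 1% of requests failing
}
```

The same check stays valid whether the failures come from a bad deploy, a saturated database, or a dependency outage, which is exactly why symptom alerts outlive cause-based ones.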

Observability is not a product you buy — it's a practice you build. The tools are just enablers. The real investment is in instrumentation discipline, structured logging standards, and a culture that treats observability as a first-class engineering concern.

Marcus Rodriguez
DevOps Engineering Lead, Vaarak DevOps