Docker in Production: Lessons from Running 500+ Containers

Docker in development and Docker in production are fundamentally different. In development, you care about fast builds and easy debugging. In production, you care about security, reliability, resource efficiency, and observability. After running 500+ containers across client deployments, we have settled on a set of production practices we never skip.

Figure: Container orchestration and deployment. Production containers need security hardening, health checks, and resource governance.

Multi-Stage Builds for Minimal Images

Every layer in your Docker image is attack surface. Build tools, dev dependencies, and source code don't belong in production images. Multi-stage builds let you use a full build environment (Node, Go, Rust toolchain) for compilation, then copy only the built artifacts to a minimal runtime image.

Dockerfile

# Build stage — full Node environment
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Production stage: minimal image (the .next/standalone copy below requires output: 'standalone' in next.config.js)
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Security: run as a non-root user; -G puts the user in the app group instead of BusyBox's default nogroup
RUN addgroup -g 1001 -S app && adduser -S -u 1001 -G app app
COPY --from=builder --chown=app:app /app/.next/standalone ./
COPY --from=builder --chown=app:app /app/public ./public
COPY --from=builder --chown=app:app /app/.next/static ./.next/static

USER app
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/api/health || exit 1

CMD ["node", "server.js"]
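
The HEALTHCHECK above only pays off when something acts on its result. As a minimal sketch, assuming a Compose deployment and a hypothetical worker service that should not start until the API is healthy, depends_on with condition: service_healthy holds the dependent container back until the check passes. The worker service and the myapp/worker image are placeholders for illustration.

```yaml
services:
  api:
    image: myapp/api:latest        # built from the Dockerfile above, including its HEALTHCHECK
  worker:
    image: myapp/worker:latest     # hypothetical dependent service
    depends_on:
      api:
        condition: service_healthy # start worker only after api reports healthy
```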

Security Hardening Checklist

Never run as root: Always create a non-root user and switch to it with the USER directive.
Use distroless or Alpine base images: Fewer packages = fewer vulnerabilities. Our Next.js production images are 85 MB, compared with 1.2 GB when built on the default node image.
Scan images in CI: Run Trivy or Snyk container scanning before pushing to registry. Block deployments with critical/high vulnerabilities.
Pin base image digests: Use FROM node:20-alpine@sha256:abc123... instead of tags to prevent supply chain attacks via tag mutation.
No secrets in images: Use runtime environment variables or secret managers. Never COPY .env files or embed API keys.
Read-only filesystem: Mount the container filesystem as read-only and use tmpfs for directories that need write access.
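
Several of these checklist items map directly onto Compose settings. The following is a minimal sketch, assuming the api service used elsewhere in this article and a hypothetical secret named api_key kept in a local file; the secret name and file path are illustrative only.

```yaml
services:
  api:
    image: myapp/api:latest
    user: "1001:1001"      # non-root UID:GID, matching the user created in the Dockerfile
    read_only: true        # mount the container's root filesystem read-only
    tmpfs:
      - /tmp               # in-memory scratch space for paths that must stay writable
    secrets:
      - api_key            # available at /run/secrets/api_key at runtime, never baked into the image

secrets:
  api_key:
    file: ./secrets/api_key.txt   # hypothetical path; keep it out of version control and the build context
```

Add one tmpfs entry for each directory the application genuinely needs to write to. A missing entry typically shows up as a read-only filesystem error at runtime.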

Resource Limits: Don't Skip This

Every container must have CPU and memory limits. Without them, a single runaway container can consume all host resources and take down every other container on the node. Memory limits also keep OOM kills contained: it is better for one container to hit its own limit and be killed than for the kernel OOM killer to pick a victim from anywhere on the host, which may well be an unrelated, healthy process.

docker-compose.prod.yml

services:
  api:
    # In production, prefer a specific version tag or digest; latest is a mutable tag
    image: myapp/api:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

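As a starting point, set the memory limit to about 2x your application's typical usage. This leaves headroom for traffic spikes without wasting resources. Then monitor actual usage for two weeks and right-size the limits from real data. Note that the resource settings under deploy are applied by Docker Swarm (docker stack deploy) and by Docker Compose v2; the legacy docker-compose v1 binary ignores them unless run with --compatibility.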

Logging and Observability

Log to stdout/stderr, never to files inside the container. Docker captures stdout/stderr and routes it to your configured logging driver (json-file, fluentd, awslogs for CloudWatch, etc.). File-based logging inside containers causes disk pressure, adds log rotation complexity, and leaves logs inaccessible once the container crashes or is removed.
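
As a sketch of what driver configuration can look like in Compose, assuming the default json-file driver: that driver performs no rotation unless told to, so it is worth setting size caps explicitly. The values below are illustrative, not tuned recommendations.

```yaml
services:
  api:
    image: myapp/api:latest
    logging:
      driver: json-file
      options:
        max-size: "10m"   # rotate once a log file reaches 10 MB
        max-file: "3"     # keep at most three log files per container
```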

Use structured JSON logging so your log aggregator can parse fields for filtering and alerting. Include the container ID, service name, and trace ID in every log entry for correlation across a distributed system.