Optimization Strategy Framework

Random optimization leads to random results. This lesson provides a structured framework for identifying bottlenecks, making controlled changes, and proving improvement with data.

The Four-Step Framework

flowchart LR
A["1. Baseline<br/>Measure current state"] --> B["2. Identify<br/>Find the bottleneck"]
B --> C["3. Optimize<br/>Apply one change"]
C --> D["4. Validate<br/>Prove improvement"]
D -->|"Better"| E["Standardize"]
D -->|"Worse/Same"| F["Rollback"]

style A fill:#e3f2fd,stroke:#1565c0
style B fill:#fff3e0,stroke:#ef6c00
style C fill:#e8f5e9,stroke:#2e7d32
style D fill:#f3e5f5,stroke:#7b1fa2
style F fill:#ffebee,stroke:#c62828

Step 1: Baseline

Before changing anything, capture metrics:

# Image sizes
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Disk usage
docker system df

# Container resource usage
docker stats --no-stream

# Build time
time docker build -t app:baseline .

# Startup time (from `docker compose up` to the health check passing).
# The subshell matters: without it, `time` would measure only the first command.
time ( docker compose up -d && \
  while ! docker inspect app --format '{{.State.Health.Status}}' | grep -q healthy; do sleep 1; done )

Save these numbers. Without a baseline, you cannot prove improvement.
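The commands above can be wrapped in a small script so every baseline run lands in one dated file that Step 4 can be diffed against. A minimal sketch; the output file name, the `capture` helper, and the guard are our own conventions, not part of the docker CLI:

```shell
#!/bin/sh
# Sketch of a baseline-capture script. Each metric gets a labeled section
# header so the validation run in Step 4 can be compared line by line.
OUT="baseline-$(date +%Y%m%d).txt"

capture() {
  # Write a labeled header, then the command's output, into $OUT
  echo "== $1 ==" >> "$OUT"
  shift
  "$@" >> "$OUT" 2>&1
}

# Guarded so the sketch is harmless on a machine without docker
if command -v docker >/dev/null 2>&1; then
  capture "image sizes" docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'
  capture "disk usage"  docker system df
  capture "container resource usage" docker stats --no-stream
fi
```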

Step 2: Identify the Bottleneck

| Symptom | Bottleneck Area | Investigate With |
|---|---|---|
| Deploys take minutes | Image size / pull time | `docker images`, `docker system df` |
| CI builds are slow | Layer cache invalidation | Build logs, `docker history` |
| Containers restart randomly | Memory limits / OOM | `docker inspect --format '{{.State.OOMKilled}}'` |
| Host disk fills up | Unmanaged storage growth | `docker system df -v`, log file sizes |
| Services fail at startup | Dependency readiness | Health check status, `docker compose logs` |

Focus on the one thing that causes the most pain.
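For the image-size row, `docker history` shows which layers are the heavy ones. A sketch; the `to_bytes` and `largest_layers` helper names are ours, and `largest_layers` relies on GNU `sort -h`, which understands human-readable sizes like `12.3MB`:

```shell
#!/bin/sh
# to_bytes: convert docker's human-readable sizes ("12.3MB", "1.1GB") to
# plain bytes so layer sizes can be compared numerically.
to_bytes() {
  printf '%s\n' "$1" | awk '
    BEGIN { u["B"]=1; u["kB"]=1000; u["MB"]=1000000; u["GB"]=1000000000 }
    { n=$0; sub(/[A-Za-z]+$/, "", n)    # numeric part
      s=$0; sub(/^[0-9.]+/,  "", s)     # unit suffix
      printf "%.0f\n", n * u[s] }'
}

# Show the five largest layers of an image, biggest first
largest_layers() {
  docker history --format '{{.Size}}\t{{.CreatedBy}}' "$1" | sort -hr | head -5
}
```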

Step 3: Optimize (One Change at a Time)

Apply exactly one optimization, then measure:

❌ Bad: "I switched to Alpine, added cache mounts, rewrote the Dockerfile, 
and enabled BuildKit all at once"

✓ Good: "I switched the base image from node:20 to node:20-alpine"

Why one at a time? If the result is worse, you know exactly what caused it. If you change five things, you cannot attribute the result.
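One way to keep experiments honest is to give each single change its own image tag, so every measurement maps back to exactly one diff. A sketch; the `experiment` helper, the `app` image name, and the `BASE` build-arg are illustrative:

```shell
#!/bin/sh
# experiment: build the current Dockerfile under a tag that names the one
# change being tested, then print the resulting size for the log.
experiment() {
  tag="$1"; shift
  docker build -t "app:$tag" "$@" . &&
  docker images "app:$tag" --format '{{.Tag}}: {{.Size}}'
}

# One change per invocation, e.g.:
#   experiment baseline
#   experiment alpine-base --build-arg BASE=node:20-alpine
```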

Step 4: Validate

Run the same measurements from Step 1 and compare:

# Compare image sizes
docker images app:baseline app:optimized

# Compare build times
time docker build --no-cache -t app:optimized .

# Compare resource usage
docker stats --no-stream app-baseline app-optimized
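To make the comparison concrete, a small helper can turn the before/after numbers into a percent change. The figures below are illustrative, not measurements:

```shell
#!/bin/sh
# pct_change: percent change between a baseline and an optimized measurement.
# Positive output means a reduction (an improvement for size and time metrics).
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

# Illustrative numbers:
baseline_mb=1120    # e.g. image built on node:20
optimized_mb=310    # e.g. the same app on node:20-alpine
echo "image size reduced by $(pct_change "$baseline_mb" "$optimized_mb")%"
```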

Decision Table

| Result | Action |
|---|---|
| Clear improvement, no regressions | Keep -- standardize the change |
| Marginal improvement, no regressions | Consider keeping, document tradeoffs |
| No measurable improvement | Rollback -- not worth complexity |
| Any regression (reliability, security) | Rollback immediately |

Standardize Successful Changes

When an optimization proves successful:

  1. Update Dockerfile templates so new services get the optimization automatically
  2. Document the change -- what was done, why, and the measured result
  3. Add CI checks if possible (e.g., max image size threshold)
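The CI check in point 3 can be as small as a size budget that fails the pipeline. A sketch; the 400 MB limit and the `app:latest` image name are illustrative:

```shell
#!/bin/sh
# CI gate sketch: fail the build if an image exceeds a size budget.
LIMIT_MB=400

image_mb() {
  # `docker image inspect --format '{{.Size}}'` prints the size in bytes
  docker image inspect "$1" --format '{{.Size}}' | awk '{ printf "%d\n", $1 / 1000000 }'
}

check_size() {
  # Split out as a pure comparison so the gate is easy to test
  [ "$1" -le "$LIMIT_MB" ]
}

# In CI:
#   check_size "$(image_mb app:latest)" || { echo "image exceeds ${LIMIT_MB} MB"; exit 1; }
```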

Optimization Priority Guide

Start with the highest-impact, lowest-risk optimizations:

| Priority | Optimization | Impact | Risk |
|---|---|---|---|
| 1 | Multi-stage builds | High (image size) | Low |
| 2 | `.dockerignore` | Medium (build speed) | None |
| 3 | Dependency-first layer ordering | High (build speed) | None |
| 4 | Log rotation in `daemon.json` | High (disk) | Low |
| 5 | Resource limits (`--memory`, `--cpus`) | Medium (stability) | Low |
| 6 | BuildKit cache mounts | Medium (build speed) | Low |
| 7 | Network segmentation | Medium (security) | Low |
| 8 | Health checks with `depends_on` | Medium (reliability) | None |
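Priorities 4 and 5 can also be combined at run time. A sketch with illustrative values -- note that per-container `--log-opt` flags only cover one container, while daemon-wide log rotation belongs in `/etc/docker/daemon.json`. The command is printed rather than executed so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Illustrative limits; tune per service.
LIMITS="--memory 512m --cpus 1.5"                       # priority 5
LOG_OPTS="--log-opt max-size=10m --log-opt max-file=3"  # priority 4 (per container)

run_cmd="docker run -d $LIMITS $LOG_OPTS app:latest"
echo "$run_cmd"
# eval "$run_cmd"   # uncomment on a docker host
```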

Key Takeaways

  • Always baseline before optimizing. No baseline = no proof.
  • Change one thing at a time and measure the result.
  • Rollback anything that causes regression in reliability or security.
  • Standardize successful optimizations into templates and CI checks.
  • Prioritize high-impact, low-risk optimizations first.

What's Next