# Optimization Strategy Framework
Random optimization leads to random results. This lesson provides a structured framework for identifying bottlenecks, making controlled changes, and proving improvement with data.
## The Four-Step Framework

```mermaid
flowchart LR
    A["1. Baseline<br/>Measure current state"] --> B["2. Identify<br/>Find the bottleneck"]
    B --> C["3. Optimize<br/>Apply one change"]
    C --> D["4. Validate<br/>Prove improvement"]
    D -->|"Better"| E["Standardize"]
    D -->|"Worse/Same"| F["Rollback"]
    style A fill:#e3f2fd,stroke:#1565c0
    style B fill:#fff3e0,stroke:#ef6c00
    style C fill:#e8f5e9,stroke:#2e7d32
    style D fill:#f3e5f5,stroke:#7b1fa2
    style F fill:#ffebee,stroke:#c62828
```
## Step 1: Baseline

Before changing anything, capture metrics:

```bash
# Image sizes
docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'

# Disk usage
docker system df

# Container resource usage
docker stats --no-stream

# Build time
time docker build -t app:baseline .

# Startup time (from docker compose up to the health check passing);
# the subshell makes `time` cover the wait loop, not just the up command
time ( docker compose up -d && \
  while ! docker inspect app --format '{{.State.Health.Status}}' | grep -q healthy; do sleep 1; done )
```

Save these numbers. Without a baseline, you cannot prove improvement.
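Step 1 can be made repeatable with a small capture script. A minimal sketch, assuming the Docker CLI is on PATH (the file-naming scheme is illustrative):

```bash
#!/usr/bin/env sh
# Snapshot baseline metrics into a dated file so a later run can be diffed.
OUT="baseline-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== image sizes =="
  docker images --format 'table {{.Repository}}\t{{.Tag}}\t{{.Size}}'
  echo "== disk usage =="
  docker system df
  echo "== container resource usage =="
  docker stats --no-stream
} > "$OUT" 2>&1
echo "Baseline saved to $OUT"
```

Keep the file with the ticket or branch so the Step 4 comparison has a fixed reference point.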
## Step 2: Identify the Bottleneck
| Symptom | Bottleneck Area | Investigate With |
|---|---|---|
| Deploys take minutes | Image size / pull time | docker images, docker system df |
| CI builds are slow | Layer cache invalidation | Build logs, docker history |
| Containers restart randomly | Memory limits / OOM | docker inspect --format '{{.State.OOMKilled}}' |
| Host disk fills up | Unmanaged storage growth | docker system df -v, log file sizes |
| Services fail at startup | Dependency readiness | Health check status, docker compose logs |
Focus on the one thing that causes the most pain.
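For the "containers restart randomly" row, a quick sweep can confirm or rule out OOM kills. A sketch, assuming the Docker CLI (the container list comes from your host):

```bash
# Ask Docker whether the kernel OOM-killed each container (running or exited).
oom_report=$(
  for c in $(docker ps -aq 2>/dev/null); do
    docker inspect --format '{{.Name}} OOMKilled={{.State.OOMKilled}}' "$c"
  done | grep 'OOMKilled=true'
) || oom_report="No OOM-killed containers found"
echo "$oom_report"
```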
## Step 3: Optimize (One Change at a Time)

Apply exactly one optimization, then measure:

❌ Bad: "I switched to Alpine, added cache mounts, rewrote the Dockerfile, and enabled BuildKit all at once."

✓ Good: "I switched the base image from node:20 to node:20-alpine."
Why one at a time? If the result is worse, you know exactly what caused it. If you change five things, you cannot attribute the result.
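Concretely, the "good" single change above is a one-line Dockerfile edit (the service and versions are illustrative):

```diff
-FROM node:20
+FROM node:20-alpine
```

Note that Alpine images use musl libc rather than glibc, so re-run your test suite after this change: native modules built against glibc are a common source of regressions here.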
## Step 4: Validate

Run the same measurements from Step 1 and compare:

```bash
# Compare image sizes
docker images app:baseline app:optimized

# Compare build times (--no-cache for a fair comparison)
time docker build --no-cache -t app:optimized .

# Compare resource usage
docker stats --no-stream app-baseline app-optimized
```
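To report the comparison consistently, a tiny helper (a sketch; the example numbers are hypothetical) turns two measurements into a percent change:

```bash
# Percent change from a baseline measurement to an optimized one,
# e.g. image size in MB or build time in seconds.
pct_change() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f%%\n", (b - a) / a * 100 }'
}

pct_change 1100 140   # hypothetical image sizes in MB -> prints -87.3%
```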
## Decision Table
| Result | Action |
|---|---|
| Clear improvement, no regressions | ✓ Keep -- standardize the change |
| Marginal improvement, no regressions | Consider keeping, document tradeoffs |
| No measurable improvement | Roll back -- not worth the complexity |
| Any regression (reliability, security) | Roll back immediately |
## Standardize Successful Changes
When an optimization proves successful:
- Update Dockerfile templates so new services get the optimization automatically
- Document the change -- what was done, why, and the measured result
- Add CI checks if possible (e.g., max image size threshold)
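The CI-check idea can be sketched as a size gate (the 200 MB limit and the image name are assumptions; tune them per service):

```bash
# Fail the pipeline if an image exceeds a size budget.
check_size() {  # $1 = image size in bytes, $2 = limit in MB
  mb=$(( $1 / 1000000 ))
  if [ "$mb" -gt "$2" ]; then
    echo "FAIL: image is ${mb}MB (limit ${2}MB)"
    return 1
  fi
  echo "OK: ${mb}MB within ${2}MB limit"
}

# In CI, feed it the real size, e.g.:
#   check_size "$(docker image inspect app:latest --format '{{.Size}}')" 200
check_size 157000000 200   # illustrative size
```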
## Optimization Priority Guide
Start with the highest-impact, lowest-risk optimizations:
| Priority | Optimization | Impact | Risk |
|---|---|---|---|
| 1 | Multi-stage builds | High (image size) | Low |
| 2 | .dockerignore | Medium (build speed) | None |
| 3 | Dependency-first layer ordering | High (build speed) | None |
| 4 | Log rotation in daemon.json | High (disk) | Low |
| 5 | Resource limits (--memory, --cpus) | Medium (stability) | Low |
| 6 | BuildKit cache mounts | Medium (build speed) | Low |
| 7 | Network segmentation | Medium (security) | Low |
| 8 | Health checks with depends_on | Medium (reliability) | None |
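Priorities 1 and 3 often land together in one Dockerfile rewrite. An illustrative sketch for a Node.js service (paths and npm scripts are assumptions):

```dockerfile
# Stage 1: build with dev dependencies; manifests are copied first so the
# npm ci layer stays cached until package*.json actually changes.
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: the runtime image carries only what the app needs to run.
FROM node:20-alpine
WORKDIR /app
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]
```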
## Key Takeaways
- Always baseline before optimizing. No baseline = no proof.
- Change one thing at a time and measure the result.
- Roll back anything that causes a regression in reliability or security.
- Standardize successful optimizations into templates and CI checks.
- Prioritize high-impact, low-risk optimizations first.
## What's Next
- Continue to Bash Automation Blueprint.