System Design Review Checklist
System design requires careful attention to detail from multiple perspectives. This guide is a reference for both initial architecture and later design changes, and it supports writing design documents and conducting design reviews. It provides a concise, non-exhaustive checklist to structure discussions and surface risks. Emphasize the items most relevant to your context.
User Interface
Are there any changes to the user interface?
- Web/mobile UI: What changes where, and why?
- CLI: Exact commands, flags, exit codes, and examples.
- Localization: Copy, dates/numbers, time zones.
External Interface
Are there any changes to external or well-defined internal interfaces?
- Provided APIs: What changes? Is it backward compatible? Versioning strategy?
- Consumed APIs: Any schema/contract changes, rate limits, auth changes?
- File/data exchange: Formats, schemas, encoding, transport, validation rules, compatibility.
- SLAs and deprecation: Communication plan, migration path, sunset timeline.
Include internal contracts (e.g., service-to-service REST, event schemas) where relevant.
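One common way to keep a provided API backward compatible is to add fields with defaults and to ignore fields the reader does not know. The sketch below illustrates that idea only; the `OrderV2` type and its field names are hypothetical, not part of any real contract.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class OrderV2:
    order_id: str
    amount_cents: int
    currency: str = "USD"  # hypothetical field added in v2; the default keeps v1 payloads valid


def parse_order(payload: Mapping[str, Any]) -> OrderV2:
    """Parse a v1 or v2 payload, ignoring any fields this version does not know."""
    return OrderV2(
        order_id=str(payload["order_id"]),
        amount_cents=int(payload["amount_cents"]),
        currency=str(payload.get("currency", "USD")),
    )
```

A change that renames or removes `order_id` or `amount_cents` would break this reader, which is the kind of change that usually calls for a new version and a deprecation plan.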
Storage
Are there any changes to how storage is used?
- Databases: Schema/index changes, constraints, size growth, query patterns.
- Filesystem and object storage: Formats, paths/buckets, lifecycle rules, encryption.
- Caches and ephemeral stores: Invalidation, TTLs.
- Backups and retention: Recovery point and recovery time objectives (RPO/RTO), restore tests, archival policies.
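For the cache item above, a review often comes down to two questions: when do entries expire, and who invalidates them early? A minimal in-memory sketch, assuming a single process and a hypothetical `TTLCache` class (a shared cache such as a network store would need different handling):

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire ttl_seconds after they are written."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Explicit invalidation, e.g. right after the underlying record changes."""
        self._entries.pop(key, None)
```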
Data Compatibility and Migration
Will existing data be compatible with the new code? If not, how will it be migrated safely?
- Online vs. offline migration, duration, and expected impact.
- Backfills, data validation, idempotency, and retry/rollback strategy.
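A backfill is easier to retry when it is idempotent: each run touches only rows that still need work, so a crashed or repeated run is harmless. The sketch below assumes a hypothetical `users` table with `first_name`, `last_name`, and a new `full_name` column, and uses SQLite only to stay self-contained.

```python
import sqlite3


def backfill_full_name(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Fill users.full_name in small batches; return the number of rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET full_name = COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
             WHERE id IN (SELECT id FROM users
                           WHERE full_name IS NULL
                           ORDER BY id
                           LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # commit per batch so locks stay short and progress survives a crash
        if cur.rowcount == 0:
            return total  # nothing left to migrate; a second run is a no-op
        total += cur.rowcount
```

Small batches trade total duration for lower impact on live traffic; the right batch size depends on the datastore and load.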
Configuration
Are there configuration changes?
- Env vars, config files, remote config/feature flags, secrets management.
- Environment parity (dev/stage/prod) and how config propagates.
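Configuration problems tend to surface late unless settings are validated at startup. A minimal sketch of loading and checking environment variables; the variable names (`DATABASE_URL`, `REQUEST_TIMEOUT_S`, `NEW_CHECKOUT_ENABLED`) are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    request_timeout_s: float
    new_checkout_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings once at startup and fail fast on missing or invalid values."""
    if "DATABASE_URL" not in env:
        raise RuntimeError("missing required setting: DATABASE_URL")
    timeout = float(env.get("REQUEST_TIMEOUT_S", "5"))
    if timeout <= 0:
        raise ValueError("REQUEST_TIMEOUT_S must be positive")
    flag = env.get("NEW_CHECKOUT_ENABLED", "false").strip().lower() in {"1", "true", "yes"}
    return Settings(env["DATABASE_URL"], timeout, flag)
```

Passing `env` explicitly makes it straightforward to run the same checks against dev, stage, and prod values in a test.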
Core Logic
Are there critical changes to business logic or workflows?
- Transactions and consistency: ACID vs. eventual consistency, idempotency keys.
- Error handling and retries: Failure modes, timeouts, dead-letter queues.
- Cache strategy: Invalidation, freshness, fallbacks.
- Data sync/eventing: Ordering, deduplication, exactly-once processing (if needed; usually built from at-least-once delivery plus idempotent consumers).
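Idempotency keys let a client retry a request without repeating its side effect. The sketch below shows the core idea with a hypothetical `IdempotentExecutor`; it keeps results in memory and is not thread-safe, whereas a real service would more likely record keys in a durable store with a uniqueness constraint.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class IdempotentExecutor:
    """Run an operation at most once per idempotency key and replay its result."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def run(self, key: str, operation: Callable[[], T]) -> T:
        if key in self._results:
            return self._results[key]  # duplicate request: return the stored result
        result = operation()
        self._results[key] = result  # stored only on success, so failures can be retried
        return result
```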
Security
Are there any security implications?
- Authentication and authorization changes, multi-tenant boundaries.
- Input validation, output encoding, CSRF/CORS, SSRF, injection risks.
- Secrets, keys, tokens: storage, rotation, scoping, and least privilege.
- Encryption in transit/at rest, key management, data minimization/retention.
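For the injection item above, the usual defense in database code is to bind user input as parameters instead of building SQL strings. A small sketch with SQLite, assuming a hypothetical `users` table:

```python
import sqlite3
from typing import Optional, Tuple


def find_user(conn: sqlite3.Connection, email: str) -> Optional[Tuple[int, str]]:
    """Look up a user by email; the value is bound as a parameter, never interpolated."""
    return conn.execute(
        "SELECT id, email FROM users WHERE email = ?",
        (email,),
    ).fetchone()
```

With binding, an input such as `' OR '1'='1` is compared as a literal string and matches no row.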
Performance and Scalability
Are there performance or scalability concerns?
- Load patterns, capacity planning, headroom, and SLOs.
- Throughput/latency targets per critical path; tail latency.
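Averages hide tail latency, so targets are usually stated as percentiles. The sketch below uses the nearest-rank definition, which is one of several conventions; monitoring systems may interpolate or estimate from histograms instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# 100 hypothetical request latencies in milliseconds: mostly fast, two slow outliers.
latencies_ms = [12.0] * 98 + [250.0, 900.0]
p50 = percentile(latencies_ms, 50)  # 12.0
p99 = percentile(latencies_ms, 99)  # 250.0
```

Here the median looks healthy while p99 is about twenty times slower, which is the gap a per-path tail-latency target is meant to expose.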
Observability and Traceability
Can you understand and audit system behavior in production?
- Structured logs with correlation/trace IDs and PII handling.
- Audit trails: who did what, when, and from where.
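Structured logs are most useful when every line for a request carries the same correlation ID. A minimal sketch using the standard `logging`, `json`, and `contextvars` modules; the field names and the `JsonFormatter` class are assumptions, not a standard schema.

```python
import contextvars
import json
import logging

# Set once per request (e.g. in middleware); "-" marks logs emitted outside a request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that includes the current correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": correlation_id.get(),
            }
        )


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

Logging identifiers rather than raw personal data (for example a user ID instead of an email address) keeps PII out of the log pipeline.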
Testability
How will the change be tested effectively?
- Unit/integration/e2e coverage for new and affected paths.
- Test data/seeding, fixtures, and environment parity.
- Non-functional tests (perf, security) and regression suites.
Deployment
How does this affect rollout and operations?
- Zero-downtime concerns: schema changes, compatibility windows, feature flags.
- Rollback plan: toggles, fast revert.
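Feature flags support both gradual rollout and fast rollback when the decision is deterministic per user. A sketch of a percentage rollout by hashing, assuming hypothetical flag and user identifiers:

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of 100 stable buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 99]
    return bucket < percent
```

Because the bucket depends only on the flag and the user, raising the percentage adds users without removing anyone already enabled, and setting it to 0 turns the feature off for everyone.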
Reliability and Resilience
How does the system behave under failure?
- Redundancy, health checks, timeouts, circuit breakers, backoff.
- Disaster recovery: RPO/RTO, region/AZ strategy, chaos testing.
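For the timeouts and backoff item above, a common pattern is bounded retries with exponential backoff and random jitter, so that many clients do not retry in lockstep. A minimal sketch; the `retry` helper, its defaults, and the choice of retryable exceptions are assumptions to adapt per dependency.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay_s: float = 0.1,
    max_delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> T:
    """Call operation up to `attempts` times, sleeping a random, growing delay between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # "full jitter": anywhere between 0 and the cap
    raise AssertionError("unreachable")
```

Retries are only safe for operations that are idempotent, and the total retry budget should fit inside the caller's own timeout.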
Copyright © 2016 - 2025 Lessizmo LLC