Question 6
Handling production issues
A critical bug appears in production affecting many users. How do you lead the response, both technically and with stakeholders?
Answer outline
Start by stabilizing the system: assess the impact, then stop the bleeding with a rollback, feature flag, or hotfix where one is available. Name a DRI (directly responsible individual) to own the incident, and bring in engineering, product, and QA in parallel.
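To make the feature-flag option concrete, here is a minimal sketch of a kill switch. Everything in it is hypothetical: `FlagStore` is an in-memory stand-in for whatever flag service a team actually uses, and `new_checkout` is an invented name for the code path suspected of causing the incident.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; a real system would query a flag service."""

    disabled: set[str] = field(default_factory=set)

    def kill(self, flag: str) -> None:
        # Turn the flag off for everyone, with no deploy required.
        self.disabled.add(flag)

    def is_enabled(self, flag: str) -> bool:
        return flag not in self.disabled


def checkout(flags: FlagStore, cart_total: float) -> str:
    # "new_checkout" is an assumed flag guarding the suspect code path.
    if flags.is_enabled("new_checkout"):
        return f"new flow: {cart_total:.2f}"
    return f"legacy flow: {cart_total:.2f}"


flags = FlagStore()
before = checkout(flags, 19.99)  # served by the new code path
flags.kill("new_checkout")       # incident response: flip the switch
after = checkout(flags, 19.99)   # served by the known-good legacy path
```

The point of the design is that mitigation is a data change rather than a code change, so users are back on the known-good path while diagnosis continues.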
Use logs, crash reports, and metrics to pin down the scope and work toward the root cause. Keep the debugging group small and focused.
Once the root cause is understood, ship the smallest safe fix, validate it, and roll it out gradually while watching key metrics. After resolution, run a blameless postmortem covering what happened, why, and which changes to testing, monitoring, or process would prevent a recurrence.
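One way to picture "roll out carefully while watching key metrics" is a staged rollout gated on an error-rate threshold. The sketch below is illustrative only: the stage percentages, the `max_error_rate` threshold, and the `error_rate_at` callback are assumptions, and the observed numbers are made up rather than taken from any real incident.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[int, bool]:
    """Advance through rollout percentages and halt at the first unhealthy stage.

    Returns (last_healthy_percent, completed).
    """
    last_healthy = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Stop here; the caller should fall back to last_healthy.
            return last_healthy, False
        last_healthy = percent
    return last_healthy, True


# Made-up numbers: the error rate spikes once the fix reaches 50% of users.
observed = {1: 0.001, 10: 0.002, 50: 0.08, 100: 0.09}
result = staged_rollout(
    [1, 10, 50, 100],
    lambda percent: observed[percent],
    max_error_rate=0.01,
)
```

Here the rollout halts at the 50% stage and reports 10% as the last healthy exposure, which bounds the blast radius if the fix itself turns out to be faulty.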
Principles
- Stabilize first: roll back, flip a flag, or hotfix before spending time on deep diagnosis.
- Assign a single DRI immediately — clear ownership prevents duplicated effort.
- Ship the smallest safe fix under pressure, then validate before full rollout.
- Communicate early and regularly with stakeholders, even when the picture is incomplete.
- Run a blameless postmortem — turn the incident into systemic improvements.