The Art Of Debugging
From: YouTube - GOTO Conferences - 2018 Debugging Under Fire: Keep your Head when Systems have Lost their. Joyent
Debugging steps
- The art of debugging isn't to guess the answer.
- It is to be able to ask the right questions, and to know how to answer them
- Answered questions are facts, not hypotheses
- Facts form constraints on future questions and hypotheses
- As facts beget questions, which beget observations and more facts, hypotheses become more tightly constrained
- like a cordon being cinched around the truth.
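The loop described above (facts constrain hypotheses, and each new fact tightens the cordon) can be sketched in a few lines of Python. This is only an illustration: the hypotheses, the fact names (disk_free_pct, db_ping_ok, deployed_in_last_hour) and the helper functions are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model: a "fact" is an answered question, stored by name.
Facts = Dict[str, object]


def unknown_or(facts: Facts, key: str, predicate: Callable[[object], bool]) -> bool:
    """A question we have not answered yet cannot rule anything out."""
    return key not in facts or predicate(facts[key])


@dataclass(frozen=True)
class Hypothesis:
    description: str
    # Returns True while the hypothesis is still consistent with the facts.
    consistent_with: Callable[[Facts], bool]


def surviving(hypotheses: List[Hypothesis], facts: Facts) -> List[Hypothesis]:
    """Keep only the hypotheses that every established fact still allows."""
    return [h for h in hypotheses if h.consistent_with(facts)]


# Hypothetical starting hypotheses for an imaginary outage.
HYPOTHESES = [
    Hypothesis("disk is full",
               lambda f: unknown_or(f, "disk_free_pct", lambda v: v < 1)),
    Hypothesis("database is unreachable",
               lambda f: unknown_or(f, "db_ping_ok", lambda v: v is False)),
    Hypothesis("a recent deploy broke it",
               lambda f: unknown_or(f, "deployed_in_last_hour", lambda v: v is True)),
]

if __name__ == "__main__":
    facts: Facts = {}
    # Each observation answers a question; the answer is a fact, not a guess.
    for key, value in [("disk_free_pct", 40), ("db_ping_ok", True)]:
        facts[key] = value
        remaining = [h.description for h in surviving(HYPOTHESES, facts)]
        print(f"after {key}={value!r}: {remaining}")
```

The point of the sketch is that nothing is ever "guessed": a hypothesis is only dropped when an observed fact contradicts it, and unanswered questions leave it in play.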
Culture of debugging
- Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away!
- Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos
- Engineers must be empowered to understand anomalies!
- Engineers must be empowered to take the extra time to build for debuggability - we must be secure in the knowledge that this pays later dividends!
Debugging during an outage
- When systems are down, there is a natural tension:
- do we optimize for recovery or understanding?
- "Can we resume service without losing information?"
- "What degree of service can we resume with minimal loss of information?"
- Overemphasizing recovery with respect to understanding may leave the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action.
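One way to approach "Can we resume service without losing information?" is to save whatever state is cheap to capture before taking the recovery action. The Python sketch below assumes hypothetical collect_state and restart callables supplied by the operator; it is not a description of any particular tool.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict


def recover_with_evidence(
    collect_state: Callable[[], Dict[str, object]],  # hypothetical diagnostics hook
    restart: Callable[[], None],                     # hypothetical recovery action
    evidence_dir: Path,
) -> Path:
    """Write diagnostic state to disk first, then perform the recovery action."""
    evidence_dir.mkdir(parents=True, exist_ok=True)
    snapshot = evidence_dir / f"snapshot-{int(time.time())}.json"
    snapshot.write_text(json.dumps(collect_state(), indent=2, sort_keys=True))
    # Only once the evidence is safely on disk do we take the
    # (possibly destructive) step that restores service.
    restart()
    return snapshot
```

The ordering is the whole design: if restart() destroys the in-memory state, the snapshot is what remains to debug from afterwards.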
The peril of overemphasizing recovery
- Recovery in lieu of understanding normalizes broken software
- If it becomes culturally ingrained, the dubious principle of software recovery has toxic corollaries, e.g.:
- BAD Anti-Patterns
- Software should tolerate bad input (viz. "npm isntall" NO)
- Software should "recover" from fatal failures (uncaught exceptions, segmentation violations etc.)
- NO: it should die and dump its state so that we can debug it.
- Software should not assert correctness of its state.
These anti-patterns impede debuggability!
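The opposite of these anti-patterns can be sketched as follows: reject input that is not understood, and check state invariants so that a violation stops the program at the point of the bug. The command set and the BoundedQueue class are hypothetical, made up for this illustration.

```python
from dataclasses import dataclass

KNOWN_COMMANDS = {"install", "remove", "list"}  # hypothetical command set


def parse_command(word: str) -> str:
    """Reject unknown input instead of guessing what the user meant."""
    if word not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {word!r}")
    return word


@dataclass
class BoundedQueue:
    capacity: int
    depth: int = 0

    def check(self) -> None:
        """Assert the state invariant; a violation is a bug, so fail loudly."""
        # An explicit raise is used so the check still runs under `python -O`,
        # which strips plain `assert` statements.
        if not 0 <= self.depth <= self.capacity:
            raise AssertionError(
                f"queue depth {self.depth} outside [0, {self.capacity}]"
            )

    def push(self) -> None:
        self.depth += 1
        self.check()
```

Neither failure is caught here on purpose: letting the exception propagate and terminate the process keeps the traceback pointing at the place where the state first went wrong.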
Debugging after an outage
- After an outage we must debug to complete understanding.
- In mature systems we can expect cascading failures - which can be exhausting to fully unwind
- It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding.
- Software engineers must be encouraged to understand their own failures, which in turn encourages designing for debuggability.
Debugging under fire
- It will always be stressful to debug a service that is down
- When a service is down we must balance the need to restore service with the need to debug it
- Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause
- Take time to discuss.
- In massive outages, parallelize by having teams take different avenues of investigation
- Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!