= The Art Of Debugging =

=== From: YouTube - GOTO Conferences - 2018 Debugging Under Fire: Keeping your Head when Systems have Lost their Mind. Joyent ===

== Debugging steps ==
 * The art of debugging isn't to guess the answer.
 * It is to ask the right questions and to know how to answer them.
 * Answered questions are facts, not hypotheses.
 * Facts form constraints on future questions and hypotheses.
 * As facts beget questions, which beget observations and more facts, hypotheses become more tightly constrained, like a cordon being cinched around the truth.

== Culture of debugging ==
 * Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away!
 * Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos.
 * Engineers must be empowered to understand anomalies!
 * Engineers must be empowered to take the extra time to build for debuggability - we must be secure in the knowledge that this pays later dividends!

== Debugging during an outage ==
 * When systems are down, there is a natural tension:
  * Do we optimize for recovery or for understanding?
  * "Can we resume service without losing information?"
  * "What degree of service can we resume with minimal loss of information?"
 * Overemphasizing recovery relative to understanding may leave the problem undebugged or (worse) exacerbate it with a destructive but unrelated action.

=== The peril of overemphasizing recovery ===
 * Recovery in lieu of understanding normalizes broken software.
 * If it becomes culturally ingrained, the dubious principle of software recovery has toxic corollaries, e.g.:
 * BAD anti-patterns:
  * Software should tolerate bad input (viz. "npm isntall"). NO.
  * Software should "recover" from fatal failures (uncaught exceptions, segmentation violations, etc.).
   * NO: it should die and dump its state so that we can debug the failure.
  * Software should not assert correctness of its state.
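The opposite of the anti-patterns above (reject bad input, assert state, die with the evidence saved) can be sketched roughly as follows. This is only an illustration, not something shown in the talk: the command set, the `state` dictionary, and every function name here are hypothetical.

```python
import json

KNOWN_COMMANDS = {"install", "update", "remove"}  # hypothetical command set


def parse_command(word):
    """Reject unknown input instead of guessing what the caller meant."""
    if word not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {word!r}")
    return word


def check_invariants(state):
    """Assert that the state is still sane; fail loudly if it is not."""
    if state["in_flight"] < 0:
        raise AssertionError(f"in_flight went negative: {state['in_flight']}")


def dump_state(state, error, path):
    """Write the state and the error to disk so the failure can be debugged."""
    with open(path, "w") as f:
        json.dump({"error": repr(error), "state": state}, f, indent=2)


def run(words, state, dump_path):
    """Process commands; on any failure, dump the state and re-raise."""
    try:
        for word in words:
            parse_command(word)
            state["handled"] += 1
            check_invariants(state)
    except Exception as error:
        dump_state(state, error, dump_path)
        raise  # do not "recover": let the process die with the evidence saved
```

Calling `run(["install", "isntall"], {"handled": 0, "in_flight": 0}, "dump.json")` handles the first command, then raises `ValueError` on the misspelled one after writing the state to `dump.json`. Nothing is swallowed, so the failure stays visible and the state at the moment of failure is preserved for later analysis.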
 * '''These anti-patterns impede debuggability!'''

=== Debugging after an outage ===
 * After an outage, we must debug to complete understanding.
 * In mature systems we can expect cascading failures, which can be exhausting to fully unwind.
 * It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding.
 * Software engineers must be encouraged to understand their own failures, which in turn encourages designing for debuggability.

=== Debugging under fire ===
 * It will always be stressful to debug a service that is down.
 * When a service is down, we must balance the need to restore service with the need to debug it.
 * Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root cause.
  * Take time to discuss.
 * In massive outages, parallelize by having teams take different avenues of investigation.
 * Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!

----
CategoryCode CategoryLinux