= The Art Of Debugging =
=== From: YouTube - goto conferences - 2018 Debugging Under Fire: Keep your Head when Systems have Lost their.  Joyent ===


== Debugging steps ==
 * The art of debugging isn't to guess the answer.
 * It is to be able to ask the right questions and to know how to answer them
 * Answered questions are facts, not hypotheses
 * Facts form constraints on future questions and hypotheses
 * As facts beget questions, which beget observations and more facts, hypotheses become more tightly constrained
   * like a cordon being cinched around the truth.
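
One way to picture the cordon, as an illustrative sketch only: treat each hypothesis as a set of predictions about observations, and discard any hypothesis that an established fact contradicts. The hypotheses, observation names and values below are invented for the example and are not from the talk.

```python
def surviving_hypotheses(hypotheses, facts):
    """Return the names of hypotheses consistent with every established fact.

    hypotheses: dict mapping a name to the observations it predicts.
    facts: dict mapping an observation to its measured value.
    A hypothesis that makes no prediction about a fact is not ruled out by it.
    """
    survivors = []
    for name, predictions in hypotheses.items():
        contradicted = any(
            key in predictions and predictions[key] != value
            for key, value in facts.items()
        )
        if not contradicted:
            survivors.append(name)
    return survivors


# Hypothetical candidate explanations for a slow service.
HYPOTHESES = {
    "disk is saturated": {"disk_busy": True, "cpu_busy": False},
    "cpu is saturated": {"cpu_busy": True},
    "lock contention": {"cpu_busy": False, "disk_busy": False},
}
```

With no facts, all three hypotheses survive. Once `cpu_busy` is observed to be `False`, "cpu is saturated" is eliminated; once `disk_busy` is also observed to be `False`, only "lock contention" remains. Each answered question narrows the set without anyone having to guess.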


== Culture of debugging ==
 * Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away!
 * Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos
 * Engineers must be empowered to understand anomalies!
 * Engineers must be empowered to take the extra time to build for debuggability - we must be secure in the knowledge that this pays dividends later!

== Debugging during an outage ==
 * When systems are down, there is a natural tension:
   * Do we optimize for recovery or for understanding?
     * "Can we resume service without losing information?"
     * "What degree of service can we resume with minimal loss of information?"
 * Overemphasizing recovery with respect to understanding may leave the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action.

=== The peril of overemphasizing recovery ===
 * Recovery in lieu of understanding normalizes broken software
 * If it becomes culturally ingrained, the dubious principle of software recovery has toxic corollaries, e.g.:
   * BAD Anti-Patterns
   * Software should tolerate bad input (viz. "npm isntall" NO)
   * Software should "recover" from fatal failures (uncaught exceptions, segmentation violations etc.)
     * NO: it should die and dump state so that we can debug it.
   * Software should not assert correctness of its state.
   * '''These anti-patterns impede debuggability!'''
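
As a minimal sketch of the alternative to these anti-patterns (not from the talk; the `Account` class, its invariant and the `InvariantViolation` exception are hypothetical names chosen for illustration), the Python below rejects bad input, asserts the correctness of its own state, and lets a violation propagate with a state snapshot attached instead of catching it and carrying on.

```python
import json


class InvariantViolation(Exception):
    """Raised when internal state is inconsistent; carries a state snapshot."""

    def __init__(self, message, state):
        super().__init__(message)
        self.state = state


class Account:
    """Hypothetical example type: the balance must never go negative."""

    def __init__(self, balance):
        self.balance = balance
        self.history = []

    def snapshot(self):
        # Everything a later debugger needs to reconstruct what happened.
        return {"balance": self.balance, "history": list(self.history)}

    def check_invariants(self):
        # Assert correctness of our own state instead of hoping it is fine.
        if self.balance < 0:
            raise InvariantViolation("balance went negative", self.snapshot())

    def withdraw(self, amount):
        # Reject bad input outright rather than guessing what was meant.
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError(f"invalid withdrawal amount: {amount!r}")
        self.history.append(("withdraw", amount))
        self.balance -= amount
        self.check_invariants()


def run_demo():
    account = Account(balance=100)
    account.withdraw(30)
    try:
        account.withdraw(500)
    except InvariantViolation as err:
        # Dump state for debugging, then re-raise so the process dies.
        print(json.dumps(err.state))
        raise
```

Calling `run_demo()` prints the snapshot and then terminates with a traceback, which is the intent: the failure is preserved for debugging rather than papered over. In a native-code service the analogous choice would be to abort and leave a core dump.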

=== Debugging after an outage ===
 * After an outage we must debug to complete understanding.
 * In mature systems we can expect cascading failures - which can be exhausting to fully unwind
 * It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding.
 * Software engineers must be encouraged to understand their own failures to encourage designing for debuggability.

=== Debugging under fire ===
 * It will always be stressful to debug a service that is down
 * When a service is down we must balance the need to restore service with the need to debug it
 * Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause
   * Take time to discuss.
 * In massive outages, parallelize by having teams take different avenues of investigation
 * Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!

----
CategoryCode CategoryLinux