The Art Of Debugging

Debuging steps

The art of debugging isn't to guess the answer.
It is to be able to ask the right questions to know how to answer them
Answered questions are facts, not hypotheses
Facts form constraints on future questions and hypotheses
As facts beget questions which beget observations and more facts, hypotheses becom more tightly constrained
- like a cordon being cinched around the truth.

Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away !
Too often, we have found that beneath innocent wisps of smoke lurk raging coal infernos
Engineers must be empowered to understand anomalies!
Engineers must be empowered to take the extra time to build for debuggability - we must be secure in the knowledge that this pays later dividends!

When systems are down, there is a natural tension:
- do we optimize for recovery of understanding ?
  - "Can we resume service without losing information?"
  - "What degree of service can we resume with minimal loss of information?"
- Over emphasizing recovery with respect to understanding may have the problem undebugged or (worse) exacerbate the problem with a destructive but unrelated action.

After an outage we must debug to complete understanding.
In mature systems we can expect cascading failures - which can be exhausting to fully unwind
It will be (very!) tempting after an outage to simply move on, but every service failure (outage-inducing or not) represents an opportunity to advance understanding.
Software engineers must be encouraged to understand their own failures to encourage designing for debug-ga-bility.

It will always be stressful to debug a service that is down
When a service is down we must balance the need to restore service with the need to debug it
Missteps can be costly; taking time to huddle and think can yield a better, safer path to recovery and root-cause
- take time discuss.
In massive outages, parallelize by having teams take different avenues of investigation
Viewing outages as opportunities for understanding allows us to develop software cultures that value debuggability!