Pretty much every substantial piece of code has defects in it. They may be huge and important, destined to take down a data center, or they may be minor, only affecting a log message at midnight on a leap year, but they are there! Developers need to be prepared to handle them. Unfortunately, bugs are all different and no one solution works everywhere. Over the years I have developed a general strategy, which I’d like to share.

The big-picture, step-by-step approach is:

  • Reproduce It
  • Wild Guess!
  • Isolation

Reproduce It

Reproduce the defect. If you can’t reproduce it then it’s much harder to fix.  If at all possible reproduce it locally in your development environment where you will be working on it.  You want to be able to make the defect occur at will.  This way you can:

  1. Verify it’s a real defect
  2. Have a way to test your solutions
  3. Study what’s happening in detail
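
As a rough sketch of what "reproduce it at will" can look like, here is a tiny Python reproduction script. Everything in it is made up for illustration: day_of_year is a hypothetical buggy function that forgets about leap years, and reproduces_defect is a hypothetical check that pins the defect to one fixed, known-bad input.

```python
from datetime import date


def day_of_year(d: date) -> int:
    """Hypothetical buggy code under investigation: it ignores leap years."""
    days_before_month = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]
    return days_before_month[d.month - 1] + d.day


def reproduces_defect() -> bool:
    """Return True when the defect occurs for a fixed, known-bad input."""
    known_bad = date(2024, 12, 31)  # 2024 is a leap year
    expected = known_bad.timetuple().tm_yday  # standard library's answer
    return day_of_year(known_bad) != expected
```

Once a check like this exists, it can be run before and after every attempted fix, which covers points 1 and 2 above.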

Wild Guess!

It’s your code, right? You probably know that code better than anyone in the known universe. Trust your feelings and just take a wild guess. If you are right, you have saved yourself a ton of time and hassle. If you aren’t right, don’t guess again. Your first instinct, once you understand the problem, is the most likely to be correct. Guessing again is a waste of time. You are going to have to do it the hard way.

Isolation

The core strategy here is really Isolation: finding where the bug is happening in your code. It’s the hard part. Actually fixing the defect will probably be easy; the most difficult step is almost always finding it. We want to isolate the problem into smaller and smaller code regions until we can pinpoint exactly where it’s happening. Ideally we want to be able to toggle the defect on and off at will. This ensures that we have actually found it and aren’t just chasing ghosts. I like to think of it like isolating a variable in an algebraic equation. We want to methodically work at it until the x (the defect) is all alone on one side of the equation.

The two broad approaches to this are:

Trace

Follow the execution of the code, drilling down hierarchically from the top level to the low level, like peeling the layers off an onion. For instance, using a debugger, step over the code at a high level (A, B, C) until you trigger the bug. Then go back, step into the function that triggered it, and step over the functions at that level (a, b, c) until the defect occurs. Then go back and step into the function that triggered the defect at that level (1, 2, 3), and so on, until you pinpoint the code that generates the defect. The same procedure can be done with logging. Just log entering and exiting each function at the top level (A, B, C). Then check the log for where the defect happens and add log statements for entering and exiting the next level of functions, and so on. It seems really simple, and it is, but it is nonetheless very powerful.

  • func-A()
  • func-B()
    • func-a()
    • func-b()
    • func-c()
      • func-1()
      • func-2()
        • line-1
        • line-2 defect
      • func-3()
  • func-C()
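
The logging variant can be sketched in Python with a small decorator. This is only an illustration under my own assumptions: traced, func_A, func_B, func_C and run_top_level are hypothetical names, and the ValueError stands in for whatever defect you are actually chasing.

```python
import functools
import logging

logger = logging.getLogger("trace")


def traced(func):
    """Log entry to and exit from func, even when it raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s", func.__name__)
        try:
            return func(*args, **kwargs)
        finally:
            logger.debug("exit %s", func.__name__)
    return wrapper


@traced
def func_A():
    return "A"


@traced
def func_B():
    raise ValueError("defect")  # stand-in for the real defect


@traced
def func_C():
    return "C"


def run_top_level():
    """Run the top level (A, B, C) and record where the defect surfaces."""
    try:
        func_A()
        func_B()
        func_C()
    except ValueError:
        logger.exception("defect surfaced")
```

With the logger set to DEBUG, the last "enter" line before the error names the function to drill into next. Decorate the functions it calls (a, b, c) and repeat.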

Eliminate

This technique is most useful when you have something that is happening all the time, like a memory leak, but you have no idea where, possibly because it is someone else’s code. Basically, don’t be afraid to comment out large chunks of code. Taking the same hierarchical approach as above, methodically comment out functions at the top level (A, B, C) one at a time until the problem disappears. Then uncomment the function whose removal eliminated the problem, step into it, and methodically comment out functions at that level until the problem is eliminated again, and so on.

  • func-A()
  • func-B()
    • func-a()
    • func-b()
    • func-c()
      • func-1()
      • func-2()
        • line-1
        • line-2 defect
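
Commenting code out by hand is the usual way to do this, but the same idea can be sketched as a small Python loop. All names here are hypothetical: find_culprit disables one step at a time, problem_occurs is whatever check tells you the problem is present, and the leaked list is a toy stand-in for a real memory leak.

```python
leaked = []  # toy stand-in for memory that is never released


def func_A():
    pass


def func_B():
    leaked.append(object())  # the hidden leak


def func_C():
    pass


def problem_occurs(steps):
    """Run the given steps from a clean state and report whether anything leaked."""
    leaked.clear()
    for step in steps:
        step()
    return len(leaked) > 0


def find_culprit(steps, problem_occurs):
    """Disable one step at a time; return the step whose removal fixes the problem."""
    if not problem_occurs(steps):
        return None  # nothing to eliminate: the problem is not reproducible here
    for i, step in enumerate(steps):
        remaining = steps[:i] + steps[i + 1:]  # "comment out" step i
        if not problem_occurs(remaining):
            return step
    return None  # no single step is responsible on its own
```

Calling find_culprit([func_A, func_B, func_C], problem_occurs) points at func_B. The next round would apply the same loop to the functions that func_B calls.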

Logging is your friend!

Graphical debuggers that step through your code line by line and let you examine variables are obviously extremely useful. But you will eventually find a situation where you can’t use one. What if you can only reproduce the problem in the live production system, where you are not allowed to install developer tools? What if the defect you are chasing is a thread timing issue, where any debugger slows down the code so much that the problem can’t occur? What then? … Ah, my old friend logging. It’s better to have had it set up and in place the whole time, so it’s there when you need it.
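
As one possible starting point, here is a minimal sketch using Python's standard logging module. The format string, configure_logging and set_verbosity are my own choices for illustration, not a required setup. Including the thread name in every line is an assumption that pays off when chasing the timing issues mentioned above.

```python
import logging

# Timestamp, level, thread name and logger name on every line.
LOG_FORMAT = "%(asctime)s %(levelname)s [%(threadName)s] %(name)s: %(message)s"


def configure_logging(level: int = logging.INFO) -> None:
    """Set up root logging once, at program start."""
    logging.basicConfig(level=level, format=LOG_FORMAT)


def set_verbosity(debug: bool) -> None:
    """Turn detailed logging on or off without touching the call sites."""
    logging.getLogger().setLevel(logging.DEBUG if debug else logging.INFO)
```

The debug-level calls can stay in the code permanently. They cost little while the level is INFO, and set_verbosity(True) switches them on when a defect shows up.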