Many software bugs seem overwhelmingly difficult to analyze at first, but they can be tackled by thinking out of the box, reading, and using the right tools. In this blog post I share some of my experiences and point you to a number of resources that can help you get started.
Figure 1: Large Hadron Collider at CERN
Thinking out of the box
Conventional engineering background
What do I mean by thinking out of the box? I know many software engineers who first studied a conventional engineering discipline and ended up developing software. I am, for example, a chemical engineer, and two of my friends have degrees in electrical engineering and physics. To tackle complex bugs it seems advantageous to have a green belt in software engineering and a brown belt in, for example, chemical engineering.
An example from the chemistry lab
Here is an example that shows what thinking out of the box can look like and how the other major component, "Do your homework", will expand your paradigm portfolio. During my first semester at the university I needed to complete a lab in qualitative inorganic analytics. The final test was to analyze a mixture of several compounds using a well-known procedure. When I received my sample, I could have just started applying the procedure and separating the compounds step by step, but I looked at the powder and saw some dark green, shiny crystals in the sample. I knew it couldn't be any of the usual inorganic compounds that we were using in the lab as reference substances. It had to be something special. What could it be? The recommended textbook for this lab covered two semesters, this one and the next. The topic in the next semester was building compounds instead of analyzing them. I also knew that the lab assistants were quite lazy, and I came to the conclusion that they might just take compounds prepared by second-semester students and use them as samples in the first semester. All I needed to do was find the substance in the textbook that formed shiny green crystals, and I would have solved a major part of the puzzle. Sure enough, I identified 3 of the 10 required ions of the test without even doing any chemistry. I was just observing, reading and "thinking out of the test tube" in this case.
Figure 2: Green crystals
Software of course is a little bit different, but it always helps to step back from the debugger and imagine a user interacting with the application. Ask yourself questions like:
- What is different with this deployment?
- Could the software bug be caused by hardware problems, e.g. an intermittent network connection?
- What changed in the environment before the problem happened?
- 49.7 day issue?
- Windows XP file share limitations?
- Issue with UX themes?
- Firewall issue?
- Virus Scanner or backup software issue?
- etc. etc. etc.
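The "49.7 day issue" in the list above most likely refers to the wraparound of a 32-bit millisecond tick counter, such as the value returned by the Windows `GetTickCount` function. The following Python sketch models that counter to show where the number comes from and how a timeout check can break at the wrap. The function names and the timeout logic are illustrative assumptions, not code from any real application.

```python
# Model of a 32-bit millisecond tick counter (the kind GetTickCount returns).
TICK_MODULUS = 2 ** 32            # counter values run from 0 to 2**32 - 1
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def days_until_wrap() -> float:
    """Uptime in days after which the counter wraps back to zero."""
    return TICK_MODULUS / MS_PER_DAY


def naive_timeout_expired(start: int, now: int, timeout_ms: int) -> bool:
    """Fragile pattern: compares against an absolute deadline that may wrap."""
    deadline = (start + timeout_ms) % TICK_MODULUS
    return now >= deadline


def safe_timeout_expired(start: int, now: int, timeout_ms: int) -> bool:
    """Robust pattern: compares the elapsed time, computed modulo 2**32."""
    elapsed = (now - start) % TICK_MODULUS
    return elapsed >= timeout_ms


if __name__ == "__main__":
    print(f"counter wraps after {days_until_wrap():.1f} days")
    start = TICK_MODULUS - 100            # 100 ms before the wrap
    now = (start + 10) % TICK_MODULUS     # only 10 ms later
    print("naive check says expired:", naive_timeout_expired(start, now, 1000))
    print("safe check says expired: ", safe_timeout_expired(start, now, 1000))
```

A symptom of this class of bug is a timeout or watchdog that misfires only on machines that have been running for about seven weeks without a reboot.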
Do your homework
And do it regularly and ahead of time. Collect knowledge about the platform, the OS, the application domain's dos and don'ts, patterns and anti-patterns. The more you know about debugging, the better prepared you will be. Consider a wide variety of sources: books, magazine articles, screencasts, training, forums, and online developer communities like CodeProject. Here is a list of books that I read to get prepared for the wildest debugging challenges:
| ||This is the latest edition of John Robbins' debugging books. I haven't read it yet, but it is on my wish list. It is targeted towards the .NET platform, whereas his first book covers exclusively Win32 and Intel x86 systems.|
| ||This book is a great introduction to debugging on the Win32 platform. It covers basic debugging scenarios like crashes, memory leaks and deadlocks. It also has an introduction to interpreting disassembled code.|
| ||Eldad Eilam provides a lot of insights about native code. Imagine debugging and fixing software without rebuilding it. Based on the knowledge from this book I was able to change the value of a constant, hardcoded timeout parameter in the binary image of an executable to fix a bug that was caused by a premature termination of the process during a shutdown procedure. The book has complete examples of analyzing native code and a nice list of tools that will help you navigate this fascinating world.|
| ||Know your platform! Jeffrey covers memory management, threads, processes and much more. The book also has a long list of utilities that I used for analyzing bugs. There is, for example, one utility that maps out virtual memory. This can help when chasing access violations.|
| ||Some debugging scenarios actually require you to use a disassembler. Getting your head around the Intel x86 instruction set will go a long way in these cases. In this book Richard targets the Linux platform and the GNU assembler, but being flexible and able to debug on different platforms is always a big plus.|
| ||With this book Mark and Co. take you on a voyage of discovery through the Windows platform. They explain the core systems of the OS and explore their inner workings using their famous Sysinternals tools. This was a dry read, at least for me.|
It's more like measure, experiment, change and observe. There are a few standard procedures that get you started with analyzing a bug. In most cases they provide the first pieces of the puzzle, but don't give you the whole picture yet. First I list a set of tools that quickly give you access to the low hanging fruit. In the next section I explain different approaches to "shake the tree" and also get the high hanging specimens.
Use Tools to pick the low hanging fruit
|WinDBG ||The tool that Microsoft engineers use to debug their stuff. The package contains a Windows GUI debugger and a command line debugger. It also comes with a VBScript file, "ADPlus.vbs", that lets you configure the debuggers with XML configuration files. This is my tool of choice to capture and analyze memory dumps.|
|ProcessExplorer ||This is a great tool for poking around in running processes. I use it to check out command line parameters, loaded DLLs, threads with the most CPU usage, Windows handles and the process inheritance trees. It is also good for finding out in which svchost a service is running.|
|Depends ||This tool ships with Visual Studio or the Windows SDK. It lets you analyze dll dependencies of executable files and dlls.|
|Spy++ ||One of my favorites. This tool lets you poke into the windowing subsystem and analyze messages and window structures. Spy++ once helped me solve a tricky buffer overrun problem. A string was truncated on screen because the text box control in a dialog window was not wide enough, and I was curious to see what the whole string looked like. I pointed Spy++ at the control and discovered that the hidden part of the string ended with the famous square character. The alarm light went on immediately: BUFFER_OVERRUN. Sure enough, I had found the cause of a system deadlock. The C-style string was a field in a C-style structure, and an oversized string in that character array had corrupted an adjacent integer field that was supposed to be zero.|
|Hexworkshop ||If you need to inspect raw memory, possibly switch between little endian and big endian, write to raw memory and see it interpreted as many different data types, then this utility is what you are looking for. |
|Perfmon ||The Windows performance monitor is part of the OS. The OS provides countless counters, but also consider the counters that your own software may expose. Microsoft has a Performance Counter Reference for Windows 2003 Server.|
|Wireshark ||Formerly known as Ethereal. This is an open source network monitor. Great help for analyzing issues that might be related to network traffic.|
|Firefox Firebug ||Firebug is the greatest tool for debugging web content.|
|Reflector ||.NET IL Disassembler and introspection tool. Greatest resource for understanding .NET assemblies.|
|RegMon ||Sysinternals tool to monitor transactions with the registry.|
|FileMon ||Sysinternals tool to monitor file access.|
|Dumpbin ||This utility ships with Visual Studio and the Windows SDK. It is helpful for locating memory addresses and mapping them to function exports. For example, you might be able to identify the function that caused a crash with dumpbin, a Dr. Watson report and the original executable file image.|
|LogParser ||Great tool if you have to sift through millions of lines of trace messages to filter out the important ones. It is extensible and can be adapted to read any kind of binary trace format. The tool uses its own SQL query engine for filtering.|
|OllyDbg ||A very nice and lightweight click-and-run debugger. It fits on a floppy disk and can be started from there. It also has a powerful trace-log feature that captures the register states of thousands of instructions before the actual crash.|
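The buffer overrun described in the Spy++ entry can be reproduced in miniature. The sketch below uses Python's `ctypes` module to lay out a C-style structure with a fixed-size string field followed by an integer. The structure `Record`, its field names and the 8-byte size are hypothetical stand-ins, since the original structure is not shown in this post.

```python
import ctypes


class Record(ctypes.Structure):
    # Hypothetical layout: a fixed-size C string followed by an integer flag.
    _fields_ = [
        ("name", ctypes.c_char * 8),
        ("flag", ctypes.c_int),
    ]


def overrun_demo(text: bytes) -> int:
    """Copy text into the struct without checking the size of the name field."""
    record = Record()  # ctypes zero-initializes the memory, so flag starts at 0
    # Like an unchecked strcpy, this ignores the 8-byte limit of the name field.
    # The copy is capped at the struct size so that the demo itself never
    # writes outside the memory it owns.
    length = min(len(text), ctypes.sizeof(Record))
    ctypes.memmove(ctypes.addressof(record), text, length)
    return record.flag


if __name__ == "__main__":
    print("flag after short string:", overrun_demo(b"short"))
    print("flag after long string: ", overrun_demo(b"ABCDEFGHIJ"))
```

With the short string the flag stays zero. With the ten-byte string the last two bytes spill into the integer and the flag becomes non-zero, which is the kind of silent corruption that later shows up far away from its cause.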
Shake the tree to get the high hanging fruit
You have already come this far. In your cart are a customer who is getting more and more irritated and worried, a memory dump, log files and performance monitor results. You completed an initial analysis, but had no luck so far. What is next? Staging the customer's system in the lab? No, not possible. There are too many dependencies that cannot be easily simulated, and we don't have the hardware to begin with. But don't give up just yet. In the following section I describe a few approaches to get more information about the bug and the environment that set it free.
Build a model
Analyze the customer's application and start building your own small model of it. Focus on features that seem to be unique to the application that encountered the bug. Start very basic and build the model at the same pace as you take the production application apart. This exercise will not only help you understand your customer's application, but will also give you a test system that makes it much easier to reproduce the bug, analyze it and test a fix for it. This scaled-down application will also make it easier to study the bug with instrumented versions of your software modules.
Break your model
If the model is complete and you still can't reproduce the bug, get creative and think about possible ways to break your model. Instrument your code to simulate exceptions and errors that are similar to the ones captured in the memory dump. Move these injected faults deeper and deeper into the call stack. Hopefully you get lucky and this exercise sparks the right idea.
Run your model with simulated load. For example, keep database transactions at the same level as in the production application. Consider using GUI automation tools to simulate user interaction. Leave the simulation running for long periods of time, e.g. over the weekend.
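One way to instrument a model for injected errors is a small fault injector that raises a chosen exception at a named point in the code. The sketch below is only an illustration under my own assumptions: `FaultInjector`, `read_device` and `poll_once` are hypothetical names, and a real model would place the checks where the memory dump suggests the failure occurred.

```python
class FaultInjector:
    """Raises a chosen exception at a named injection point on the N-th call."""

    def __init__(self) -> None:
        # Maps an injection point name to [calls left before failing, exception].
        self._plans = {}

    def arm(self, point: str, exc: Exception, after_calls: int = 1) -> None:
        """Schedule exc to be raised on the after_calls-th visit to point."""
        self._plans[point] = [after_calls, exc]

    def check(self, point: str) -> None:
        """Call this at each injection point; it raises when a fault is due."""
        plan = self._plans.get(point)
        if plan is None:
            return
        plan[0] -= 1
        if plan[0] <= 0:
            del self._plans[point]  # one-shot: the fault fires only once
            raise plan[1]


faults = FaultInjector()


def read_device() -> str:
    faults.check("read_device")  # deeper injection point
    return "42"


def poll_once() -> int:
    faults.check("poll_once")    # shallower injection point
    return int(read_device())


if __name__ == "__main__":
    faults.arm("read_device", TimeoutError("simulated device timeout"), after_calls=2)
    for attempt in range(3):
        try:
            print("attempt", attempt, "->", poll_once())
        except TimeoutError as err:
            print("attempt", attempt, "-> injected fault:", err)
```

Moving the `arm` call from `poll_once` to `read_device` is the code equivalent of taking the injected fault one level deeper in the call stack.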
Substitute tiers and modules
If your software allows for plug-and-play modules and providers, try switching to a different provider that has, for example, more analytics built in. In one case I used this approach and replaced a component that interfaces to external devices with an equivalent component that we used for testing and simulation. The latter provided extensive logging of trace messages. Reading these traces, which were not available in the production deployment, showed the famous square character again. The alarm light went on: BUFFER_OVERRUN. In an ASP.NET scenario I could imagine replacing an XML provider with a SQL Server provider and using its profiler tool to trace the interactions between the web server and the data store.
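The substitution idea can be sketched as a tracing provider that wraps the original one behind the same interface. Everything in this example is hypothetical: `DeviceProvider`, `ProductionProvider` and `TracingProvider` are invented names, and the canned frames stand in for data from a real device.

```python
from typing import List, Protocol


class DeviceProvider(Protocol):
    """Interface that the application uses to talk to a device."""

    def read(self) -> bytes: ...


class ProductionProvider:
    """Stand-in for the real device interface; it replays canned frames."""

    def __init__(self, frames: List[bytes]) -> None:
        self._frames = list(frames)

    def read(self) -> bytes:
        return self._frames.pop(0)


class TracingProvider:
    """Drop-in replacement that records every frame passing through it."""

    def __init__(self, inner: DeviceProvider, trace: List[str]) -> None:
        self._inner = inner
        self._trace = trace

    def read(self) -> bytes:
        data = self._inner.read()
        # The bytes repr escapes non-printable bytes, so stray control
        # characters become visible in the trace.
        self._trace.append(f"read {len(data)} bytes: {data!r}")
        return data


def consume(provider: DeviceProvider, count: int) -> List[bytes]:
    """Application code that only knows the DeviceProvider interface."""
    return [provider.read() for _ in range(count)]


if __name__ == "__main__":
    trace: List[str] = []
    provider = TracingProvider(ProductionProvider([b"OK", b"OK\x01"]), trace)
    consume(provider, 2)
    for line in trace:
        print(line)
```

Because `consume` depends only on the interface, the tracing variant can be swapped in without touching the application logic, and an unexpected byte at the end of a frame stands out in the trace.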
Being confronted with an occasional bug in a complex system can be quite overwhelming. There are three disciplines that help you to stay strong:
- Read and learn about your platform and how to debug it.
- Use tools to quickly get the low hanging fruit. This might be enough in most of the cases to identify the bug.
- Think out of the box and try to be creative. Build a model, play with it, break it and fix it. In the worst case you won't have found the cause of the bug, but you will at least have learned something new about your software.