I have been writing buggy code for nearly three decades, and I am coming around to the idea that each new bug I find should inspire a way to avoid it in the future or, at least, to reduce the chance of making the same mistake again. Over the past few weeks my team and I discovered two bugs, and here are my ideas for improving software quality.
The first bug is probably the longest-standing bug I have ever had. As you may know, I am working on a proprietary field bus. We found evidence of the problem at least three years ago: occasionally our devices lose a frame. It is not tragic, about one frame in 250, but since the protocol has no transmission acknowledgment, a lost frame may result in wrong device behavior.
We got the assembly driver from our customer, so the code was taken for granted to be correct. They use the same code in hundreds of devices, and they said they had never seen such a poor frame-reception figure.
This bug had a chance to live so long for two reasons: first, the customer accepted our firmware with this known problem; second, we were quite convinced that this problem would have appeared in their devices as well.
One event broke this impasse: we had serious trouble at an installation, and it eventually turned out that one device had a slightly defective crystal that caused major frame dropouts, wreaking havoc on the whole system. This led to an investigation by the hardware guys, who told me that crystal precision was critical to correct device operation. The crystal has a tolerance of 0.5%, and our device started behaving weirdly when the frequency drifted 0.4% away from nominal. This was clearly unacceptable, and a solution was urgently needed.
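To put numbers on that margin, here is a minimal C sketch of the arithmetic. The function names are mine, and the 8 MHz nominal frequency used in the note below is purely an assumption for illustration; only the 0.5% and 0.4% figures come from the story.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case frequency offset in Hz for a tolerance given in parts per
 * million (1% = 10,000 ppm). The product is computed in 64 bits so that
 * it cannot overflow for any 32-bit nominal frequency. */
uint32_t max_offset_hz(uint32_t nominal_hz, uint32_t tolerance_ppm)
{
    assert(tolerance_ppm <= 1000000u);  /* more than 100% makes no sense */
    return (uint32_t)(((uint64_t)nominal_hz * tolerance_ppm) / 1000000u);
}

/* Returns 1 only when every in-spec crystal stays strictly inside the
 * deviation the receiver can cope with, 0 otherwise. */
int tolerance_is_safe(uint32_t crystal_tolerance_ppm,
                      uint32_t receiver_limit_ppm)
{
    return crystal_tolerance_ppm < receiver_limit_ppm;
}
```

With the hypothetical 8 MHz crystal, `max_offset_hz(8000000u, 5000u)` gives 40000 Hz of allowed drift, while the receiver misbehaves from `max_offset_hz(8000000u, 4000u)`, that is 32000 Hz. `tolerance_is_safe(5000u, 4000u)` returns 0: a crystal that is fully within its datasheet can still break frame reception.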
I went through some fruitless investigations before discovering that another firmware, mainly developed by our customer, was much more reliable than ours. After much head-scratching and head-banging against the nearest wall, I went through the aging documentation that the customer had given us for integrating the bus driver.
There I found that it was required to set a global variable, let’s call it FLEX, to half the value of a global constant.
Then I turned to the code and found an old comment of mine that went like this: /* FLEX and GIZMO seem not to be used anywhere in the code, they must be private variables */
Of course, removing the comment and properly initializing the variable sent the problem the way of the dodo.
How could this have been prevented? That’s an easy one: don’t make the programmer manually set obscure and arcane magic values in oddly named global variables. Either provide a complete setup function or let the magic value be computed at compile time.
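Both remedies can be combined. What follows is only a sketch under stated assumptions: `BUS_TIMING_CONSTANT`, its value of 200, `FLEX_INIT` and `bus_driver_setup` are hypothetical names and numbers of mine, since the real constant and the driver's interface are not something I can show here. Only the "FLEX is half of a global constant" rule comes from the documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's global constant; the real name
 * and value are not given here. */
#define BUS_TIMING_CONSTANT 200u

/* The magic value is derived at compile time, never typed in by hand. */
#define FLEX_INIT (BUS_TIMING_CONSTANT / 2u)

/* Refuse to build if halving would silently drop a remainder. */
static_assert(BUS_TIMING_CONSTANT % 2u == 0u,
              "BUS_TIMING_CONSTANT must be even so that FLEX is exact");

/* Global read by the driver; volatile because code outside the
 * compiler's view (the assembly driver) accesses it. */
volatile uint16_t FLEX;

/* Single entry point: the integrator calls this once and never has to
 * know that FLEX exists, let alone which value it needs. */
void bus_driver_setup(void)
{
    FLEX = FLEX_INIT;
}
```

The integrator's job shrinks to calling `bus_driver_setup()` before starting the bus. If someone later changes the constant, FLEX follows automatically, and an odd value stops the build instead of surfacing three years later as dropped frames.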
I’ll write about the other bug tomorrow. In the meantime, your comments are welcome, as always.