As promised, here I am to write about the other bug. This one needs a bit a background to be understood. At the beginning of the current project (that means some 2 years ago) I designed a software timer for providing time services. User code that needs notification about time event may register the software timer specifying the time event. When the time event occurs the user code is called back.
Software timers are identified by a numeric id. Being an 8bit architecture with a severely constrained data memory, I picked an unsigned 8 bit number as timer id. This id indexes a private array containing the software timer parameters.
There are two kind of timers – repeating and one shot. A repeating timer repeatedly calls you back at given intervals, while an one-shot timer just calls you back once when it expires.
Repeating timers need to be explicitly canceled, but when I was designing the class, it felt like a good idea to let one-shot timer self-cancel pending their expiration.
Time passed, Software timer component being tried and tested became trusted. Then I got some mysterious bugs – code that used to work stopped being called back by timer. It took a while to discover what happened. After the one-shot timer executed its callback the user code still had some dangling references to the timer id.
Occasionally user code could inadvertently cancel the timer pointed to by the dangling reference resulting either in an assertion about a non-existing timer (turned into a reset in release code), or in a deletion of an innocent bystander timer.
I thought a while on the matter and I came to the conclusion that it is not good for the timer to self-delete. Rather is must be up to the user code to get rid of the timer. Unfortunately I cannot apply the lesson learned on the current project because the code base is too large and we are releasing, nonetheless I’ll keep a reminder for the next project.
While talking about timer callback, a colleague found another problem – if you perform timer registration and removal during the callback you are in for serious trouble. In fact the code for the timer dispatcher was about like this –
for( id = 0; id < TIMER_COUNT; ++id ) { if( timerHasExpired(id) ) { callbackUserCode( id ); if( timerIsOneShot( id )) { deleteTimer( id ); } else { // timer is repeat reloadTimer( id ); } } }
This code appears good as long as nothing happens on the timers repository during callback. Consider what happens if a one-shot timer is canceled and a new timer is created in the callback code…
There are two ways to deal with this – either queue timer creation/deletion operation during the callback and execute them when the above code is completed. The other is to re-consider the current time nature after the callback, i.e. consider the timer repository “volatile” across the callback. This is our choice –
for( id = 0; id < TIMER_COUNT; ++id ) { if( timerHasExpired(id) ) { callbackUserCode( id ); if( timerHasExpired( id )) { if( timerIsOneShot( id )) { deleteTimer( id ); } else { // timer is repeat reloadTimer( id ); } } } }
Lessons learned – 1) don’t provide classes that mixes automatic and explicit object disposal – either the user must always delete objects or the user must never delete objects. 2) critical runs are not just about interrupts, but they may occur in callback as well. Calling back code should be programmed defensively expecting that the user code may wreak havoc on the structure. Should this not be possible then a state variable declaring whether the class is in “callback execution” or “normal execution” must be provided and functions that manipulates class state have to be made aware of the state so to act directly or to defer actions at the end of “callback execution” state.