Async watches and their creators

Fri Apr 15 13:33:09 UTC 2011

Hi Folks,

This is a bit technical, regarding the ensemble internals.

I've noticed we have a pattern in our code that leads to some subtle
timing issues, which I wanted to address.

We have several watcher-based apis in the ensemble code base that
allow us to react to notifications of changes to the zookeeper
state. One of the more common themes is for the watch api to be
exposed in a domain-object-specific fashion with a callback. As an
example we have

   machine_state.watch_assigned_units(callback)

which will invoke the callback whenever the assigned units
for a machine change.

The other common aspect of these watch apis is that they fire
initially (and immediately) based on the current state of zookeeper.
This state is in turn used by the callbacks to set up and initialize
their own state and behavior. It's a key aspect of ensemble being a
state observation system and not simply event messaging: any component
in the system can recover based on its environment/cluster stored state.

The problem arises in that the apis which set up these watches
return as soon as the watch is in place, even though creating the
watch has left initialization work outstanding. This leads to some
timing issues when dealing with initialization. I've identified this
problem in a few places in the codebase.

  - ensemble.unit.lifecycle.UnitLifecycle.start

When the start method returns, there is still pending activity in the
form of the initial unit relation watch processing occurring in the
background.

This typically leads to some testing difficulties on transition to start,
because the unit relation processing is still occurring in the
background by the time the test ends and cleans up the connection. The
tests have evolved other mechanisms for waiting on or stopping this
background activity, but in some cases the use of those techniques
is superfluous to the test and artificial to its purpose.

  - debug-log/debug-hook subcommands.

I was examining why the start and install hooks are seldom logged by
the debug-log subcommand, and noticed the same problem exists there.
The watch is set, and the code continues on to execute the install and
start hooks, which are done by the time the debug-log becomes active.
The same problem holds true for the debug-hook watch setup.
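
To make the timing issue concrete, here is a minimal sketch of it. It is
not ensemble code: it uses asyncio from the standard library as a
stand-in for our Twisted deferreds, and FakeWatchedState,
watch_relations and demo are hypothetical names made up for the
illustration.

```python
import asyncio


class FakeWatchedState:
    """Hypothetical stand-in for a domain object exposing a watch api."""

    def __init__(self, relations):
        self.relations = relations
        self._tasks = []  # keep references to the background work

    def watch_relations(self, callback):
        # Schedule the initial notification from the current state, then
        # return immediately, before the callback has had a chance to run.
        task = asyncio.ensure_future(callback(list(self.relations)))
        self._tasks.append(task)


async def demo():
    processed = []

    async def on_relations(relations):
        await asyncio.sleep(0)  # simulated round trip to zookeeper
        processed.extend(relations)

    state = FakeWatchedState(["db"])
    state.watch_relations(on_relations)
    at_return = list(processed)  # what the caller sees when setup returns
    await asyncio.sleep(0.01)    # give the background work time to finish
    return at_return, processed
```

Running demo() shows the watch setup returning with nothing processed
yet, while the initial callback completes only later in the background.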

The fix I'd propose is a watch callback wrapper factory, which
returns a deferred that will fire after the first time the callback is
invoked. The watch consumer api will only return after this deferred has
fired. This ensures that at the conclusion of the watch api consumer
function, not only are the watches in place, which is what they currently
guarantee, but the watches have also fired once and are done with
their initial setup.
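
A rough sketch of what such a wrapper factory might look like follows.
It is an illustration under assumptions, not the proposed
implementation: it uses an asyncio future in place of a Twisted
deferred, and watch_with_first_fire, FakeMachineState and start are
hypothetical names (only watch_assigned_units mirrors the api mentioned
above).

```python
import asyncio


def watch_with_first_fire(callback):
    """Wrap callback; return (wrapped, first_fired).

    first_fired resolves once the wrapped callback has completed its
    first invocation, or fails with that invocation's error.
    """
    first_fired = asyncio.get_running_loop().create_future()

    async def wrapped(*args, **kwargs):
        try:
            result = callback(*args, **kwargs)
            if asyncio.iscoroutine(result):
                result = await result
        except Exception as error:
            if not first_fired.done():
                first_fired.set_exception(error)
            raise
        if not first_fired.done():
            first_fired.set_result(None)
        return result

    return wrapped, first_fired


class FakeMachineState:
    """Hypothetical stand-in for a domain object with a watch api."""

    def __init__(self, units):
        self.units = units
        self._tasks = []

    def watch_assigned_units(self, callback):
        # Schedules the initial notification and returns before it runs.
        task = asyncio.ensure_future(callback(list(self.units)))
        self._tasks.append(task)


async def start(machine_state, seen):
    """Watch consumer: returns only after the first callback has run."""

    async def on_units(units):
        await asyncio.sleep(0)  # simulated initialization work
        seen.extend(units)

    wrapped, first_fired = watch_with_first_fire(on_units)
    machine_state.watch_assigned_units(wrapped)
    await first_fired
```

With this in place, a caller of start() can rely on the initial state
having been processed by the time it returns, so a test would not need
a separate mechanism to wait on the background activity.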