Nagios vs Juju monitoring

Kapil Thangavelu kapil.thangavelu at canonical.com
Fri Nov 23 15:26:52 UTC 2012


On Fri, Nov 23, 2012 at 9:50 AM, Marco Ceppi <marco at ondina.co> wrote:

> That's an interesting bug(?). Looking at the hooks, it appears the same
> hook for relation-departed is the one registered with relation-broken.
> I'm not sure of the workflow in Juju with regards to "machine manually
> removed" but odds are it triggers a relation-broken (as in an unclean
> departure) which results in Nagios charm removing the relation as if
> remove-relation command was run.


A manual machine removal should trigger a relation-departed hook for the
unit whose machine has gone away, not a relation-broken. That is, it's a
unit removal, not a removal of the relation between services. If the nagios
charm treats broken == departed, that's an issue in the charm, although
whether the charm actually does so is unclear to me. Restoring that machine
should trigger a relation-joined hook execution. My initial reading of
Thomas's email suggests that this is indeed what happens, with the caveat
that the nagios charm removes the unit on depart instead of just leaving it
in the nagios config.

The distinction for the hook here is subtle: the unit is not available, but
in this context the nagios charm would like to distinguish between a
failure (unit unreachable) and a removal (remove-unit, state is gone). To me
this suggests a use for an additional hook command that queries whether the
unit's state still exists, which would let the charm handle this
unambiguously (relation-unit-exists <unit/id> or just unit-exists
<unit/id>). In the absence of that, the charm could resort to some implicit
heuristics (depart feeds a queue, cron checks the queue, and if the unit is
still dead after X time, remove it permanently).


> I'm not exactly sure of the charm relation workflow (for instance, does
> a destroy-service also fire a relation-broken event, or does it use the
> "cleaner" relation-departed event?), but that might be one place to
> start investigating.

Service removal fires the relation-broken hook on the other related
services. Since all units of the removed service are gone in that context,
there isn't any value in processing them individually rather than in the
one hook that signifies the entire set and that the relation itself is
permanently gone.
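
To make the departed/broken distinction concrete, here is a rough Python
sketch of a charm-side dispatcher that handles the two hooks differently. The
hook names with the "monitors" prefix, the in-memory `monitored` dict and the
way the remote unit or service name is passed in are all assumptions for
illustration, not the nagios charm's actual code.

```python
def on_departed(monitored, remote_unit):
    """One unit went away: keep it in the config, but flag it as departed."""
    if remote_unit in monitored:
        monitored[remote_unit] = "departed"
    return monitored


def on_broken(monitored, remote_service):
    """The relation itself is gone: drop every unit of that service at once."""
    prefix = remote_service + "/"
    for unit in [u for u in monitored if u.startswith(prefix)]:
        del monitored[unit]
    return monitored


def dispatch(hook_name, monitored, remote):
    """Route a hook invocation to its handler based on the hook's name.

    `remote` is a unit name for departed and a service name for broken;
    how a real hook would obtain either value is outside this sketch.
    """
    if hook_name.endswith("-relation-departed"):
        return on_departed(monitored, remote)
    if hook_name.endswith("-relation-broken"):
        return on_broken(monitored, remote)
    raise ValueError("unhandled hook: " + hook_name)
```

For example, dispatch("monitors-relation-departed", state, "mysql/0") would
only flag mysql/0, while dispatch("monitors-relation-broken", state, "mysql")
would drop all mysql units in one step.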


Cheers,

Kapil




> Thanks,
> Marco Ceppi
>
> On 11/23/2012 09:38 AM, Thomas Leonard wrote:
> > Hi guys,
> >
> > We're trying to use Juju to set up a nagios service to monitor various
> > other Juju-deployed services. We deployed nagios and some services and
> > added relations between them. Nagios showed all the hosts in green. Good.
> >
> > Then, to test it, we paused one of the service VMs. We were hoping
> > that nagios would show the failed service in red and send a warning
> > notification.
> >
> > Instead, the service simply disappeared from the nagios display. It
> > looks like Juju detected that the machine wasn't available and removed
> > the relation.
> >
> > We unpaused the machine and then manually restarted juju-machine-agent
> > on it and it reappeared in nagios.
> >
> > It seems like this behaviour isn't very useful; is it intentional?
> >
> >
>
>
> --
> Juju mailing list
> Juju at lists.ubuntu.com
> Modify settings or unsubscribe at:
> https://lists.ubuntu.com/mailman/listinfo/juju
>