[Bug 1878548] Re: There are cases when masakari-hostmonitor will recognize online nodes as offline and send (in)appropriate notifications to Masakari

Edward Hope-Morley 1878548 at bugs.launchpad.net
Sat Dec 2 14:16:08 UTC 2023


Verified victoria-proposed with the following output:

$ apt-cache policy masakari-monitors-common
masakari-monitors-common:
  Installed: 10.0.0-0ubuntu1~cloud1
  Candidate: 10.0.0-0ubuntu1~cloud1
  Version table:
 *** 10.0.0-0ubuntu1~cloud1 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu focal-proposed/victoria/main amd64 Packages
        100 /var/lib/dpkg/status
     9.0.0-0ubuntu0.20.04.1 500
        500 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
     9.0.0~b3~git2020041013.e225e6d-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages

Tested by cutting off communication between a compute host and all
masakari/corosync units, e.g.:

# do this on all masakari units
juju run -a masakari -- sudo iptables -I INPUT -p tcp -s 10.0.0.145 --sport 3121 -j REJECT

# then wait for compute host 10.0.0.145 to get rebooted
ubuntu@maas:~/stsstack-bundles/openstack$ openstack notification list
+--------------------------------------+----------------------------+--------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| notification_uuid                    | generated_time             | status | type         | source_host_uuid                     | payload                                                                    |
+--------------------------------------+----------------------------+--------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| d8406e44-169e-4c70-b496-051827593c0e | 2023-12-02T14:12:35.000000 | new    | COMPUTE_HOST | 6c48bfc2-0fd1-4f71-9392-986ab8e1b401 | {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'} |
+--------------------------------------+----------------------------+--------+--------------+--------------------------------------+----------------------------------------------------------------------------+

Only the single node was rebooted and no others.
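
The same check can be expressed programmatically. What follows is a
minimal sketch only, not part of the verification that was run: it
assumes the notification rows are available as dictionaries keyed by the
column names shown in the table above, and the function name
unexpected_source_hosts is hypothetical.

```python
from typing import Dict, Iterable, Set


def unexpected_source_hosts(
    notifications: Iterable[Dict[str, str]], expected_host_uuid: str
) -> Set[str]:
    """Return source host UUIDs other than the deliberately isolated host."""
    return {
        n["source_host_uuid"]
        for n in notifications
        if n["source_host_uuid"] != expected_host_uuid
    }


# Row copied from the notification list above (columns trimmed for brevity).
notifications = [
    {
        "notification_uuid": "d8406e44-169e-4c70-b496-051827593c0e",
        "type": "COMPUTE_HOST",
        "source_host_uuid": "6c48bfc2-0fd1-4f71-9392-986ab8e1b401",
    }
]

isolated_host = "6c48bfc2-0fd1-4f71-9392-986ab8e1b401"
print(unexpected_source_hosts(notifications, isolated_host))
```

An empty set means every notification refers to the isolated host, which
matches the expected behaviour once the fix is in place.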

** Tags removed: verification-victoria-needed
** Tags added: verification-victoria-done

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1878548

Title:
  There are cases when masakari-hostmonitor will recognize online nodes
  as offline and send (in)appropriate notifications to Masakari

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Committed
Status in Ubuntu Cloud Archive victoria series:
  Fix Committed
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in masakari-monitors:
  Fix Released
Status in masakari-monitors ussuri series:
  Fix Released
Status in masakari-monitors victoria series:
  Fix Released
Status in masakari-monitors wallaby series:
  Fix Released
Status in masakari-monitors xena series:
  Fix Released
Status in masakari-monitors package in Ubuntu:
  Fix Released
Status in masakari-monitors source package in Focal:
  Fix Committed

Bug description:
  [Issue]
  ComputeNodes are managed by pacemaker_remote in my environment.
  When one ComputeNode is isolated from the network, the masakari-hostmonitors on the other ComputeNodes send a failure notification about the isolated ComputeNode to masakari-api.
  At the same time, the masakari-hostmonitor on the isolated ComputeNode recognizes the other ComputeNodes as offline, so it sends failure notifications about ComputeNodes that are actually online.
  As a result, masakari-engine runs the recovery procedure against online ComputeNodes.

  [Cause]
  The current masakari-hostmonitor can't determine whether or not it is isolated in the network if ComputeNodes are managed by pacemaker_remote.

  masakari-hostmonitor with pacemaker (not remote) waits until it is killed if it is isolated from the network. This is implemented in the following code.
  <https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L398-L402>

  But masakari-hostmonitor with pacemaker_remote does not determine whether it is isolated.
  <https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/hostmonitor/host_handler/handle_host.py#L93-L95>

  [Solution]
  The ComputeNode managed by pacemaker_remote should recognize itself as offline when it is isolated.
  The state monitoring process should be skipped in that case.

  See comment #11 for how yoctozepto managed to reproduce something
  similar to the described.
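
The idea in the [Solution] section can be illustrated with a short
sketch. This is not the code from masakari-monitors: NodeState and
nodes_to_report are hypothetical names, and the sketch assumes the
monitor already has a per-node online/offline view in which an isolated
node sees itself as offline (or missing).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NodeState:
    """Hypothetical per-node status as seen by the local monitor."""
    name: str
    online: bool


def nodes_to_report(local_host: str, states: Iterable[NodeState]) -> List[str]:
    """Return names of other nodes that should be reported as failed.

    If the local node is offline or absent from the status view, it is
    assumed to be isolated, so state monitoring is skipped and nothing
    is reported.
    """
    states = list(states)
    local = next((s for s in states if s.name == local_host), None)
    if local is None or not local.online:
        return []  # isolated: do not send notifications about other nodes
    return [s.name for s in states if s.name != local_host and not s.online]


# Example view: compute-1 is isolated, compute-2 is healthy.
view = [
    NodeState("compute-1", online=False),
    NodeState("compute-2", online=True),
    NodeState("compute-3", online=False),
]
print(nodes_to_report("compute-1", view))
print(nodes_to_report("compute-2", view))
```

With this guard, a monitor that finds its own node offline stays silent,
while a healthy monitor still reports the nodes it sees as offline.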

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1878548/+subscriptions



