[Bug 1825843] Re: systemd issues with bionic-rocky causing nagios alert and can't restart daemon

Tue Apr 23 14:41:29 UTC 2019

After IRC conversations and more testing, I think that I have a clean
reproduction of this bug, along with a root cause.

The root cause: the charm takes control of the radosgw service and
changes its name, but doesn't remove the old nrpe check.

To reproduce:

1) juju deploy the following bundle: https://paste.ubuntu.com/p/wpVt447Vwz/
2) juju ssh into ceph-radosgw/0 and note that there is a "check_radosgw.cfg" in /etc/nagios/nrpe.d.
3) Trigger the config-changed hook on the ceph-radosgw charm, for example by changing the number of ceph replicas.
4) Note that there is now a "check_ceph-radosgw@<hostname>.cfg" check, in addition to the check_radosgw.cfg check.
5) Run both checks (cat the files to get the command). Note that the new, hostname-based check succeeds, but the old check does not.
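
For step 5, the command can also be pulled out of a check file
programmatically instead of by eye. The following is a minimal sketch,
assuming the files use the standard NRPE "command[<name>]=<command line>"
form. The function name parse_nrpe_commands, the sample file content and
the plugin name check_example are hypothetical; the real files on the
unit may contain a different plugin and arguments.

```python
import re

# NRPE command definitions have the form: command[<name>]=<command line>
COMMAND_RE = re.compile(r"^\s*command\[(?P<name>[^\]]+)\]\s*=\s*(?P<cmd>.+?)\s*$")


def parse_nrpe_commands(text):
    """Return {check_name: command_line} for each command[...] line in text."""
    commands = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = COMMAND_RE.match(line)
        if match:
            commands[match.group("name")] = match.group("cmd")
    return commands


# Hypothetical file content; the real plugin path and arguments may differ.
sample = """
# hypothetical comment line
command[check_radosgw]=/usr/lib/nagios/plugins/check_example radosgw
"""

for name, cmd in parse_nrpe_commands(sample).items():
    print(name, "->", cmd)
```

Each command line returned this way can then be run by hand on the unit
to compare the old and new checks.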

The original check will also fail if you run it during step 2,
suggesting that the service has been changed, but the nagios monitoring
is not updated until the config-changed hook runs.

This bug can be closed once the charm places checks in
/etc/nagios/nrpe.d that accurately reflect the running services, and
cleans up outdated checks as well.
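
The cleanup condition above can be sketched as a comparison between the
check files present and the services that actually exist. This is only
an illustration, not the charm's code: it assumes the
"check_<service>.cfg" naming seen in the reproduction, and the function
name stale_checks and the hostname "juju-abc123" are hypothetical.

```python
def stale_checks(check_files, active_services):
    """Return check files whose service name is not among active_services.

    Assumes the naming pattern seen in this bug: check_<service>.cfg.
    """
    stale = []
    for filename in sorted(check_files):
        if not (filename.startswith("check_") and filename.endswith(".cfg")):
            continue  # not a check file under this naming pattern
        service = filename[len("check_"):-len(".cfg")]
        if service not in active_services:
            stale.append(filename)
    return stale


# File names from the reproduction; "juju-abc123" is a hypothetical hostname.
files = ["check_radosgw.cfg", "check_ceph-radosgw@juju-abc123.cfg"]
services = {"ceph-radosgw@juju-abc123"}
print(stale_checks(files, services))
```

With the renamed service as the only one present, the old
check_radosgw.cfg is the file reported as stale, which matches the
leftover check described in the reproduction.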

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1825843

Title:
  systemd issues with bionic-rocky causing nagios alert and can't
  restart daemon

Status in OpenStack ceph-radosgw charm:
  Triaged
Status in ceph package in Ubuntu:
  Invalid

Bug description:
  During deployment of a bionic-rocky cloud on 19.04 charms, we are
  seeing an issue with the ceph-radosgw units related to the systemd
  service definition for radosgw.service.

  If you look through this pastebin, you'll notice that there is a
  running radosgw daemon and the local haproxy unit thinks all radosgw
  backend services are available (via nagios check), but systemd can't
  control radosgw properly. Before a restart with systemd, systemd just
  showed the unit as loaded inactive; it now shows active exited, but
  the restart did not actually restart the radosgw service.

  https://pastebin.ubuntu.com/p/Pn3sQ3zHXx/

  charm: cs:ceph-radosgw-266
  cloud:bionic-rocky
   *** 13.2.4+dfsg1-0ubuntu0.18.10.1~cloud0 500
          500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/rocky/main amd64 Packages

  ceph-radosgw/0                    active    idle   18/lxd/2  10.20.175.60    80/tcp                                   Unit is ready
    hacluster-radosgw/2             active    idle             10.20.175.60                                             Unit is ready and clustered
  ceph-radosgw/1                    active    idle   19/lxd/2  10.20.175.48    80/tcp                                   Unit is ready
    hacluster-radosgw/1             active    idle             10.20.175.48                                             Unit is ready and clustered
  ceph-radosgw/2*                   active    idle   20/lxd/2  10.20.175.25    80/tcp                                   Unit is ready
    hacluster-radosgw/0*            active    idle             10.20.175.25                                             Unit is ready and clustered
