[Bug 1825843] Re: systemd issues with bionic-rocky causing nagios alert and can't restart daemon
Angel Vargas
angelvargas at outlook.es
Tue Apr 23 00:50:28 UTC 2019
We upgraded the charms and radosgw broke (ceph-radosgw release 267).
After hours of debugging, we decided to deploy a fresh 4-node OpenStack base to investigate the problem and try to revert. On that fresh deployment, juju shows:
ceph-radosgw/0* blocked idle 0/lxd/0 10.100.0.61 80/tcp Services not running that should be: ceph-radosgw at rgw.juju-168b18-0-lxd-0
If we restart the LXD container and then run:
sudo service radosgw status
we get:
Apr 23 00:18:05 juju-168b18-0-lxd-0 radosgw[36885]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:18:05 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.
Apr 23 00:19:22 juju-168b18-0-lxd-0 systemd[1]: Stopping LSB: radosgw RESTful rados gateway...
Apr 23 00:19:22 juju-168b18-0-lxd-0 systemd[1]: Stopped LSB: radosgw RESTful rados gateway.
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: radosgw.service: Failed to reset devices.list: Operation not permitted
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 23 00:19:26 juju-168b18-0-lxd-0 radosgw[37618]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:19:26 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.
Apr 23 00:21:48 juju-168b18-0-lxd-0 systemd[1]: Stopping LSB: radosgw RESTful rados gateway...
Apr 23 00:21:49 juju-168b18-0-lxd-0 systemd[1]: Stopped LSB: radosgw RESTful rados gateway.
That is the output after a fresh boot. If we then run:
sudo service radosgw start
we get the service running:
● radosgw.service - LSB: radosgw RESTful rados gateway
   Loaded: loaded (/etc/init.d/radosgw; generated)
   Active: active (running) since Tue 2019-04-23 00:22:47 UTC; 17min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 811 ExecStart=/etc/init.d/radosgw start (code=exited, status=0/SUCCESS)
    Tasks: 582 (limit: 7372)
   CGroup: /system.slice/radosgw.service
           └─850 /usr/bin/radosgw -n client.rgw.juju-168b18-0-lxd-0
Apr 23 00:22:46 juju-168b18-0-lxd-0 systemd[1]: Starting LSB: radosgw RESTful rados gateway...
Apr 23 00:22:46 juju-168b18-0-lxd-0 radosgw[811]: Starting client.rgw.juju-168b18-0-lxd-0...
Apr 23 00:22:47 juju-168b18-0-lxd-0 systemd[1]: Started LSB: radosgw RESTful rados gateway.
but juju still keeps showing the unit as blocked.
This is the juju log for ceph-radosgw:
https://paste.ubuntu.com/p/kb3g9XZ7nb/
We are getting the same behaviour in our production and test
environments. Even when the service is running, the unit does not seem
to work from the OpenStack perspective; for example, when we try to
create a bucket, the API does not accept connections.
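As a quick sanity check (a sketch, not part of our deployment; the URL is an assumption based on the unit address in the juju status above), something like this can tell whether the RGW endpoint answers HTTP at all, independently of what systemd or juju report:

```python
# Hedged sketch: probe the radosgw HTTP endpoint directly.
# The address/port below are assumed from the juju status output; adjust locally.
import urllib.request
import urllib.error

def probe_rgw(url, timeout=5):
    """Return (reachable, status_or_error) for a radosgw endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Any HTTP response means the daemon answered.
            return True, resp.status
    except urllib.error.HTTPError as e:
        # An HTTP error code (e.g. 403 from RGW without credentials)
        # still means the daemon is up and answering.
        return True, e.code
    except (urllib.error.URLError, OSError) as e:
        # Connection refused / timeout: nothing is listening.
        return False, str(e)

if __name__ == "__main__":
    print(probe_rgw("http://10.100.0.61:80/"))
```

If this returns `(False, ...)` even while systemd shows the unit as active, that matches the symptom that the LSB-generated unit's state does not reflect the actual daemon.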
How can I help?
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1825843
Title:
systemd issues with bionic-rocky causing nagios alert and can't
restart daemon
Status in OpenStack ceph-radosgw charm:
Triaged
Status in ceph package in Ubuntu:
New
Bug description:
During deployment of a bionic-rocky cloud on 19.04 charms, we are
seeing an issue with the ceph-radosgw units related to the systemd
service definition for radosgw.service.
If you look through this pastebin, you'll notice that there is a
running radosgw daemon and the local haproxy unit thinks all radosgw
backend services are available (via the nagios check), but systemd
cannot control radosgw properly. (Note that before a restart via
systemd, systemd showed the unit as loaded/inactive; it now shows
active/exited, even though that restart did not actually restart the
radosgw daemon.)
https://pastebin.ubuntu.com/p/Pn3sQ3zHXx/
charm: cs:ceph-radosgw-266
cloud:bionic-rocky
*** 13.2.4+dfsg1-0ubuntu0.18.10.1~cloud0 500
500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/rocky/main amd64 Packages
ceph-radosgw/0 active idle 18/lxd/2 10.20.175.60 80/tcp Unit is ready
hacluster-radosgw/2 active idle 10.20.175.60 Unit is ready and clustered
ceph-radosgw/1 active idle 19/lxd/2 10.20.175.48 80/tcp Unit is ready
hacluster-radosgw/1 active idle 10.20.175.48 Unit is ready and clustered
ceph-radosgw/2* active idle 20/lxd/2 10.20.175.25 80/tcp Unit is ready
hacluster-radosgw/0* active idle 10.20.175.25 Unit is ready and clustered
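Since the unit systemd is managing here is generated from the LSB init script (note "/etc/init.d/radosgw; generated" in the status output), one avenue worth exploring is a native unit. A minimal sketch follows; this is untested, the client name is taken from the logs above, and the packaged ceph units should be preferred if they work on this release:

```ini
# /etc/systemd/system/radosgw.service -- hedged sketch, not the packaged unit.
# Verify the binary path and client name locally before use.
[Unit]
Description=radosgw (native unit sketch)
After=network-online.target

[Service]
ExecStart=/usr/bin/radosgw -f -n client.rgw.juju-168b18-0-lxd-0
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Running the daemon in the foreground (`-f`) lets systemd track the main process directly, instead of the active/exited state the LSB wrapper produces.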
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-ceph-radosgw/+bug/1825843/+subscriptions