[Bug 1724529] Re: ceph health output flips between OK and WARN all the time

Drew Freiberger 1724529 at bugs.launchpad.net
Thu Nov 30 18:56:10 UTC 2017


We are seeing this same flapping on another cloud.  The HEALTH_WARN flapping began yesterday, when one node rebooted.  The cloud is running the 10.2.7 ceph package from the trusty/mitaka cloud archive.
That server is the one giving the HEALTH_WARN for too many PGs.
Another server, rebooted 7 days ago, was not giving the warning, and the third server was still running a ceph-mon from March at 10.2.3.  I had to kill that ceph-mon (as /etc/init.d/ceph restart mon did not work), and it is now running the 10.2.7 mon.
Some OSDs are running 10.2.6 and some 10.2.7, according to "ceph tell osd.* version".
After a restart of the third mon (up for 10 days), which also required a kill, the error is no longer flapping.
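
To see the version mix at a glance, the output of "ceph tell osd.* version" can be tallied.  This is only a minimal sketch: summarize_versions is a hypothetical helper, not something from this bug, and it assumes each daemon's reply contains a string of the form "ceph version X.Y.Z".

```shell
#!/bin/sh
# Hypothetical helper: tally daemon versions from text read on stdin.
# Assumes every reply contains a substring like "ceph version 10.2.7".
summarize_versions() {
    grep -o 'ceph version [0-9][0-9.]*' |
        awk '{print $3}' |
        sort |
        uniq -c |
        awk '{print $2, $1}'   # print "<version> <count>"
}

# Usage on a live cluster (not run here):
#   ceph tell osd.* version | summarize_versions
```

More than one line of output would mean the daemons are not all on the same version.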

It seems the /etc/init.d/ceph command is not properly allowing for mon
restarts (on the ceph charm, not the ceph-mon charm) when OSDs are
present, though I haven't tested without OSDs present.  Having to kill
the process with a standard signal to get it to recycle is odd.  Perhaps
the restart is being blocked by init daemon configs, but that's a side
issue.

I'm guessing what actually happened is that someone ran a "ceph tell
mon.*" to ignore the PG counts, and then the restarts caused the setting
to be dropped.  This may be something to re-open against the ceph and
ceph-mon charms, to add config options for the ceph HEALTH_WARN
thresholds, or we can close this bug and open another.

The flapping makes so much more sense in this context of a ceph tell
mon.* having been run in the past.
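
A toy model of that guess, to show why the output would alternate.  Everything here is an assumption for illustration: the function names, the 450 PGs-per-OSD figure, and the 300 fallback value are made up, and the model assumes each health query is answered by whichever monitor the client reaches, using that monitor's own in-memory threshold.

```shell
#!/bin/sh
# Toy model only: numbers and function names are hypothetical.
# Compare a fixed PGs-per-OSD figure with one monitor's warning threshold.
health_for() {
    pgs_per_osd=$1
    warn_max=$2
    if [ "$pgs_per_osd" -gt "$warn_max" ]; then
        echo HEALTH_WARN
    else
        echo HEALTH_OK
    fi
}

# Answer one health query per monitor threshold given as an argument.
simulate_queries() {
    pgs_per_osd=$1
    shift
    for warn_max in "$@"; do
        health_for "$pgs_per_osd" "$warn_max"
    done
}

# Two monitors still hold an injected 900; one restarted and lost it.
simulate_queries 450 900 300 900
```

With the PG count unchanged, the answer depends only on which monitor replies, which would match the flapping seen here.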

We've got notes in a related case on another cloud to work around this
with the config-flags setting in the charm, but would love to see more
of these operational monitoring settings exposed by the charm directly
rather than relying on config-flags.

Here's the command to change the setting on the live ceph-mons:
- ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd=900'

Here's the command to configure the juju ceph charm to persist the setting:
- juju set ceph config-flags='{osd: {"mon pg warn max per osd": 900}}'
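
After either change, it may be worth confirming that every monitor ended up with the same value, since a mismatch is what would bring the flapping back.  A minimal sketch: thresholds_agree is a hypothetical helper that reads "<mon-name> <value>" lines collected by hand, for example from "ceph daemon mon.<id> config get mon_pg_warn_max_per_osd" run on each mon host (the mon id is an assumption and depends on the deployment).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if all monitors report the same value.
# Reads lines of the form "<mon-name> <value>" on stdin.
thresholds_agree() {
    distinct=$(awk '{print $2}' | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -le 1 ]
}

# Usage with values gathered by hand (not run here):
#   printf 'mon.a 900\nmon.b 900\nmon.c 300\n' | thresholds_agree \
#       || echo "monitors disagree on mon_pg_warn_max_per_osd"
```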

-- 
https://bugs.launchpad.net/bugs/1724529

Title:
  ceph health output flips between OK and WARN all the time

Status in OpenStack ceph-mon charm:
  Invalid
Status in ceph package in Ubuntu:
  Incomplete

Bug description:
  Below is the output of ceph health run at one-second intervals.
  Although the number of PGs does not change, the output flips from
  HEALTH_OK to HEALTH_WARN and back all the time.

  https://pastebin.canonical.com/200887/

  The command below generates the PG distribution per OSD

  https://pastebin.canonical.com/200891/

  and here is the output it produced

  https://pastebin.canonical.com/200877/

  Running on Ubuntu 14.04
  charm version stable/17.02
  ceph version 10.2.7-0ubuntu0.16.04.1~cloud0
