[Bug 1384062] Re: os-prober kills ceph OSD

Dr. Jens Rosenboom j.rosenboom at x-ion.de
Wed Feb 17 12:10:19 UTC 2016


Forgot to mention that the Ceph cluster has to be under write load in
order to reproduce the issue, e.g. by running something like

rados -p rbd bench 600 write -t 1 --show-time --run-length 60

Running os-prober has no effect if the cluster is idle. Based on that
information, though, I can also reproduce the issue by running fio on
some partition and os-prober in parallel, getting:

# fio --ioengine=libaio --filename=/dev/sdc4 --bs=64k --rw=randwrite --runtime=300 --size=1G --direct=1 --iodepth=8 --name=a
a: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=8
fio-2.2.13
Starting 1 process
fio: io_u error on file /dev/sdc4: Operation not permitted: write offset=531300352, buflen=65536
fio: pid=17543, err=1/file:io_u.c:1596, func=io_u error, error=Operation not permitted

So I think the error has nothing to do with Ceph in particular; rather,
os-prober should be made more conservative when probing partitions.
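
For anyone who wants to try this on a scratch machine, the parallel run
described above could be scripted roughly as follows. This is only a
sketch: the function name repro_osprober_fio, the block-device check and
the one-second pause between probes are my own assumptions, while the
fio options are copied from the run above. Be aware that fio overwrites
data on the given partition, and that both fio and os-prober need root.

```shell
#!/bin/sh
# Sketch: write to a scratch partition with fio while os-prober runs
# repeatedly in parallel. WARNING: destroys data on the given device.
# repro_osprober_fio is a hypothetical helper name, not an existing tool.
repro_osprober_fio() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: repro_osprober_fio /dev/SCRATCH_PARTITION" >&2
        return 2
    fi
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device" >&2
        return 2
    fi
    # Same fio options as in the report; only the filename is a parameter.
    fio --ioengine=libaio --filename="$dev" --bs=64k --rw=randwrite \
        --runtime=300 --size=1G --direct=1 --iodepth=8 --name=a &
    fio_pid=$!
    # Keep probing for as long as fio is still running (pause is arbitrary).
    while kill -0 "$fio_pid" 2>/dev/null; do
        os-prober >/dev/null 2>&1
        sleep 1
    done
    # Return fio's exit status, which should be non-zero after an I/O error.
    wait "$fio_pid"
}
```

Calling it as `repro_osprober_fio /dev/sdc4` would then correspond to the
run shown above; a non-zero return value suggests fio hit a write error.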

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to os-prober in Ubuntu.
https://bugs.launchpad.net/bugs/1384062

Title:
  os-prober kills ceph OSD

Status in ceph package in Ubuntu:
  Confirmed
Status in os-prober package in Ubuntu:
  Confirmed
Status in ceph package in Juju Charms Collection:
  Invalid

Bug description:
  This morning an automatic package upgrade on our running system:

  libsigc++-2.0-0c2a,libssl1.0.0,man-db,libgtk2.0-common,libgtk2.0-bin,
  libgtk2.0-0,openssh-sftp-server,openssh-server,
  openssh-client,grub-pc,grub-pc-bin,grub2-common,grub-common,openssl,python-cryptography,python-pygraphviz

  
  killed five OSDs out of 15 on our Ceph 0.80.6 cluster of 5 machines:

  root@g2:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.7.log
  2014-10-22 07:41:36.783358 7f4d33d55700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 07:41:36.783617 7f4d33d55700 -1 journal FileJournal::do_write: write_bl(pos=793935872) failed
  2014-10-22 07:41:36.800201 7f4d33d55700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)' thread 7f4d33d55700 time 2014-10-22 07:41:36.783629
  2014-10-22 07:41:36.847389 7f4d33d55700 -1 *** Caught signal (Aborted) **

  root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.10.log|cut -c-120
  2014-10-22 07:42:18.169142 7f9b977df700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted

  root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.9.log|cut -c-120
  2014-10-22 07:42:17.509579 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.13 since back 2014-10-22 07:41
  2014-10-22 07:42:17.509593 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.14 since back 2014-10-22 07:41
  2014-10-22 07:42:17.945433 7f6ef6a74700 -1 journal FileJournal::do_write: pwrite(fd=23, hbp.length=4096) failed :(1) Ope
  2014-10-22 07:42:17.960678 7f6ef6a74700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)

  root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.13.log
  2014-10-22 00:42:01.140574 7fa929b8a700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 00:42:01.141439 7fa929b8a700 -1 journal FileJournal::do_write: write_bl(pos=3496448000) failed

  root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.14.log
  2014-10-22 00:41:54.828719 7f438eb45700 -1 osd.14 15388 heartbeat_check: no reply from osd.7 since back 2014-10-22 00:41:34.499777 front 2014-10-22 00:41:34.499777 (cutoff 2014-10-22 00:41:34.828717)
  2014-10-22 00:41:55.241586 7f437217f700  0 -- 192.168.99.246:6811/17136 >> 192.168.99.253:6806/25800 pipe(0x7f439f5fd900 sd=182 :6811 s=0 pgs=0 cs=0 l=0 c=0x7f43a71f1180).accept connect_seq 34 vs existing 33 state standby
  2014-10-22 00:42:01.235014 7f438b33e700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 00:42:01.235032 7f438b33e700 -1 journal FileJournal::do_write: write_bl(pos=4626878464) failed

  The OSDs all died just after a run of os-prober according to the logs:

  Oct 22 07:41:36 g2 os-prober: debug: running /usr/lib/os-probes/mounted/05efi on mounted /dev/sda1

  os-prober likely performed an operation on the journal partition,
  causing the write errors on the OSDs.
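
To check the timing on other hosts, the two kinds of log lines quoted
above can be reduced to their timestamps and compared side by side. The
following is a rough sketch only: the helper names journal_eperm_times
and os_prober_times are made up for illustration, and the patterns
assume the log formats look exactly like the excerpts in this report.

```shell
#!/bin/sh
# Print "YYYY-MM-DD HH:MM:SS" for each failed journal write with errno 1
# found in the given ceph-osd log files (format as in the excerpts above).
journal_eperm_times() {
    grep -h 'journal FileJournal::.*failed.*(1) Ope' "$@" | cut -c1-19 | uniq
}

# Print "Mon DD HH:MM:SS" for each os-prober line in the given syslog files.
os_prober_times() {
    grep -h ' os-prober: ' "$@" | cut -c1-15 | uniq
}
```

For example, `journal_eperm_times /var/log/ceph/ceph-osd.7.log` next to
`os_prober_times /var/log/syslog` would show whether the journal errors
start in the same second as an os-prober run, as they did on g2. The
syslog path is an assumption about the local setup.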

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1384062/+subscriptions


