[Bug 1384062] Re: os-prober kills ceph OSD

Dr. Jens Rosenboom j.rosenboom at x-ion.de
Tue Feb 16 15:22:24 UTC 2016


I can reproduce this on Wily with

# apt-cache policy ceph
ceph:
  Installed: 0.94.5-0ubuntu0.15.10.1
  Candidate: 0.94.5-0ubuntu0.15.10.1
  Version table:
 *** 0.94.5-0ubuntu0.15.10.1 0
        500 http://eu.archive.ubuntu.com/ubuntu/ wily-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     0.94.3-0ubuntu2 0
        500 http://eu.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages

when I configure two OSDs both with their journal partition on a third
disk. Running os-prober makes the second OSD fail with

     0> 2016-02-16 15:15:17.604773 7fe2c7ec8700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7fe2c7ec8700 time 2016-02-16 15:15:17.603014
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x556db4ce44f0]
 2: (FileJournal::write_finish_thread_entry()+0x775) [0x556db4bb5925]
 3: (FileJournal::WriteFinisher::entry()+0xd) [0x556db4a990ed]
 4: (()+0x76aa) [0x7fe2d32656aa]
 5: (clone()+0x6d) [0x7fe2d173ceed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

which seems to indicate that os-prober is doing unsafe things to the
journal partitions.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ceph in Ubuntu.
https://bugs.launchpad.net/bugs/1384062

Title:
  os-prober kills ceph OSD

Status in ceph package in Ubuntu:
  Confirmed
Status in ceph package in Juju Charms Collection:
  Invalid

Bug description:
  This morning automatic package upgrade on our running system:

  libsigc++-2.0-0c2a,libssl1.0.0,man-db,libgtk2.0-common,libgtk2.0-bin,
  libgtk2.0-0,openssh-sftp-server,openssh-server,
  openssh-client,grub-pc,grub-pc-bin,grub2-common,grub-common,openssl,python-cryptography,python-pygraphviz

  
  killed five OSD out of 15 on our  ceph 0.80.6 cluster of 5 machines :

  root at g2:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.7.log
  2014-10-22 07:41:36.783358 7f4d33d55700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 07:41:36.783617 7f4d33d55700 -1 journal FileJournal::do_write: write_bl(pos=793935872) failed
  2014-10-22 07:41:36.800201 7f4d33d55700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)' thread 7f4d33d55700 time 2014-10-22 07:41:36.783629
  2014-10-22 07:41:36.847389 7f4d33d55700 -1 *** Caught signal (Aborted) **

  root at n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.10.log|cut -c-120
  2014-10-22 07:42:18.169142 7f9b977df700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted

  root at n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.9.log|cut -c-120
  2014-10-22 07:42:17.509579 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.13 since back 2014-10-22 07:41
  2014-10-22 07:42:17.509593 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.14 since back 2014-10-22 07:41
  2014-10-22 07:42:17.945433 7f6ef6a74700 -1 journal FileJournal::do_write: pwrite(fd=23, hbp.length=4096) failed :(1) Ope
  2014-10-22 07:42:17.960678 7f6ef6a74700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)

  root at stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.13.log
  2014-10-22 00:42:01.140574 7fa929b8a700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 00:42:01.141439 7fa929b8a700 -1 journal FileJournal::do_write: write_bl(pos=3496448000) failed

  root at stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.14.log
  2014-10-22 00:41:54.828719 7f438eb45700 -1 osd.14 15388 heartbeat_check: no reply from osd.7 since back 2014-10-22 00:41:34.499777 front 2014-10-22 00:41:34.499777 (cutoff 2014-10-22 00:41:34.828717)
  2014-10-22 00:41:55.241586 7f437217f700  0 -- 192.168.99.246:6811/17136 >> 192.168.99.253:6806/25800 pipe(0x7f439f5fd900 sd=182 :6811 s=0 pgs=0 cs=0 l=0 c=0x7f43a71f1180).accept connect_seq 34 vs existing 33 state standby
  2014-10-22 00:42:01.235014 7f438b33e700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
  2014-10-22 00:42:01.235032 7f438b33e700 -1 journal FileJournal::do_write: write_bl(pos=4626878464) failed

  The OSD all died just after a run of os-prober according to the logs:

  Oct 22 07:41:36 g2 os-prober: debug: running /usr/lib/os-
  probes/mounted/05efi on mounted /dev/sda1

  os-prober likely did an operation on the journal partition causing the
  write errors on the OSD.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1384062/+subscriptions



More information about the Ubuntu-openstack-bugs mailing list