[Bug 1384062] Re: os-prober kills ceph OSD
Dr. Jens Rosenboom
j.rosenboom at x-ion.de
Wed Feb 17 12:10:19 UTC 2016
Forgot to mention that the Ceph cluster has to be under write load in
order to reproduce, i.e. running something like
rados -p rbd bench 600 write -t 1 --show-time --run-length 60
Running os-prober has no effect if the cluster is idle. Based on that
information, though, I can also reproduce the issue by running fio on
some partition and os-prober in parallel, getting:
# fio --ioengine=libaio --filename=/dev/sdc4 --bs=64k --rw=randwrite --runtime=300 --size=1G --direct=1 --iodepth=8 --name=a
a: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=libaio, iodepth=8
fio-2.2.13
Starting 1 process
fio: io_u error on file /dev/sdc4: Operation not permitted: write offset=531300352, buflen=65536
fio: pid=17543, err=1/file:io_u.c:1596, func=io_u error, error=Operation not permitted
So I think the error is not specific to Ceph; rather, os-prober should
be made more conservative when probing partitions.
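
The parallel fio / os-prober reproduction described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the names DEV, DRY_RUN, DELAY, run and reproduce are hypothetical helpers, not part of fio or os-prober, and the 5 second delay is an arbitrary guess at how long fio needs to start writing. The fio options are copied from the command shown above. Note that a random-write run destroys data on the target partition, so the script only prints the commands unless DRY_RUN=0 is set, and both tools need root.

```shell
#!/bin/sh
# Sketch of the reproduction: write load from fio plus a parallel os-prober run.
# DEV, DRY_RUN and DELAY are illustrative names, not options of either tool.
# WARNING: with DRY_RUN=0 this overwrites data on $DEV.
DEV="${DEV:-/dev/sdc4}"      # scratch partition, as in the report
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands
DELAY="${DELAY:-5}"          # assumed warm-up time before probing

# Either print a command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reproduce() {
    # Start the write load in the background.
    run fio --ioengine=libaio --filename="$DEV" --bs=64k --rw=randwrite \
        --runtime=300 --size=1G --direct=1 --iodepth=8 --name=a &
    fio_pid=$!

    # Give fio time to start, then probe while the writes are in flight.
    [ "$DRY_RUN" = "1" ] || sleep "$DELAY"
    run os-prober

    # Collect fio's exit status; a non-zero status would match the
    # "Operation not permitted" failure shown above.
    wait "$fio_pid"
}

reproduce
```

Running it as `DRY_RUN=0 DEV=/dev/sdX9 sh repro.sh` (with a partition that holds no data you care about) would perform the actual test; the file name and device here are placeholders.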
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to os-prober in Ubuntu.
https://bugs.launchpad.net/bugs/1384062
Title:
os-prober kills ceph OSD
Status in ceph package in Ubuntu:
Confirmed
Status in os-prober package in Ubuntu:
Confirmed
Status in ceph package in Juju Charms Collection:
Invalid
Bug description:
This morning an automatic package upgrade on our running system:
libsigc++-2.0-0c2a,libssl1.0.0,man-db,libgtk2.0-common,libgtk2.0-bin,
libgtk2.0-0,openssh-sftp-server,openssh-server,
openssh-client,grub-pc,grub-pc-bin,grub2-common,grub-common,openssl,python-cryptography,python-pygraphviz
killed five of the 15 OSDs on our Ceph 0.80.6 cluster of 5 machines:
root@g2:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.7.log
2014-10-22 07:41:36.783358 7f4d33d55700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
2014-10-22 07:41:36.783617 7f4d33d55700 -1 journal FileJournal::do_write: write_bl(pos=793935872) failed
2014-10-22 07:41:36.800201 7f4d33d55700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)' thread 7f4d33d55700 time 2014-10-22 07:41:36.783629
2014-10-22 07:41:36.847389 7f4d33d55700 -1 *** Caught signal (Aborted) **
root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.10.log|cut -c-120
2014-10-22 07:42:18.169142 7f9b977df700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.9.log|cut -c-120
2014-10-22 07:42:17.509579 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.13 since back 2014-10-22 07:41
2014-10-22 07:42:17.509593 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.14 since back 2014-10-22 07:41
2014-10-22 07:42:17.945433 7f6ef6a74700 -1 journal FileJournal::do_write: pwrite(fd=23, hbp.length=4096) failed :(1) Ope
2014-10-22 07:42:17.960678 7f6ef6a74700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)
root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.13.log
2014-10-22 00:42:01.140574 7fa929b8a700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
2014-10-22 00:42:01.141439 7fa929b8a700 -1 journal FileJournal::do_write: write_bl(pos=3496448000) failed
root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.14.log
2014-10-22 00:41:54.828719 7f438eb45700 -1 osd.14 15388 heartbeat_check: no reply from osd.7 since back 2014-10-22 00:41:34.499777 front 2014-10-22 00:41:34.499777 (cutoff 2014-10-22 00:41:34.828717)
2014-10-22 00:41:55.241586 7f437217f700 0 -- 192.168.99.246:6811/17136 >> 192.168.99.253:6806/25800 pipe(0x7f439f5fd900 sd=182 :6811 s=0 pgs=0 cs=0 l=0 c=0x7f43a71f1180).accept connect_seq 34 vs existing 33 state standby
2014-10-22 00:42:01.235014 7f438b33e700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted
2014-10-22 00:42:01.235032 7f438b33e700 -1 journal FileJournal::do_write: write_bl(pos=4626878464) failed
The OSDs all died just after a run of os-prober, according to the logs:
Oct 22 07:41:36 g2 os-prober: debug: running /usr/lib/os-probes/mounted/05efi on mounted /dev/sda1
os-prober likely performed an operation on the journal partition,
causing the write errors on the OSDs.
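
The correlation made above, between journal write failures in the OSD logs and os-prober activity in the system log, can be gathered with a small helper like the one below. It is a sketch, not a definitive tool: the function and variable names are hypothetical, and the syslog path /var/log/syslog is an assumption (the usual Ubuntu default) that may differ on other setups. The timestamps still have to be compared by eye, keeping in mind that the hosts above log in different time zones.

```shell
#!/bin/sh
# Sketch: list OSD journal write failures and os-prober runs side by side.
# CEPH_LOG_DIR and SYSLOG are illustrative names; the paths are assumptions.
CEPH_LOG_DIR="${CEPH_LOG_DIR:-/var/log/ceph}"
SYSLOG="${SYSLOG:-/var/log/syslog}"

# Print every "Operation not permitted" line, prefixed with its log file name.
osd_write_errors() {
    grep -H 'Operation not permitted' "$CEPH_LOG_DIR"/ceph-osd.*.log 2>/dev/null
}

# Print the os-prober lines recorded in the system log.
os_prober_runs() {
    grep 'os-prober' "$SYSLOG" 2>/dev/null
}

echo "== OSD journal write errors =="
osd_write_errors || echo "none found"
echo "== os-prober activity =="
os_prober_runs || echo "none found"
```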
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1384062/+subscriptions