[Bug 1354114] Re: multipath segmentation Fault (libmultipath: update waiter handling)
Rafael David Tinoco
rafael.tinoco at canonical.com
Tue Oct 14 19:50:16 UTC 2014
# Precise failover test - removed 1 path during writes. failover after
10 seconds (iscsi timeout)
root at tryprecise:~# dd if=/dev/zero of=/dev/mapper/2luns0 bs=1M count=2000
2000+0 records in
2000+0 records out
2097152000 bytes (2.1 GB) copied, 137.125 s, 15.3 MB/s
ct 14 16:46:01 tryprecise kernel: [ 2000.944119] connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295390028, last ping 4295391280, now 4295392532
Oct 14 16:46:01 tryprecise kernel: [ 2000.945011] connection2:0: detected conn error (1011)
Oct 14 16:46:01 tryprecise kernel: [ 2000.945014] connection6:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295390028, last ping 4295391280, now 4295392532
Oct 14 16:46:01 tryprecise kernel: [ 2000.945884] connection6:0: detected conn error (1011)
Oct 14 16:46:01 tryprecise iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)
Oct 14 16:46:01 tryprecise iscsid: Kernel reported iSCSI connection 6:0 error (1011) state (3)
Oct 14 16:46:05 tryprecise kernel: [ 2005.232037] connection4:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295391098, last ping 4295392352, now 4295393604
Oct 14 16:46:05 tryprecise kernel: [ 2005.232944] connection4:0: detected conn error (1011)
Oct 14 16:46:06 tryprecise iscsid: Kernel reported iSCSI connection 4:0 error (1011) state (3)
Oct 14 16:46:37 tryprecise iscsid: connect to 172.16.22.51:3260 failed (No route to host)
Oct 14 16:47:38 iscsid: last message repeated 30 times
Oct 14 16:48:01 iscsid: last message repeated 11 times
Oct 14 16:48:01 tryprecise kernel: [ 2121.440066] session2: session recovery timed out after 120 secs
Oct 14 16:48:01 tryprecise kernel: [ 2121.440077] session6: session recovery timed out after 120 secs
Oct 14 16:48:02 tryprecise iscsid: connect to 172.16.22.51:3260 failed (No route to host)
Oct 14 16:48:05 iscsid: last message repeated 2 times
Oct 14 16:48:05 tryprecise kernel: [ 2125.792083] session4: session recovery timed out after 120 secs
Oct 14 16:48:06 tryprecise kernel: [ 2125.800111] sd 5:0:0:0: rejecting I/O to offline device
Oct 14 16:48:06 tryprecise kernel: [ 2125.800530] sd 5:0:0:0: [sdg] killing request
Oct 14 16:48:06 tryprecise kernel: [ 2125.800540] sd 5:0:0:0: rejecting I/O to offline device
...
Oct 14 16:48:06 tryprecise kernel: [ 2126.154444] Write(10): 2a 00 00 2d 3c 00 00 04 00 00
Oct 14 16:48:06 tryprecise kernel: [ 2126.154593] device-mapper: multipath: Failing path 8:112.
Oct 14 16:48:06 tryprecise multipathd: 2luns0: sdg - readsector0 checker reports path is down
Oct 14 16:48:06 tryprecise multipathd: checker failed path 8:96 in map 2luns0
Oct 14 16:48:06 tryprecise multipathd: 2luns0: remaining active paths: 1
Oct 14 16:48:06 tryprecise multipathd: 2luns1: sdh - readsector0 checker reports path is up
Oct 14 16:48:06 tryprecise multipathd: checker failed path 8:112 in map 2luns1
Oct 14 16:48:06 tryprecise multipathd: 2luns1: remaining active paths: 1
Oct 14 16:48:06 tryprecise multipathd: dm-4: add map (uevent)
Oct 14 16:48:06 tryprecise multipathd: dm-4: devmap already registered
Oct 14 16:48:06 tryprecise multipathd: dm-3: add map (uevent)
Oct 14 16:48:06 tryprecise multipathd: dm-3: devmap already registered
Oct 14 16:48:07 tryprecise multipathd: lun1: sdc - readsector0 checker reports path is up
Oct 14 16:48:07 tryprecise multipathd: checker failed path 8:32 in map lun1
Oct 14 16:48:07 tryprecise multipathd: lun1: remaining active paths: 1
Oct 14 16:48:07 tryprecise multipathd: lun2: sde - readsector0 checker reports path is up
Oct 14 16:48:07 tryprecise multipathd: checker failed path 8:64 in map lun2
Oct 14 16:48:07 tryprecise multipathd: lun2: remaining active paths: 1
Oct 14 16:48:07 tryprecise kernel: [ 2127.154798] device-mapper: multipath: Failing path 8:32.
Oct 14 16:48:07 tryprecise kernel: [ 2127.154849] device-mapper: multipath: Failing path 8:64.
Oct 14 16:48:07 tryprecise multipathd: dm-1: add map (uevent)
Oct 14 16:48:07 tryprecise multipathd: dm-1: devmap already registered
Oct 14 16:48:07 tryprecise multipathd: dm-2: add map (uevent)
Oct 14 16:48:07 tryprecise multipathd: dm-2: devmap already registered
Oct 14 16:48:08 tryprecise iscsid: connect to 172.16.22.51:3260 failed (No route to host)
Oct 14 16:48:09 iscsid: last message repeated 2 times
Oct 14 16:48:09 tryprecise multipathd: 2luns0: sdg - readsector0 checker reports path is down
Oct 14 16:48:09 tryprecise multipathd: 2luns1: sdh - readsector0 checker reports path is up
Oct 14 16:48:10 tryprecise multipathd: lun1: sdc - readsector0 checker reports path is up
Oct 14 16:48:10 tryprecise multipathd: lun2: sde - readsector0 checker reports path is up
Oct 14 16:48:12 tryprecise multipathd: 2luns0: sdg - readsector0 checker reports path is down
Oct 14 16:48:12 tryprecise multipathd: 2luns1: sdh - readsector0 checker reports path is up
Oct 14 16:48:13 tryprecise multipathd: lun1: sdc - readsector0 checker reports path is up
Oct 14 16:48:13 tryprecise multipathd: lun2: sde - readsector0 checker reports path is up
Oct 14 16:48:14 tryprecise iscsid: connect to 172.16.22.51:3260 failed (No route to host)
Oct 14 16:48:15 iscsid: last message repeated 2 times
Oct 14 16:48:15 tryprecise multipathd: 2luns0: sdg - readsector0 checker reports path is down
Oct 14 16:48:15 tryprecise multipathd: 2luns1: sdh - readsector0 checker reports path is up
Oct 14 16:48:16 tryprecise multipathd: lun1: sdc - readsector0 checker reports path is up
Oct 14 16:48:16 tryprecise multipathd: lun2: sde - readsector0 checker reports path is up
Oct 14 16:48:18 tryprecise multipathd: 2luns0: sdg - readsector0 checker reports path is down
Oct 14 16:48:18 tryprecise multipathd: 2luns1: sdh - readsector0 checker reports path is up
Oct 14 16:48:19 tryprecise multipathd: lun1: sdc - readsector0 checker reports path is up
Oct 14 16:48:19 tryprecise multipathd: lun2: sde - readsector0 checker reports path is up
Oct 14 16:48:20 tryprecise iscsid: connect to 172.16.22.51:3260 failed (No route to host)
Oct 14 16:48:21 iscsid: last message repeated 2 times
...
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to multipath-tools in Ubuntu.
https://bugs.launchpad.net/bugs/1354114
Title:
multipath segmentation Fault (libmultipath: update waiter handling)
Status in “multipath-tools” package in Ubuntu:
Fix Released
Status in “multipath-tools” source package in Precise:
Fix Committed
Status in “multipath-tools” source package in Trusty:
Fix Committed
Status in “multipath-tools” package in Debian:
Fix Released
Bug description:
[Impact]
* Multipath can cause segmentation fault due to wrong code and can
possibly cause user to loose access to multipath devices.
[Test Case]
* To use multipath and wait for the problem to occur sometime
(inevitable).
[Regression Potential]
* Patch 1/4 tries to fix the issue. Patch 2/4 fixes the 1/4.
* Patch 3/4 discovers 1/4 was no good. Patch 4/4 fixes 3/4.
* Fix based on upstream code (96f8146) + subsequent patches.
* Followed this code development until the issue was addressed.
[Other Info]
* Original bug description:
----------------
It was brought to me (~inaddy) the following situation with
multipathd:
#####
Program terminated with signal 6, Aborted.
#0 0x00007fbc6ae09445 in raise () from /lib/x86_64linuxgnu/
libc.so.6
(gdb) bt
#0 0x00007fbc6ae09445 in raise () from /lib/x86_64linuxgnu/
libc.so.6
#1 0x00007fbc6ae0cbab in abort () from /lib/x86_64linuxgnu/
libc.so.6
#2 0x00007fbc6ae0210e in ?? () from /lib/x86_64linuxgnu/
libc.so.6
#3 0x00007fbc6ae021b2 in __assert_fail () from /lib/x86_64linuxgnu/
libc.so.6
#4 0x00007fbc6b849efb in pthread_mutex_lock () from /lib/x86_64linuxgnu/
libpthread.so.0
#5 0x00007fbc6b1cba5f in free_waiter (data=0x1691de0) at waiter.c:44
#6 0x00007fbc6b1cc25a in waitevent (et=0x1691de0) at waiter.c:204
#7 0x00007fbc6b847e9a in start_thread () from /lib/x86_64linuxgnu/
libpthread.so.0
#8 0x00007fbc6aec54bd in clone () from /lib/x86_64linuxgnu/
libc.so.6
#9 0x0000000000000000 in ?? ()
--------------------------------------------------------------------------------------------
#5 0x00007fbc6b1cba5f in free_waiter (data=0x1691de0) at waiter.c:44
44 lock(wp>
vecs>
lock);
(gdb) print wp>
vecs>
lock
$1 = {mutex = 0x168c280, depth = 1}
In pthread_mutex_lock.c:62 there's an assert that fails:
#4 0x00007fbc6b849efb in __pthread_mutex_lock (mutex=0xfefefefefefefeff) at pthread_mutex_lock.c:62
62 assert (mutex>_
data._owner == 0);
In this run:
(gdb) p *wp>
vecs>
lock>
mutex
$3 = {_data = {lock = 1, __count = 0, __owner = 49, __nusers = 0, __kind = 0, __spins = 0, __list = {_prev = 0x0, __next = 0xffffffff}},
__size = "\001\000\000\000\000\000\000\000\061", '\000' <repeats 23 times>"\377, \377\377\377\000\000\000", __align = 1}
so __owner is 49 and not 0.
Note that 49 is somewhat strange; it's expected to be a pid_t obtained via
pid_t id = THREAD_GETMEM (THREAD_SELF, tid);
According to https://bugzilla.redhat.com/show_bug.cgi?id=570278 , this
assert failure could be an expected behaviour if, for some reason the
multipath code was trying to release a mutex that has already been
freed.
The multipath-tools package is up to date (0.4.9-3ubuntu5)
I do not find obvious thing related in http://git.opensvc.com/gitweb.cgi?p=multipath-tools%2F.git except may be
http://git.opensvc.com/gitweb.cgi?p=multipath-tools/.git;a=commitdiff;h=5ee9f716549d913aeb9800041f78ee9c6a50d860
#####
In between Precise's version and Upstream there are the following
patches touching waiter.c:
d887f4a = signal waiter thread to stop waiting on dm events
5ee9f71 = simplify multipath signal handlers
af4fd6d = Fix race condition in stop_waiter_thread()
e1fcc59 = multipath: clean up code for stopping the waiter threads
03ec4ef = multipath: fix shutdown crashes
4dfdaf2 = multipath: Update multipath device on show topology
c301a3f = Race condition when calling stop_waiter_thread()
96f8146 = libmultipath: update waiter handling
This specific one: 96f8146 (libmultipath: update waiter handling)
"""
The current 'waiter' structure accesses fields which belong
to the main 'mpp' structure, which has a totally different
lifetime.
"""
Shows that due to different lifetime between different structures,
there can be use-after-free segfaults (what seems to be happening).
waiter.c:44 = lock(wp->vecs->lock);
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1354114/+subscriptions
More information about the foundations-bugs
mailing list