[Bug 1540407] Re: multipathd drops paths of a temporarily lost device
Ryan Harper
1540407 at bugs.launchpad.net
Wed Feb 24 20:36:30 UTC 2016
Hi Thorsten,
The latest version of the multipath-tools package is 0.5.0-7ubuntu14.
Can you confirm you're still seeing the issue?
I'm hoping to recreate this issue on zKVM shortly. In the meantime I'm
testing this in an x86 VM with multipath via virtio-scsi, using the
same multipath.conf as mentioned in the bug.
# dpkg -s multipath-tools| grep ^Version
Version: 0.5.0-7ubuntu14
# multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 2:0:0:0 sda 8:0 active ready running
`- 2:0:0:1 sdb 8:16 active ready running
I can mark the device offline with:
# echo "offline" > /sys/class/block/sdb/device/state
# multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 2:0:0:0 sda 8:0 active ready running
`- 2:0:0:1 sdb 8:16 active faulty offline
# sleep 60 && multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 2:0:0:0 sda 8:0 active ready running
`- 2:0:0:1 sdb 8:16 failed faulty offline
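As a cross-check, multipathd's own view of the path states can be queried
through its interactive shell (a sketch; the exact output columns vary
between versions, so I've omitted them here):
# multipathd -k"show paths"    # lists hcil, devnode, dm state and checker state per path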
And bring it back by deleting the device and rescanning:
# echo "1" > /sys/block/sdb/device/delete
# for RESCAN in /sys/class/scsi_host/*; do echo "- - -" > $RESCAN/scan; done
# multipath -v2
# multipath -ll
mpatha (0QEMU_QEMU_HARDDISK_0001) dm-0 QEMU,QEMU HARDDISK
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 2:0:0:0 sda 8:0 active ready running
`- 2:0:0:1 sdb 8:16 active ready running
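For what it's worth, rescanning every SCSI host is more than needed here;
the single affected host (host2, per the hcil 2:0:0:1 above) can be
rescanned directly, and a direct read through the map is a quick way to
confirm I/O flows over the restored path. A sketch:
# echo "- - -" > /sys/class/scsi_host/host2/scan    # rescan only the affected host
# dd if=/dev/mapper/mpatha of=/dev/null bs=1M count=16 iflag=direct    # O_DIRECT read through the map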
It may be that the underlying device would need to be in an error state, rather than just being marked offline at the SCSI layer in the kernel.
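If so, one way to provoke a device-level failure in this KVM setup might
be to hot-unplug the backing disk from the QEMU monitor; "scsi-disk1"
below is a hypothetical id that would have been assigned with
-device ...,id=scsi-disk1:
(qemu) device_del scsi-disk1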
I'll update this when I get the zKVM instance up with multipath as described in the bug.
Looking at the delta between 0.5.0 in Ubuntu and the newer version in
Debian, there are a number of changes in the area of discovery and path
checking which may resolve this, if we can confirm the issue still
occurs with the latest version in Xenial.
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to multipath-tools in Ubuntu.
https://bugs.launchpad.net/bugs/1540407
Title:
multipathd drops paths of a temporarily lost device
Status in multipath-tools package in Ubuntu:
New
Bug description:
== Comment: #0 - Thorsten Diehl <thorsten.diehl at de.ibm.com> - 2016-02-01 08:57:28 ==
# uname -a
Linux s83lp31 4.4.0-1-generic #15-Ubuntu SMP Thu Jan 21 22:19:04 UTC 2016 s390x s390x s390x GNU/Linux
# dpkg -s multipath-tools|grep ^Version:
Version: 0.5.0-7ubuntu9
# cat /etc/multipath.conf
defaults {
        default_features        "1 queue_if_no_path"
        user_friendly_names     yes
        path_grouping_policy    multibus
        dev_loss_tmo            2147483647
        fast_io_fail_tmo        5
}
blacklist {
        devnode '*'
}
blacklist_exceptions {
        devnode "^sd[a-z]+"
}
---------------------------------------
On a z Systems LPAR with a single LUN, 2 zfcp devices, 2 storage ports, and the following multipath topology:
mpatha (36005076304ffc3e80000000000003050) dm-0 IBM,2107900
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
|- 0:0:0:1079001136 sda 8:0 active ready running
|- 0:0:1:1079001136 sdb 8:16 active ready running
|- 1:0:0:1079001136 sdc 8:32 active ready running
`- 1:0:1:1079001136 sdd 8:48 active ready running
I observed the following:
When I deconfigure one of the two zfcp devices (e.g. via chchp -c 0, or directly on the HMC), multipathd removes the two paths through that device from the path group after 10 seconds. When the zfcp device comes back, it runs through zfcp error recovery and is set up properly again, and the SCSI mid-layer objects also look fine. However, multipathd does not add the paths to the path group again.
Expected behaviour: multipathd does not remove the paths from the
topology list, but holds them as "failed faulty offline" until the
dev_loss_tmo timeout is reached (which is infinite here).
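The timeout actually in effect can be read from sysfs on the FC remote
ports (rport names are of the form rport-H:B-R; the glob below assumes
you just want them all):
# cat /sys/class/fc_remote_ports/rport-*/dev_loss_tmo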
I discussed this already with zfcp development, and it looks most
likely to be a problem with multipathd, rather than with zfcp or the
mid-layer.
Easy to reproduce: you need two zfcp devices, one LUN, and preferably
two ports on the storage server (WWPNs). Configure the LUN via 2 zfcp
devices * 2 WWPNs = 4 paths.
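A sketch of the toggle on the LPAR side, using chchp from s390-tools
(0.50 is a placeholder CHPID):
# chchp -c 0 0.50    # configure the channel path off
# sleep 60
# chchp -c 1 0.50    # configure it back on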
This can also be reproduced on a z/VM guest. Instead of configuring the
CHPID off, just detach one zfcp device and re-attach it after 30-60
seconds. Same problem.
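A sketch of the z/VM variant via vmcp, which issues CP commands from
the guest (1900 is a placeholder zfcp device number):
# vmcp detach 1900
# sleep 60
# vmcp attach 1900 to '*'    # reattach to this guest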
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/+subscriptions