[Bug 1560710] Re: ISST-SAN:KVM:R3-0: LVM udev rules deadlock with large number of PVs

Martin Pitt martin.pitt at ubuntu.com
Wed Mar 30 08:50:33 UTC 2016


I thought about such a "master/slave" implementation in watershed for a
while, and this is a bit tricky. The "slaves" need to communicate their
desire to run the command to the master, and both the master and slaves
need to use locking to ensure that only one instance is the master and
runs the command. However, the acts of "(try to) acquire the lock" and
"refresh the desire to run the command" need to happen atomically,
otherwise race conditions occur. For example, in a slave instance some
time might pass between failing to acquire the lock and
increasing/refreshing the stamp; if the master finishes in that window,
the last request is lost.

The standard tool for such a race free and atomic counter which is
simultaneously a lock is a semaphore. However, both named and unnamed
POSIX semaphores rely on /dev/shm/, and we cannot rely on that in udev
rules (no /dev/shm in the initrd). We can use semaphores if we can
assert that the udev rule is not crucial during early boot. I *think* it
should be okay as it gets re-exercised after pivoting and everything
gets mounted.
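
To make this concrete, here is a minimal sketch of such a
semaphore-based master/slave scheme (compile with -pthread; error
handling omitted). It assumes named POSIX semaphores are available,
i.e. /dev/shm is mounted, and the semaphore names and the exact command
are illustrative only; this is not watershed's actual implementation:

   /* Sketch: "pending" counts outstanding run requests, "lock" elects
    * the master.  Semaphore names and the command are made up. */
   #include <fcntl.h>
   #include <semaphore.h>
   #include <stdlib.h>

   int main(void)
   {
       sem_t *pending = sem_open("/watershed-pending", O_CREAT, 0600, 0);
       sem_t *lock    = sem_open("/watershed-lock",    O_CREAT, 0600, 1);

       /* Register our request first; sem_post() is atomic, so the
        * request cannot get lost even if the master finishes now. */
       sem_post(pending);

       while (sem_trywait(lock) == 0) {      /* we became the master */
           while (sem_trywait(pending) == 0) /* drain all requests */
               ;
           system("/sbin/lvm vgscan; /sbin/lvm vgchange -a y");
           sem_post(lock);                   /* hand over mastership */

           /* A request that arrived between the drain and the unlock
            * must not be lost: put it back and try to run again. */
           if (sem_trywait(pending) != 0)
               break;
           sem_post(pending);
       }
       return 0;  /* slave: our request was already counted atomically */
   }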


I wanted to compare the hotplug behaviour under Debian and Ubuntu. These commands can be run in a minimal VM:

   apt-get install -y lvm2
   reboot  # lvm daemons don't seem to start right after installation; get a clean slate

   modprobe scsi_debug
   pvcreate /dev/sda
   vgcreate testvg /dev/sda
   lvcreate -L 4MB testvg

Now we have one PV, VG, and LV each, and a usable block device:

   lrwxrwxrwx 1 root root 7 Mar 30 08:01 /dev/testvg/lvol0 -> ../dm-0

Let's hot-remove the device. This does not automatically clean up the
mapped device, so do this manually:

   echo 1 > /sys/block/sda/device/delete
   dmsetup remove /dev/testvg/lvol0

Now hotplug back the block device:

   echo '0 0 0' > /sys/class/scsi_host/host2/scan

Under *both* Debian and Ubuntu this correctly brings up the PV, VG, and
LV, and /dev/testvg/lvol0 exists again. I can even remove our udev rule
85-lvm2.rules, update the initrd, and reboot; the above test still
succeeds.

Thus it seems our Ubuntu-specific udev rule is entirely obsolete. Indeed
these days /lib/udev/rules.d/69-lvm-metad.rules (which calls pvscan
--cache --activate) and lvmetad seem to be responsible for that, see
/usr/share/doc/lvm2/udev_assembly.txt. So it seems we are now just doing
extra work for no benefit.
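
For reference, the operative part of 69-lvm-metad.rules boils down to
something like this (a simplified sketch; the real rule has more match
keys and systemd integration):

   SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="LVM2_member", \
      RUN+="/sbin/lvm pvscan --cache --activate ay --major $major --minor $minor"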

I also noticed this in our Ubuntu delta description:

        - do not install activation systemd generator for lvm2, since
udev starts LVM.

The activation generator is relevant if the admin disabled lvmetad: in
that case the generator activates the VGs at boot time. It's a no-op if
lvmetad is enabled. We should put that back to match the current
documentation and reduce our delta.

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to watershed in Ubuntu.
https://bugs.launchpad.net/bugs/1560710

Title:
  ISST-SAN:KVM:R3-0: LVM udev rules deadlock with large number of PVs

Status in lvm2 package in Ubuntu:
  Triaged
Status in watershed package in Ubuntu:
  In Progress

Bug description:
  Original problem statement:

  Today I reinstalled lucky03. Installation went fine. After
  installation it rebooted, and after the reboot I was unable to get the
  login prompt.

  == Comment: #3 - Kevin W. Rudd - 2016-03-14 18:49:02 ==
  It looks like this might be related to bug 124628. I was able to get to a login prompt by adding the following to the boot options:

  udev.children-max=500

  Lekshmi,

  Can you provide additional information on how you did the install for
  this lpar?  It would be nice to replicate the exact install sequence
  from the beginning in order to try to capture some additional debug
  information.

  == Comment: #18 - Mauricio Faria De Oliveira - 2016-03-22 16:59:40 ==
  It's possible to reproduce this on a qemu-kvm guest w/ emulated hard disks (i.e., image/file-backed disks).

  Configuration:
  - 1 disk w/ 16.04 (no LVM required)
  - 50 disks (w/ LVM volumes)
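
  One illustrative way to create the 50 backing disks on the host
  (paths and sizes are made up; any method of attaching 50 image-backed
  SCSI disks to the guest should work):

  # for i in $(seq -w 1 50); do
        qemu-img create -f qcow2 /var/lib/libvirt/images/pv$i.qcow2 1G
    done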

  # ps l
  ...
  S     0  7014   145  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7015   144  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7016   140  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7017   139  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7018   142  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7019  7014  3520  1280 0:0   20:53 00:00:00 sh -c /sbin/lvm vgscan; /
  S     0  7020   137  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7021   143  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7023   136  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7024   141  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7025   138  3008  1216 0:0   20:53 00:00:00 /lib/udev/watershed sh -c
  S     0  7026  7019 10560  9344 0:0   20:53 00:00:00 /sbin/lvm vgchange -a y
  ...

  # cat /proc/7014/stack
  [<c0000000034e3b20>] 0xc0000000034e3b20
  [<c000000000015ce8>] __switch_to+0x1f8/0x350
  [<c0000000000bc3ac>] do_wait+0x22c/0x2d0
  [<c0000000000bd998>] SyS_wait4+0xa8/0x140
  [<c000000000009204>] system_call+0x38/0xb4

  # cat /proc/7019/stack
  [<c0000000031cfb20>] 0xc0000000031cfb20
  [<c000000000015ce8>] __switch_to+0x1f8/0x350
  [<c0000000000bc3ac>] do_wait+0x22c/0x2d0
  [<c0000000000bd998>] SyS_wait4+0xa8/0x140
  [<c000000000009204>] system_call+0x38/0xb4

  # cat /proc/7026/stack
  [<c0000000031aba80>] 0xc0000000031aba80
  [<c000000000015ce8>] __switch_to+0x1f8/0x350
  [<c000000000463160>] SyS_semtimedop+0x810/0x9f0
  [<c0000000004661d4>] SyS_ipc+0x154/0x3c0
  [<c000000000009204>] system_call+0x38/0xb4

  # dmsetup udevcookies
  Cookie       Semid      Value      Last semop time           Last change time
  0xd4d888a    0          1          Tue Mar 22 20:53:55 2016  Tue Mar 22 20:53:55 2016

  == Comment: #19 - Mauricio Faria De Oliveira - 2016-03-22 17:00:13 ==
  Command to create the LVM volumes, run from the initramfs:

  # for sd in /sys/block/sd*; do
        sd="$(basename $sd)"
        [ "$sd" = 'sda' ] && continue
        lvm pvcreate /dev/$sd
        lvm vgcreate vg-$sd /dev/$sd
        lvm lvcreate --size 1000m --name lv-$sd vg-$sd
    done

  # lvm vgdisplay | grep -c 'VG Name'
  50

  == Comment: #20 - Mauricio Faria De Oliveira - 2016-03-22 17:57:50 ==
  Hm, got a better picture of this:

  The problem doesn't seem to be a synchronization issue.
  I've learned a bit more about the udev events/cookies for sdX and lvm volumes.

  1) The sdX add events happen, for multiple devices.
     Each device consumes one udev worker.

  2) The sdX add events run 'watershed .. vgchange -a y ...'.
     If this detects an LVM volume on sdX, it will try to activate it,
     and then block/wait for the respective LVM/udev cookie to complete
     (i.e., wait for the LVM dm-X device add event to finish).

  3) The dm-X device add event is fired from the kernel.

  4) There are no available udev workers to process it.
     The event processing remains queued.
     Thus, the cookie will not be released.

  5) No udev workers from the sdX devices will finish, since all are
     waiting for cookies to complete, and completing them demands
     available udev workers.
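
  A quick way to observe this state from another console while boot
  hangs (illustrative commands; the dmsetup udevcookies output is shown
  above):

  # udevadm settle --timeout=10 || echo 'udev queue still busy'
  # ps -ef | grep -c '[w]atershed'   # roughly one stuck worker per PV
  # dmsetup udevcookies              # cookies that never get released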

  == Comment: #21 - Mauricio Faria De Oliveira - 2016-03-22 18:02:11 ==
  Got confirmation of the previous hypothesis.

  I added the --noudevsync argument to the vgchange command in the
  initramfs's /lib/udev/rules.d/85-lvm2.rules.
  This causes vgchange not to wait for a udev cookie.

  Things didn't block, and actually finished quite fast.
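
  Based on the watershed invocation visible in the ps output above, the
  changed rule would look roughly like this (match keys omitted; a
  sketch, not the exact rule text):

  RUN+="watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y --noudevsync'"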

  == Comment: #23 - Mauricio Faria De Oliveira - 2016-03-22 18:14:41 ==
  Hi Canonical,

  Could you please help us find a solution to this problem?
  If I recall correctly, @pitti works w/ udev and early boot in general.

  The problem summary (from previous comments) is:

  1) There are more SCSI disks w/ LVM volumes present than the maximum
  number of udev workers. Each disk consumes one udev worker.

  2) When the add uevent from each disk runs 85-lvm2.rules, the call to
  'vgchange -a y' will detect LVM volume(s) and activate them. This
  fires an add uevent for a dm-X device from the kernel, and vgchange
  blocks waiting for the respective udev cookie to be completed.

  3) The add uevent for dm-X has no udev worker to run on (all are
  taken by the SCSI disks, which are blocked in calls to vgchange, or
  in watershed waiting for a vgchange to finish), and thus the udev
  cookie related to dm-X will not be completed.

  4) If that cookie is not completed, that vgchange won't finish either.

  It's a deadlock, afaik.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lvm2/+bug/1560710/+subscriptions



More information about the foundations-bugs mailing list