[Bug 2083061] Re: error deleting cloned volumes and parent at the same time when using ceph

Rodrigo Barbieri 2083061 at bugs.launchpad.net
Mon Nov 18 20:03:11 UTC 2024


** Also affects: ubuntu
   Importance: Undecided
       Status: New

** No longer affects: ubuntu

** Also affects: cinder (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: cinder (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: cloud-archive
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
       Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
       Status: New

** Summary changed:

- error deleting cloned volumes and parent at the same time when using ceph
+ [SRU] error deleting cloned volumes and parent at the same time when using ceph

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to cinder in Ubuntu.
https://bugs.launchpad.net/bugs/2083061

Title:
  [SRU] error deleting cloned volumes and parent at the same time when
  using ceph

Status in Cinder:
  New
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in cinder package in Ubuntu:
  New
Status in cinder source package in Jammy:
  New

Bug description:
  Affects: bobcat and older

  A race condition when deleting cloned volumes at the same time as
  their parent leaves the volumes in the error_deleting state. It
  happens because the code that looks up the parent in [3] may find
  either the original volume or the renamed "<volume>.deleted" volume,
  depending on whether the parent has already been marked for deletion.
  When the parent and a child are deleted concurrently, the child's
  delete may see the parent under its original name and then, in [4],
  fail to find it again because it has since been renamed to
  "<volume>.deleted".

  Steps to reproduce:

  1) openstack volume create --size 1 v1

  Wait for the volume to be created and available

  2) for i in {1..9}; do openstack volume create d$i --source v1 --size 1; done

  Wait for all volumes to be created and available

  3) openstack volume delete $(openstack volume list --format value -c ID | sort | xargs)

  Some volumes may end up in the error_deleting state.

  Workaround: Reset volume state and try to delete again.

  Solutions:

  a) The issue does not happen in Caracal and later because of commit
  [1], which refactors the code. I tried to reproduce it in Caracal
  with 50 volumes, including grandparent volumes, and could not. If we
  could backport this fix as far back as Yoga, it would address the
  problem for our users.

  b) A single line of code added in [2] can address the problem in
  Bobcat and older releases by introducing a retry:

      @utils.retry(rbd.ImageNotFound, 2)
      def delete_volume(self, volume: Volume) -> None:

  When the ImageNotFound exception is thrown at [4], the retry simply
  re-runs delete_volume, which then finds the "<volume>.deleted" image
  at [3], resolving the race condition. This is simpler than adding
  more complex handling directly at [4], where the error occurs.

  
  [1] https://github.com/openstack/cinder/commit/1a675c9aa178c6d9c6ed10fd98f086c46d350d3f

  [2]
  https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1371

  [3]
  https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1401

  [4]
  https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1337

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2083061/+subscriptions