[Bug 2083061] Re: [SRU] error deleting cloned volumes and parent at the same time when using ceph
Rodrigo Barbieri
2083061 at bugs.launchpad.net
Wed Nov 20 15:46:49 UTC 2024
For reference this is the error and stack trace:
parent: 2078a5c0-4272-46f3-b95b-d89d62da67af
child: 12d02019-17b1-45a3-9026-aed36873edaf
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/manager.py", line 981, in delete_volume
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server self.driver.delete_volume(volume)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py", line 1350, in delete_volume
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server self._delete_clone_parent_refs(client, parent, parent_snap)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py", line 1231, in _delete_clone_parent_refs
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server parent_rbd = self.rbd.Image(client.ioctx, parent_name)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server File "rbd.pyx", line 2896, in rbd.Image.__init__
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server rbd.ImageNotFound: [errno 2] RBD image not found (error opening image b'volume-2078a5c0-4272-46f3-b95b-d89d62da67af' at snapshot None)
2024-11-20 15:43:43.583 38606 ERROR oslo_messaging.rpc.server
2024-11-20 15:43:43.593 38606 DEBUG cinder.volume.drivers.rbd [req-5c4144c1-c310-4f05-ba15-b05dda6d61af 70958fca143047a583e91795ff460152 5c20c2e1c8ed4948923449807a40b3e7 - - -] volume is a clone so cleaning references delete_volume /usr/lib/python3/dist-packages/cinder/volume/drivers/rbd.py:1348
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to cinder in Ubuntu.
https://bugs.launchpad.net/bugs/2083061
Title:
[SRU] error deleting cloned volumes and parent at the same time when
using ceph
Status in Cinder:
New
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive antelope series:
New
Status in Ubuntu Cloud Archive bobcat series:
New
Status in Ubuntu Cloud Archive yoga series:
New
Status in cinder package in Ubuntu:
New
Status in cinder source package in Jammy:
New
Bug description:
******* SRU TEMPLATE AT THE BOTTOM **********
Affects: bobcat and older
A race condition when deleting cloned volumes at the same time as
their parent leaves the volumes in the error_deleting state. This
happens because the code that looks for the parent in [3] may find
either the original volume or the renamed "<volume>.deleted" volume if
the parent has already been marked for deletion. The race occurs
because, when the parent and the child are deleted at the same time,
the child may see the parent volume before it is marked for deletion,
and then at [4] it fails to find it again because it is gone (renamed
to "<volume>.deleted").
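For illustration only, the following minimal Python sketch (with
hypothetical names and data structures, not the actual RBD driver
code) reproduces the ordering that triggers the failure: the child
deletion resolves the parent's name, the concurrent parent deletion
renames it to "<volume>.deleted", and the subsequent open of the old
name fails:
# Hypothetical sketch of the race window; not the actual cinder code.
images = {"volume-parent": "data"}  # stand-in for the RBD pool contents

def delete_parent():
    # The parent still has clone references, so it is renamed to
    # "<volume>.deleted" instead of being removed outright.
    images["volume-parent.deleted"] = images.pop("volume-parent")

def delete_child():
    # Lookup corresponding to [3]: resolve the parent's current name.
    parent_name = ("volume-parent" if "volume-parent" in images
                   else "volume-parent.deleted")
    # The concurrent parent deletion renames the image in this window.
    delete_parent()
    # Open corresponding to [4]: the name resolved above no longer exists.
    if parent_name not in images:
        raise LookupError("ImageNotFound: %s" % parent_name)

try:
    delete_child()
except LookupError as exc:
    print(exc)  # ImageNotFound: volume-parent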
Steps to reproduce:
1) openstack volume create --size 1 v1
Wait for the volume to be created and available
2) for i in {1..9}; do openstack volume create d$i --source v1 --size
1;done
Wait for all volumes to be created and available
3) openstack volume delete $(openstack volume list --format value -c
ID | sort | xargs)
Some volumes may be in error_deleting state.
Workaround: Reset volume state and try to delete again.
Solutions:
a) The issue does not happen in Caracal and later because of commit
[1], which refactors the code. I tried to reproduce it in Caracal with
50 volumes, including grandparent volumes, and could not. If we could
backport this fix as far back as Yoga, it would address the problem
for our users.
b) A single line of code in [2] can address the problem in Bobcat and
older releases by adding a retry:
@utils.retry(rbd.ImageNotFound, 2)
def delete_volume(self, volume: Volume) -> None:
The retry causes the ImageNotFound exception raised at [4] to re-run
the delete_volume function, which then finds the "<volume>.deleted"
image at [3], resolving the race condition. This is simpler than
adding more complex handling directly at [4] where the error happens.
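To illustrate the mechanism, here is a simplified, self-contained
stand-in (not the actual cinder.utils.retry implementation): the
decorator catches the named exception and re-invokes the wrapped
function up to the given number of attempts:
import functools

class ImageNotFound(Exception):
    # Stand-in for rbd.ImageNotFound, used here for illustration only.
    pass

def retry(exc_type, attempts):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts:
                        raise  # out of attempts, propagate the error
        return wrapper
    return decorator

calls = []

@retry(ImageNotFound, 2)
def delete_volume(volume):
    calls.append(volume)
    if len(calls) == 1:
        # First attempt: the parent was renamed to "<volume>.deleted"
        # between the lookup at [3] and the open at [4].
        raise ImageNotFound(volume)
    # Second attempt: the lookup resolves the renamed parent and succeeds.

delete_volume("volume-2078a5c0-4272-46f3-b95b-d89d62da67af")
assert len(calls) == 2  # the first failure was retried and then succeeded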
[1]
https://github.com/openstack/cinder/commit/1a675c9aa178c6d9c6ed10fd98f086c46d350d3f
[2]
https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1371
[3]
https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1401
[4]
https://github.com/openstack/cinder/blob/5b3717f8bfa69c142778ffeabfc4ab91f1f23581/cinder/volume/drivers/rbd.py#L1337
===================================================
SRU TEMPLATE
============
[Impact]
Due to a race condition, attempting to delete multiple volumes that
include both a parent and a child can result in one or more volumes
being stuck in error_deleting. This happens because the children get
updated as the parent is deleted, and if the code has already started
deleting a child, the reference changes halfway through and the
deletion fails. The volumes can still be deleted later by resetting
their state and retrying, but the user experience is cumbersome.
Upstream has fixed the issue in Caracal by refactoring the delete
method with significant behavioural changes (see comment #2), and has
backported the refactor to Antelope. Although the refactored code also
applies to Yoga, a simpler fix is preferred to address this specific
problem in Yoga. The simpler fix is a retry decorator that forces the
delete method to re-run, picking up the updated reference of the
parent being deleted and therefore succeeding in deleting the
children.
[Test case]
1) Deploy Cinder with Ceph
2) Create a parent volume
openstack volume create --size 1 v1
3) Create the child volumes
for i in {1..9}; do openstack volume create d$i --source v1 --size
1;done
4) Wait for all volumes to be created and available
5) Delete all the volumes
openstack volume delete $(openstack volume list --format value -c ID |
sort | xargs)
6) Check for volumes stuck in error_deleting; if there are none,
repeat steps 2-5
7) Confirm error message rbd.ImageNotFound in the logs
8) Install fixed package
9) Repeat steps 2-5; confirm the rbd.ImageNotFound error message still
appears in the logs but no volumes are stuck in error_deleting
[Regression Potential]
For Bobcat and Antelope, there is reasonable regression potential
because of the complexity of refactor [1] (see comment #2); however,
discussions in previous upstream meetings and the upstream CI runs of
the Caracal, Bobcat and Antelope backports, which exercise the
refactor, provide some level of reassurance. For Yoga, we consider
there to be no regression potential with the simpler retry decorator
fix.
[Other Info]
None.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2083061/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list