[Bug 2019190] Re: [SRU][RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)
James Page
2019190 at bugs.launchpad.net
Mon Jul 8 13:04:47 UTC 2024
This bug was fixed in the package cinder - 2:20.3.1-0ubuntu1.4~cloud0
---------------
cinder (2:20.3.1-0ubuntu1.4~cloud0) focal; urgency=medium
.
* SECURITY UPDATE for Ubuntu Cloud Archive. backport to focal.
.
cinder (2:20.3.1-0ubuntu1.4) jammy-security; urgency=medium
.
* SECURITY UPDATE: Arbitrary file access via custom QCOW2 external data
(LP: #2059809)
- debian/patches/CVE-2024-32498.patch: check for external qcow2 data
file.
- debian/control: added qemu-utils to Build-Depends so qemu-img is
available for new tests.
- CVE-2024-32498
.
cinder (2:20.3.1-0ubuntu1.2) jammy; urgency=medium
.
[ Jorge Merlino ]
* Increase size of volume image metadata values to 65535 bytes
(LP: #1988942)
.
[ Heather Lemon ]
* Start cinder-volume.service after tgt.service started (LP: #1987663)
- d/cinder-volume.service.conf: drop-in with 'After=' and 'Wants='
('Wants=' is not generated by pkgos-gen-systemd-unit currently).
- d/cinder-volume.install: ship the systemd service drop-in file.
.
[ Seyeong Kim ]
* HPE3PAR: Failing to clone a volume having children (LP: #1994521):
- d/p/0001-HPE-3PAR-Fix-umanaged-volumes-snapshots-missing.patch
- d/p/0002-3PAR-Error-out-if-vol-cannot-be-converted-to-base.patch
- api 4.0.17 is added as it is in the middle of the main patch
(4.0.18)
.
cinder (2:20.3.1-0ubuntu1.1) jammy; urgency=medium
.
* Revert driver assisted volume retype (LP: #2019190):
- d/p/0001-Revert-Driver-assisted-migration-on-retype-when-it-s.patch
.
cinder (2:20.3.1-0ubuntu1) jammy; urgency=medium
.
* New stable point release for OpenStack Yoga (LP: #2037332).
.
cinder (2:20.3.0-0ubuntu1) jammy; urgency=medium
.
* New stable point release for OpenStack Yoga (LP: #2025503).
* d/p/CVE-2023-2088.patch: Dropped. Fixed in point release.
.
cinder (2:20.2.0-0ubuntu1.1) jammy-security; urgency=medium
.
* SECURITY UPDATE: Unauthorized File Access (LP: #2021980)
- debian/patches/CVE-2023-2088.patch: Reject unsafe delete
attachment calls.
- CVE-2023-2088
.
cinder (2:20.2.0-0ubuntu1) jammy; urgency=medium
.
* New stable point release for OpenStack Yoga (LP: #2019759).
* d/p/lp1945500.patch: Dropped. Fixed in stable point release.
.
cinder (2:20.1.0-0ubuntu2.2) jammy-security; urgency=medium
.
* SECURITY REGRESSION: Regressions in other projects (LP: #2020111)
- debian/patches/series: Do not apply CVE-2023-2088.patch until
patches are ready for all upstream OpenStack projects.
- CVE-2023-2088
.
cinder (2:20.1.0-0ubuntu2.1) jammy-security; urgency=medium
.
* SECURITY UPDATE: Unauthorized File Access
- debian/patches/CVE-2023-2088.patch: Reject unsafe delete
attachment calls.
- CVE-2023-2088
.
cinder (2:20.1.0-0ubuntu2) jammy; urgency=medium
.
* d/p/lp1945500.patch: Filter reserved image properties (LP: #1945500).
.
cinder (2:20.1.0-0ubuntu1) jammy; urgency=medium
.
* New stable point release for OpenStack Yoga (LP: #2004030).
.
cinder (2:20.0.1-0ubuntu1) jammy; urgency=medium
.
* d/gbp.conf: Create stable/yoga branch.
* New stable point release for OpenStack Yoga (LP: #1985084).
.
cinder (2:20.0.0-0ubuntu1) jammy; urgency=medium
.
* d/watch: Scope to 20.x.
* New upstream release for OpenStack Yoga.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:19.0.0+git2022030310.b49fb59a6-0ubuntu2) jammy; urgency=medium
.
* d/p/fix-qos-computation.patch: Cherry-pick from upstream review to
fix TypeError exception when generating QOS feature name (LP: #1948507).
.
cinder (2:19.0.0+git2022030310.b49fb59a6-0ubuntu1) jammy; urgency=medium
.
* New upstream snapshot for OpenStack Yoga.
.
cinder (2:19.0.0+git2022011215.23494a6d6-0ubuntu1) jammy; urgency=medium
.
* New upstream snapshot for OpenStack Yoga.
* d/control, d/rules: Bump debhelper compat to 13.
.
cinder (2:19.0.0+git2021120811.e5ef39604-0ubuntu2) jammy; urgency=medium
.
* d/t/control: Add allow-stderr restriction to prevent autopkgtest failure
when SQLAlchemy issues a warning.
.
cinder (2:19.0.0+git2021120811.e5ef39604-0ubuntu1) jammy; urgency=medium
.
* New upstream snapshot for OpenStack Yoga.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:19.0.0-0ubuntu2) impish; urgency=medium
.
* d/py3dist-overrides: Add SQLAlchemy to ensure d/control is not overridden.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:19.0.0-0ubuntu1) impish; urgency=medium
.
* d/watch: Scope to 19.x.
* New upstream release for OpenStack Xena.
.
cinder (2:19.0.0~b1+git2021091409.768b8996b-0ubuntu1) impish; urgency=medium
.
* New upstream snapshot for OpenStack Xena.
.
cinder (2:18.0.0+git2021072116.81f2aaeea-0ubuntu1) impish; urgency=medium
.
* New upstream snapshot for OpenStack Xena.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:18.0.0+git2021061414.d5f0e5187-0ubuntu1) impish; urgency=medium
.
* New upstream snapshot for OpenStack Xena.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:18.0.0-0ubuntu3) hirsute; urgency=medium
.
* d/p/skip-victoria-failures.patch: Restored and rebased. This is still
necessary for Launchpad builds.
.
cinder (2:18.0.0-0ubuntu2) hirsute; urgency=medium
.
* d/p/skip-victoria-failures.patch: Dropped. Fixed upstream.
* d/p/add-mock-psutil-in-quobyte-tests.patch: Dropped. Fixed upstream.
.
cinder (2:18.0.0-0ubuntu1) hirsute; urgency=medium
.
* New upstream release for OpenStack Wallaby.
.
cinder (2:18.0.0~b1-0ubuntu2) hirsute; urgency=medium
.
* d/py3dist-overrides: Add boto3 which is a Suggests.
.
cinder (2:18.0.0~b1-0ubuntu1) hirsute; urgency=medium
.
* d/watch: Track 18.x series.
* New upstream milestone for OpenStack Wallaby.
* d/control: Align (Build-)Depends with upstream.
* d/p/skip-moto-tests.patch: Skip test dependency that is not yet
packaged in Ubuntu and was added late in cycle.
* d/p/patch-botocore-exceptions.patch: Account for changes to botocore
vendored exceptions.
.
cinder (2:17.0.1+git2021012507.d26092348-0ubuntu3) hirsute; urgency=medium
.
* d/*: Remove tgt in favor of targetcli-fb.
.
cinder (2:17.0.1+git2021012507.d26092348-0ubuntu2) hirsute; urgency=medium
.
* d/p/add-mock-psutil-in-quobyte-tests.patch: Add a mock of psutil
disk_partitions to fix failing unit test (LP: #1913607).
.
cinder (2:17.0.1+git2021012507.d26092348-0ubuntu1) hirsute; urgency=medium
.
* New upstream snapshot for OpenStack Wallaby.
.
cinder (2:17.0.1+git2021010614.a9c922ab7-0ubuntu1) hirsute; urgency=medium
.
* New upstream snapshot for OpenStack Wallaby.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:17.0.1+git2020120911.d3ffa90ba-0ubuntu1) hirsute; urgency=medium
.
* New upstream snapshot for OpenStack Wallaby.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:17.0.0-0ubuntu1) groovy; urgency=medium
.
* New upstream release for OpenStack Victoria.
.
cinder (2:17.0.0~rc2-0ubuntu1) groovy; urgency=medium
.
* d/control: Update VCS paths for move to lp:~ubuntu-openstack-dev.
* d/watch: Track 17.x series.
* New upstream release candidate for OpenStack Victoria.
* d/control: Align (Build-)Depends with upstream.
.
cinder (2:17.0.0~b3~git2020091007.afcaf0b9d-0ubuntu3) groovy; urgency=medium
.
* d/py3dist-overrides: Add python3-zstd to py3dist-overrides.
.
cinder (2:17.0.0~b3~git2020091007.afcaf0b9d-0ubuntu2) groovy; urgency=medium
.
* d/p/skip-victoria-failures.patch: Restored to skip failing unit tests.
.
cinder (2:17.0.0~b3~git2020091007.afcaf0b9d-0ubuntu1) groovy; urgency=medium
.
* d/control: Remove Breaks/Replaces that are older than Focal (LP: #1878419).
* New upstream snapshot for OpenStack Victoria.
* d/control: Align (Build-)Depends with upstream.
* d/p/*: Removed. Changes landed upstream and tests fixed.
* d/control: Add new python3-zstd package to depends.
.
cinder (2:17.0.0~b2~git2020073012.2124f39f9-0ubuntu1) groovy; urgency=medium
.
* New upstream snapshot for OpenStack Victoria.
* d/p/*: Refreshed.
.
cinder (2:17.0.0~b1~git2020062409.85fcf1057-0ubuntu1) groovy; urgency=medium
.
* SECURITY UPDATE: Dell EMC ScaleIO/VxFlex OS Backend Credentials Exposure
(LP: #1823200)
- Remove VxFlex OS credentials from connection_properties. Passwords are
now stored in separate file and are retrieved during each attach/detach
operation. Cinder is patched in 16.1.0 stable point release.
- d/control: Align (Build-)Depends with min version of python3-os-brick
required to fix credential exposure.
- CVE-2020-10755
* New upstream snapshot for OpenStack Victoria.
* d/control: Align (Build-)Depends with upstream.
* d/p/py38skip.patch: Dropped. No longer needed.
* d/p/skip-victoria-failures.patch: Rebased and updated with upstream bug.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to cinder in Ubuntu.
https://bugs.launchpad.net/bugs/2019190
Title:
[SRU][RBD] Retyping of in-use boot volumes renders instances unusable
(possible data corruption)
Status in Cinder:
New
Status in Cinder wallaby series:
New
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive antelope series:
Fix Released
Status in Ubuntu Cloud Archive bobcat series:
Fix Released
Status in Ubuntu Cloud Archive caracal series:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
Fix Released
Status in Ubuntu Cloud Archive zed series:
Fix Released
Status in OpenStack Compute (nova):
Invalid
Status in cinder package in Ubuntu:
Fix Released
Status in cinder source package in Jammy:
Fix Released
Status in cinder source package in Lunar:
Won't Fix
Status in cinder source package in Mantic:
Fix Released
Status in cinder source package in Noble:
Fix Released
Bug description:
[Impact]
See the bug description for full details. In short, a patch that
landed in the Wallaby release introduced a regression whereby
retyping an in-use volume leaves the attached volume in an
inconsistent state, with potential for data corruption. As a result,
the vm does not receive updated connection_info from Cinder and
keeps pointing to the old volume, even after a reboot.
[Test Plan]
* Deploy Openstack with two Cinder RBD storage backends (different pools)
* Create two volume types
* Boot a vm from volume: openstack server create --wait --image jammy --flavor m1.small --key-name testkey --nic net-id=8c74f1ef-9231-46f4-a492-eccdb7943ecd testvm --boot-from-volume 10
* Retype the volume to type B: openstack volume set --type typeB --retype-policy on-demand <volume>
* Go to compute host running vm and check that the vm is now copying data to the new location e.g.
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='none' discard='unmap'/>
<auth username='cinder-ceph'>
<secret type='ceph' uuid='01b65a79-22a3-4672-80e7-5a47b0e5581a'/>
</auth>
<source protocol='rbd' name='cinder-ceph/volume-b68be47d-f526-4f98-a77b-a903bf8b6c65' index='1'>
<host name='10.5.2.236' port='6789'/>
</source>
<mirror type='network' job='copy'>
<format type='raw'/>
<source protocol='rbd' name='cinder-ceph-alt/volume-c6b55b4c-a540-4c39-ad1f-626c964ae3e1' index='2'>
<host name='10.5.2.236' port='6789'/>
<auth username='cinder-ceph-alt'>
<secret type='ceph' uuid='e089e27e-3a2f-49d6-b6d9-770f52177eb1'/>
</auth>
</source>
<backingStore/>
</mirror>
<target dev='vda' bus='virtio'/>
<serial>b68be47d-f526-4f98-a77b-a903bf8b6c65</serial>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</disk>
which will eventually settle and change to:
<disk type='network' device='disk'>
<driver name='qemu' type='raw' cache='none' discard='unmap'/>
<auth username='cinder-ceph-alt'>
<secret type='ceph' uuid='e089e27e-3a2f-49d6-b6d9-770f52177eb1'/>
</auth>
<source protocol='rbd' name='cinder-ceph-alt/volume-c6b55b4c-a540-4c39-ad1f-626c964ae3e1' index='2'>
<host name='10.5.2.236' port='6789'/>
</source>
<backingStore/>
<target dev='vda' bus='virtio'/>
<serial>b68be47d-f526-4f98-a77b-a903bf8b6c65</serial>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</disk>
* And lastly a reboot of the vm should be successful.
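The libvirt check in the steps above could be scripted. The following is a minimal sketch and not part of the original test plan. It assumes the XML has been captured on the compute host (for example with `virsh dumpxml <domain>`), and the helper name `rbd_disk_states` is hypothetical. It reports the RBD image each disk currently uses and, while a `<mirror>` element is present, the image being copied to.

```python
import xml.etree.ElementTree as ET


def rbd_disk_states(xml_text):
    """Return the current RBD source and any in-flight copy target per disk.

    Hypothetical helper: accepts either a full libvirt domain XML document
    or a single <disk> element, as text.
    """
    root = ET.fromstring(xml_text)
    states = []
    for disk in root.iter("disk"):  # iter() also yields root if it is a <disk>
        source = disk.find("source")  # direct child: volume the guest uses now
        if source is None or source.get("protocol") != "rbd":
            continue
        # A <mirror> child is only present while the block copy is running.
        mirror_source = disk.find("mirror/source")
        states.append({
            "source": source.get("name"),
            "copy_target": None if mirror_source is None else mirror_source.get("name"),
        })
    return states


# Abbreviated versions of the two <disk> elements shown in the test plan.
COPYING = """
<disk type='network' device='disk'>
  <source protocol='rbd' name='cinder-ceph/volume-b68be47d-f526-4f98-a77b-a903bf8b6c65' index='1'/>
  <mirror type='network' job='copy'>
    <source protocol='rbd' name='cinder-ceph-alt/volume-c6b55b4c-a540-4c39-ad1f-626c964ae3e1' index='2'/>
  </mirror>
</disk>
"""

SETTLED = """
<disk type='network' device='disk'>
  <source protocol='rbd' name='cinder-ceph-alt/volume-c6b55b4c-a540-4c39-ad1f-626c964ae3e1' index='2'/>
</disk>
"""

for label, xml_text in (("copying", COPYING), ("settled", SETTLED)):
    print(label, rbd_disk_states(xml_text))
```

With the fix in place, `source` should end up naming the image in the new pool and `copy_target` should be empty once the copy has settled.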
[Regression Potential]
Given that the current state is potential data corruption and the patch will fix this by successfully refreshing connection info, I do not see a regression potential. It is in fact fixing a regression.
-------------------------------------------------------------------------
While trying out the volume retype feature in cinder, we noticed that a rebooted instance
either fails to come back online and is stuck in an error state or, if it does come back
online, has a corrupted filesystem.
## Observations
Say there are two volume types, `fast` (stored in ceph pool `volumes`) and `slow`
(stored in ceph pool `volumes.hdd`). Before the retyping we can see that the volume
is, for example, present in the `volumes.hdd` pool and has a watcher accessing the
volume.
```sh
[ceph: root@mon0 /]# rbd ls volumes.hdd
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
[ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers:
watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
```
Starting the retyping process using the migration policy `on-demand` for that volume either
via the horizon dashboard or the CLI causes the volume to be correctly transferred to the
`volumes` pool within the ceph cluster. However, the watcher does not get transferred, so
nobody is accessing the volume after it has been transferred.
```sh
[ceph: root@mon0 /]# rbd ls volumes
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
[ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers: none
```
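The watcher check above can be automated when repeating the observation across several volumes. This is a rough sketch that assumes the `rbd status` output looks like the two transcripts shown here; other Ceph releases may format it differently. The function name `parse_watchers` is hypothetical.

```python
def parse_watchers(status_output):
    """Extract watcher lines from the text printed by `rbd status <pool>/<image>`.

    Assumes the layout seen in this report: either "Watchers: none" or a
    "Watchers:" header followed by indented "watcher=..." lines.
    """
    watchers = []
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("watcher="):
            watchers.append(line)
    return watchers


# Output copied from the transcripts in this report.
BEFORE_RETYPE = (
    "Watchers:\n"
    "\twatcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456\n"
)
AFTER_RETYPE = "Watchers: none\n"

for label, text in (("before", BEFORE_RETYPE), ("after", AFTER_RETYPE)):
    found = parse_watchers(text)
    print(label, "watchers:", len(found))
```

An attached, in-use volume with no watchers after the retype matches the symptom described in this report.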
Taking a look at the libvirt XML of the instance in question, one can see that the `rbd`
volume path does not change after the retyping is completed. Therefore, if the instance is
restarted, nova will not be able to find its volume, preventing the instance from starting.
#### Pre retype
```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
<host name='2001:XX:XX:XXX::a088' port='6789'/>
<host name='2001:XX:XX:XXX::3af1' port='6789'/>
<host name='2001:XX:XX:XXX::ce6f' port='6789'/>
</source>
[...]
```
#### Post retype (no change)
```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9' index='1'>
<host name='2001:XX:XX:XXX::a088' port='6789'/>
<host name='2001:XX:XX:XXX::3af1' port='6789'/>
<host name='2001:XX:XX:XXX::ce6f' port='6789'/>
</source>
[...]
```
### Possible cause
While looking through the code that is responsible for the volume retype we found a function
`swap_volume`, which by our understanding should be responsible for fixing the association
above. As we understand it, cinder should use an internal API path to let nova perform this
action. This doesn't seem to happen.
(`_swap_volume`:
https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)
## Further observations
If one tries to regenerate the libvirt XML by e.g. live migrating the instance and rebooting the
instance after, the filesystem gets corrupted.
## Environmental Information and possibly related reports
We are running the latest version of TripleO Wallaby using the hardened (whole disk)
overcloud image for the nodes.
Cinder Volume Version: `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`
### Possibly related
- https://bugzilla.redhat.com/show_bug.cgi?id=1293440
(might want to paste the above to a markdown file for better
readability)
To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions