[Bug 1944759] Re: [SRU] confirm resize fails with CPUUnpinningInvalid
Andreas Hasenack
1944759 at bugs.launchpad.net
Mon Nov 11 19:18:44 UTC 2024
Hello Balazs, or anyone else affected,
Accepted nova into focal-proposed. The package will build now and be
available at
https://launchpad.net/ubuntu/+source/nova/2:21.2.4-0ubuntu2.14 in a few
hours, and then in the -proposed repository.
Please help us by testing this new package. See
https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how
to enable and use -proposed. Your feedback will aid us in getting this
update out to other Ubuntu users.
If this package fixes the bug for you, please add a comment to this bug,
mentioning the version of the package you tested, what testing has been
performed on the package and change the tag from verification-needed-
focal to verification-done-focal. If it does not fix the bug for you,
please add a comment stating that, and change the tag to verification-
failed-focal. In either case, without details of your testing we will
not be able to proceed.
Further information regarding the verification process can be found at
https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in
advance for helping!
N.B. The updated package will be released to -updates after the bug(s)
fixed by this package have been verified and the package has been in
-proposed for a minimum of 7 days.
** Changed in: nova (Ubuntu Focal)
Status: Triaged => Fix Committed
** Tags added: verification-needed verification-needed-focal
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1944759
Title:
[SRU] confirm resize fails with CPUUnpinningInvalid
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive ussuri series:
Triaged
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) ussuri series:
New
Status in nova package in Ubuntu:
Invalid
Status in nova source package in Focal:
Fix Committed
Bug description:
* SRU DESCRIPTION BELOW *
Nova has a race condition between the resize_instance() compute manager
call and the update_available_resource periodic job. If they overlap
at the right place, when resize_instance calls finish_resize, then the
periodic job will not track the migration or the instance on the
source host. As a result, the PCPU allocation on the source host is
dropped in the resource tracker (not in placement). Then, when the
resize is confirmed, nova tries to free the pinned CPUs again on the
source host and fails with CPUUnpinningInvalid because they are already
freed.
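To make the failure mode concrete, here is a minimal sketch of the double-free described above. It is a toy model, not nova code: ToyHostCell, periodic_rebuild and simulate are hypothetical names, and the only assumption is that the periodic job rebuilds the pinned set from whatever it currently tracks.

```python
class CPUUnpinningInvalid(Exception):
    """Toy stand-in for the nova exception of the same name."""


class ToyHostCell:
    """Hypothetical, simplified pinned-CPU bookkeeping for one host."""

    def __init__(self):
        self.pinned = set()

    def pin(self, cpus):
        self.pinned |= set(cpus)

    def unpin(self, cpus):
        cpus = set(cpus)
        # Freeing CPUs that are not recorded as pinned is an error.
        if not cpus <= self.pinned:
            raise CPUUnpinningInvalid(
                f"cannot unpin {sorted(cpus)}; pinned: {sorted(self.pinned)}"
            )
        self.pinned -= cpus


def periodic_rebuild(cell, tracked_pinnings):
    """Assumed behaviour: rebuild pinned state from what is tracked."""
    cell.pinned = set()
    for cpus in tracked_pinnings:
        cell.pin(cpus)


def simulate(periodic_sees_instance):
    source = ToyHostCell()
    instance_cpus = {0, 1}
    source.pin(instance_cpus)

    # During the race window the periodic job tracks neither the
    # migration nor the instance, so the pinning is silently dropped.
    tracked = [instance_cpus] if periodic_sees_instance else []
    periodic_rebuild(source, tracked)

    try:
        source.unpin(instance_cpus)  # what confirming the resize does
    except CPUUnpinningInvalid:
        return "CPUUnpinningInvalid"
    return "confirmed"
```

Calling simulate(True) models the normal path; simulate(False) models the race window, where the confirm step tries to free CPUs that the source host no longer records as pinned.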
I've pushed a reproduction test:
https://review.opendev.org/c/openstack/nova/+/810763
It is reproducible at least on master, xena, wallaby, and victoria.
===============
SRU DESCRIPTION
===============
[Impact]
Due to a race condition, the tracking of pinned CPU resources can go
out of sync, causing "No valid host" errors when creating new
instances with CPU pinning, because the previously pinned CPUs were
not marked as freed.
Part of the cause is addressed in the fix for LP#1953359, where a
migration context does not point to the proper node during the race
condition window, resulting in a CPUPinningInvalid error. This fix
complements LP#1953359 by addressing the improper tracking of
resources that happens only when the resource tracker periodic job
runs on the source node while the registered flavor corresponds to
that of the destination. That is solved by setting
instance.old_flavor so the CPU pinning resources are tracked properly.
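The idea behind setting instance.old_flavor can be illustrated with a small sketch. This is not the upstream patch: ToyFlavor, ToyInstance and flavor_to_track are hypothetical names, and the selection rule below is an assumption made only to show why the source node needs the old flavor available while a resize is in flight.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyFlavor:
    name: str
    vcpus: int


@dataclass
class ToyInstance:
    flavor: ToyFlavor                       # already the destination flavor
    old_flavor: Optional[ToyFlavor] = None  # flavor used on the source


def flavor_to_track(instance, tracking_host, source_host):
    """Pick the flavor a host should account for (assumed rule).

    On the source host the resources in use are still those of the old
    flavor, so it is preferred whenever it has been recorded.
    """
    if tracking_host == source_host and instance.old_flavor is not None:
        return instance.old_flavor
    return instance.flavor


small = ToyFlavor("pinned.small", vcpus=2)
large = ToyFlavor("pinned.large", vcpus=4)

# Without old_flavor the source host would account for the new flavor.
untracked = ToyInstance(flavor=large)
# With old_flavor set, the source host accounts for what it really holds.
tracked = ToyInstance(flavor=large, old_flavor=small)
```

In this sketch, flavor_to_track(untracked, "src", "src") yields the destination flavor, which is the mismatch described above, while flavor_to_track(tracked, "src", "src") yields the old one.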
[Test case]
The test case for this is already implemented upstream as non-live
functional tests in
nova/tests/functional/libvirt/test_numa_servers.py:
- test_resize_dedicated_policy_race_on_dest_bug_1953359
- test_resize_confirm_bug_1944759
- test_resize_revert_bug_1944759
As this is a race condition, it is very difficult to validate, even
upstream, so the functional tests mock certain parts of the code to
simulate the entire scope of the workflow. Because they are non-live
functional tests, they are more akin to broad unit tests.
The test case for this SRU is to run the charmed-openstack-tester [1]
against an environment containing the upgraded package (essentially as
it would be in a point release SRU) and expect the tests to pass. Test
run evidence will be attached to LP.
[Regression Potential]
The code is considered stable today in newer releases and the scope of
the affected code is fairly limited. Given that it is a race condition
that is difficult to validate, despite the non-live functional tests,
the regression potential is moderate.
[Other Info]
None.
[1] https://github.com/openstack-charmers/charmed-openstack-tester
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1944759/+subscriptions