[Bug 2091033] Re: Un-proxied libvirt calls list(All)Devices() can cause nova-compute to freeze for hours
Heitor Alves de Siqueira
2091033 at bugs.launchpad.net
Mon Jun 23 14:08:09 UTC 2025
Hi hopem,
thanks for the debdiffs! The proposed patches seem to be missing
mandatory DEP-3 headers (notably, Origin: and Bug-Ubuntu:), could you
please add those?
Also, could you please give a more detailed account of potential
regressions in the SRU template "[Regression Potential]"? It's usually
expected that SRUs will not introduce regressions, but this section can
help us identify relevant code paths or test scenarios that could show
different behavior pre/post patch. For example, [0] seems to change some
internal APIs for device listing, which could be risky if we overlook
any consumers of these functions.
This fix also seems to be related to a couple of other Launchpad bugs
(the description alone already lists bug 1840912 and bug 2098892, both of
which are still open). It'd be great if we could clean this up, either
by closing the other bugs, marking them as duplicates, or explaining how
these fixes are being tracked.
[0] libvirt: Wrap un-proxied listDevices() and listAllDevices() :
https://review.opendev.org/q/I60d6f04d374e9ede5895a43b7a75e955b0fea3c5
** Changed in: nova (Ubuntu Jammy)
Status: In Progress => Incomplete
** Changed in: nova (Ubuntu Noble)
Status: In Progress => Incomplete
** Changed in: nova (Ubuntu Oracular)
Status: In Progress => Incomplete
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2091033
Title:
Un-proxied libvirt calls list(All)Devices() can cause nova-compute to
freeze for hours
Status in Ubuntu Cloud Archive:
New
Status in Ubuntu Cloud Archive antelope series:
Won't Fix
Status in Ubuntu Cloud Archive bobcat series:
Won't Fix
Status in Ubuntu Cloud Archive caracal series:
New
Status in Ubuntu Cloud Archive dalmatian series:
New
Status in Ubuntu Cloud Archive yoga series:
New
Status in Ubuntu Cloud Archive zed series:
Won't Fix
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) 2024.1 series:
Fix Committed
Status in OpenStack Compute (nova) 2024.2 series:
Fix Committed
Status in OpenStack Compute (nova) antelope series:
Fix Committed
Status in OpenStack Compute (nova) bobcat series:
Fix Released
Status in nova package in Ubuntu:
Fix Released
Status in nova source package in Jammy:
Incomplete
Status in nova source package in Noble:
Incomplete
Status in nova source package in Oracular:
Incomplete
Bug description:
[Impact]
Nova uses eventlet.tpool.Proxy to defer actions/commands that would
otherwise risk starving eventlets. This patch fixes an issue where
virNodeDevice objects returned from libvirt were not wrapped by the
proxy and were therefore executed outside the thread pool, which leads
to starvation. Two patches are required to fix this issue: the first is
the one in this bug and the second fixes a regression subsequently
identified by the first patch (bug 2098892).
[Test Plan]
* Deploy OpenStack Yoga with SR-IOV enabled. Create and delete lots of VMs over a period of several hours, if not days.
* Ensure that the amount of time nova.compute.resource_tracker takes to run does not continuously increase (https://github.com/dosaboy/openstack-analysis can be used to determine this).
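As a rough illustration of the second step, the check for "continuously increasing" run time can be expressed as a least-squares slope over successive run durations. This is only a sketch and is not taken from the openstack-analysis tool: it assumes the per-run durations (in seconds) have already been extracted from the logs, and the names run_time_slope, is_growing and the tolerance value are hypothetical.

```python
def run_time_slope(durations):
    """Least-squares slope of run duration versus run index (seconds per run)."""
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two runs to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(durations) / n
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(durations))
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    return sxy / sxx


def is_growing(durations, tolerance=0.01):
    # tolerance is an assumed threshold in seconds per run; tune it to the
    # noise level seen on the deployment under test.
    return run_time_slope(durations) > tolerance
```

A series that hovers around a constant value should give a slope near zero, while a leak-like pattern gives a clearly positive slope.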
[Regression Potential]
* no regression potential is expected as a result of this set of
patches.
--------------------------------------------------------------------------
tl;dr This bug has the same root cause as
https://bugs.launchpad.net/nova/+bug/1840912 where items in lists
returned from libvirt are not automatically wrapped in a tpool.Proxy.
This was discovered during investigation of a downstream bug [1] in
which a live migration was dirtying memory faster than it could be
transferred, and nova-compute froze for hours, unable to perform any
other operations, not even logging.
The freezing was tracked down to the un-proxied libvirt call
listAllDevices(), which could block all other greenthreads. The
listAllDevices() call occurs during the update_available_resource()
periodic task, in the libvirt driver's _get_pci_passthrough_devices().
A Guru Meditation Report (GMR) collected during a repro of the issue
contained a traceback showing this [2]:
stderr F /usr/lib/python3.6/site-packages/oslo_service/periodic_task.py:222 in run_periodic_tasks
stderr F `task(self, context)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9142 in update_available_resource
stderr F `startup=startup)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/manager.py:9056 in _update_available_resource_for_node
stderr F `startup=startup)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/compute/resource_tracker.py:911 in update_available_resource
stderr F `resources = self.driver.get_available_resource(nodename)`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8369 in get_available_resource
stderr F `data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in _get_pci_passthrough_devices
stderr F `in devices.items() if "pci" in dev.listCaps()]`
stderr F
stderr F /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:7080 in <listcomp>
stderr F `in devices.items() if "pci" in dev.listCaps()]`
stderr F
stderr F /usr/lib64/python3.6/site-packages/libvirt.py:6313 in listCaps
stderr F `ret = libvirtmod.virNodeDeviceListCaps(self._o)`
The listAllDevices() function returned a list of unwrapped
virNodeDevice objects and so calling listCaps() on such an unwrapped
device could cause a freeze.
Based on the above, the bug reporter was able to test a patch [3] to
wrap listAllDevices() list items in tpool.Proxy and the result showed
nova-compute no longer freezing [4] in the aforementioned scenario.
During investigation it was also noticed that the list items returned
by listDevices() were not wrapped in tpool.Proxy either, so the patch
fixes this as well.
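The general shape of the problem and the fix can be sketched as follows. This is not the code from the patch [3]: to stay self-contained it uses a hypothetical ThreadProxy class as a stand-in for eventlet.tpool.Proxy, and FakeConnection / FakeNodeDevice as stand-ins for the libvirt connection and virNodeDevice objects. The stand-in only shows that method calls on wrapped items run in a worker thread; it does not reproduce eventlet's cooperative yielding.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_POOL = ThreadPoolExecutor(max_workers=4)


class ThreadProxy:
    """Hypothetical stand-in for eventlet.tpool.Proxy (not the real class)."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def call_in_worker(*args, **kwargs):
            # Run the wrapped method in a pool thread and wait for its result.
            return _POOL.submit(attr, *args, **kwargs).result()

        return call_in_worker


class FakeNodeDevice:
    """Hypothetical stand-in for libvirt's virNodeDevice."""

    def __init__(self, caps):
        self._caps = caps

    def listCaps(self):
        # Return the capabilities plus the name of the thread serving the call.
        return list(self._caps), threading.current_thread().name


class FakeConnection:
    """Hypothetical stand-in for a libvirt connection."""

    def listAllDevices(self, flags=0):
        return [FakeNodeDevice(["pci"]), FakeNodeDevice(["net"])]


def list_all_devices_proxied(conn, flags=0):
    # The items of the returned list are plain objects, so each one has to be
    # wrapped explicitly before its methods are called.
    return [ThreadProxy(dev) for dev in conn.listAllDevices(flags)]
```

Calling listCaps() on an item from conn.listAllDevices() runs in the calling thread, whereas the same call on an item from list_all_devices_proxied(conn) is served by a pool thread.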
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2312196
[2] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c13
[3] https://review.opendev.org/c/openstack/nova/+/932669
[4] https://bugzilla.redhat.com/show_bug.cgi?id=2312196#c21
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2091033/+subscriptions