[Bug 1978489] Re: libvirt / cgroups v2: cannot boot instance with more than 16 CPUs
Edward Hope-Morley
1978489 at bugs.launchpad.net
Wed Mar 6 11:29:05 UTC 2024
As a recap, this patch addresses the problem of moving VMs between hosts
running cgroups v1 (e.g. Ubuntu Focal) and cgroups v2 (e.g. Ubuntu Jammy).
cgroups v2 caps cpu.weight at 10K [1], so VMs with > 9 vcpus cannot boot
if they use the Nova default of 1024 * guest.vcpus for CPU shares. The
patch addresses the problem by no longer applying a default weight to
instances while keeping the option to apply quota:cpu_shares from a
flavor's extra-specs.
The consequences of this are:
1. VMs booted without quota:cpu_shares extra-specs after upgrading to this patch will have the default cgroups v2 weight of 100.
2. New VMs can get a higher weight if they use a flavor with extra-specs quota:cpu_shares, BUT this will only apply to existing VMs if they are resized to switch to the new/modified flavor, which requires workload downtime. A VM reboot will not pick up the new value.
3. VMs created from a flavor with extra-specs quota:cpu_shares set to a value > 10K will fail to boot. Fixing this requires a new/modified flavor with an adjusted value and then a VM resize to consume it, and therefore workload downtime.
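The three cases above can be sketched in a few lines of Python. This is only an illustration of the behaviour described in this comment, not Nova code: the function name boot_outcome and the constants are hypothetical, and the sketch assumes the flavor value reaches cpu.weight unchanged on a cgroups v2 host.

```python
# Illustrative only: models the post-patch outcomes described above.
# boot_outcome and both constants are hypothetical names, not Nova APIs.
CGROUP_V2_DEFAULT_WEIGHT = 100    # weight used when no shares are requested
CGROUP_V2_MAX_WEIGHT = 10000      # upper bound on cpu.weight


def boot_outcome(extra_specs: dict) -> str:
    """Describe what happens to a VM on a cgroups v2 host after the patch."""
    raw = extra_specs.get("quota:cpu_shares")
    if raw is None:
        # Case 1: no default weight is applied any more.
        return f"boots with default cpu.weight {CGROUP_V2_DEFAULT_WEIGHT}"
    shares = int(raw)
    if shares > CGROUP_V2_MAX_WEIGHT:
        # Case 3: value is out of range, so the VM does not start.
        return "fails to boot: value out of range"
    # Case 2: the flavor value is used as-is.
    return f"boots with cpu.weight {shares}"


if __name__ == "__main__":
    for specs in ({}, {"quota:cpu_shares": "2048"}, {"quota:cpu_shares": "16384"}):
        print(specs, "->", boot_outcome(specs))
```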
It is important to note that point 3 is not a consequence of this patch;
it is neither introduced nor resolved by it and will require a separate
patch. One way to resolve it could be to have Nova cap quota:cpu_shares
at the cgroup cpu.weight maximum and log a warning saying so, so that
instances at least boot and get the maximum weight. I am therefore in
favour of proceeding with this SRU to give users a way to migrate from
v1 to v2, and suggest we propose a new patch to address the flavor
extra-specs issue. As @jamespage has pointed out, there are some interim
manual solutions that can be used as a stop-gap until this is fully
resolved in Nova.
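A minimal sketch of the capping idea suggested above, assuming a helper that runs before the value is handed to libvirt. The name cap_cpu_shares and the constant are hypothetical; this is not an existing Nova function and only shows the clamp-and-warn behaviour.

```python
import logging

LOG = logging.getLogger(__name__)

# Assumed upper bound for cpu.weight under cgroups v2 (10K, as noted above).
CGROUP_V2_MAX_WEIGHT = 10000


def cap_cpu_shares(requested: int, max_weight: int = CGROUP_V2_MAX_WEIGHT) -> int:
    """Return the requested shares, clamped to max_weight with a warning."""
    if requested > max_weight:
        LOG.warning(
            "quota:cpu_shares=%d exceeds the cgroup maximum of %d; capping",
            requested, max_weight,
        )
        return max_weight
    return requested


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(cap_cpu_shares(2048))    # within range, returned unchanged
    print(cap_cpu_shares(16384))   # above the limit, capped with a warning
```

With a cap like this, an over-limit flavor would still boot at the maximum weight instead of failing, at the cost of silently flattening differences between flavors above the limit.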
[1] https://www.kernel.org/doc/Documentation/cgroup-v2.txt
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1978489
Title:
libvirt / cgroups v2: cannot boot instance with more than 16 CPUs
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive yoga series:
Fix Committed
Status in OpenStack Compute (nova):
In Progress
Status in nova package in Ubuntu:
Confirmed
Status in nova source package in Jammy:
Fix Committed
Bug description:
Description
===========
Using the libvirt driver and a host OS that uses cgroups v2 (RHEL 9,
Ubuntu Jammy), an instance with more than 9 CPUs cannot be booted.
Steps to reproduce
==================
1. Boot an instance with 10 (or more) CPUs on RHEL 9 or Ubuntu Jammy
using Nova with the libvirt driver.
Expected result
===============
Instance boots.
Actual result
=============
Instance fails to boot with a 'Value specified in CPUWeight is out of
range' error.
Environment
===========
Originally reported as a libvirt bug in RHEL 9 [1]
Additional information
======================
This is happening because Nova defaults to 1024 * (# of CPUs) for the
value of domain/cputune/shares in the libvirt XML. This is then passed
directly by libvirt to the cgroups API, but cgroups v2 has a maximum
value of 10000. 10000 / 1024 ~= 9.76, so any instance with 10 or more
CPUs exceeds the limit.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2035518
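The arithmetic in the description can be checked with a short Python sketch. The constant and function names are made up for illustration; only the 1024-per-CPU default and the 10000 limit come from the report.

```python
# Values taken from the bug description; names are illustrative only.
DEFAULT_SHARES_PER_VCPU = 1024
CGROUP_V2_MAX_WEIGHT = 10000


def default_shares(vcpus: int) -> int:
    """Shares Nova would request by default for a guest with this many CPUs."""
    return DEFAULT_SHARES_PER_VCPU * vcpus


# Largest CPU count whose default shares still fit under the limit.
max_vcpus = CGROUP_V2_MAX_WEIGHT // DEFAULT_SHARES_PER_VCPU

if __name__ == "__main__":
    print("largest CPU count that fits:", max_vcpus)
    for vcpus in (9, 10, 16):
        shares = default_shares(vcpus)
        status = "ok" if shares <= CGROUP_V2_MAX_WEIGHT else "out of range"
        print(f"{vcpus} CPUs -> shares={shares} ({status})")
```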
====================================
Ubuntu SRU Details:
[Impact]
See above.
[Test Case]
See above.
[Regression Potential]
We've had this change in other jammy-based versions of the nova package for a while now, including zed, antelope, bobcat.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1978489/+subscriptions