[Bug 2088620] Re: [SRU] Deprecated usage of cpu_util

Wed Sep 24 14:23:06 UTC 2025

Verification
============

Jammy / Yoga
------------

Since I am using the workload_balance strategy, which uses
instance_cpu_usage to suggest live migrations that better distribute
the workload, I used an environment with two compute hypervisors. I
will force instances onto one host so that Watcher sees one
overutilized host and one underutilized host.

Relevant portion of juju status:
Model       Controller  Cloud/Region          Version  SLA          Timestamp
sf00402123  ps-ps6      prodstack/prodstack6  3.6.9    unsupported  13:27:11Z

App                                 Version          Status  Scale  Charm                   Channel        Rev  Exposed  Message
...
nova-compute                        25.2.1           active      2  nova-compute            yoga/stable    783  no       Unit is ready
...
Unit                                     Workload  Agent  Machine  Public address  Ports           Message
nova-compute/0*                          active    idle   16       10.149.90.142                   Unit is ready
  ceilometer-agent/0*                    active    idle            10.149.90.142                   Unit is ready
  ovn-chassis/0*                         active    idle            10.149.90.142                   Unit is ready
nova-compute/1                           active    idle   17       10.149.90.180                   Unit is ready
  ceilometer-agent/1                     active    idle            10.149.90.180                   Unit is ready
  ovn-chassis/1                          active    idle            10.149.90.180                   Unit is ready
...

Observe two hypervisors
openstack hypervisor list
+--------------------------------------+---------------------------+-----------------+---------------+-------+
| ID                                   | Hypervisor Hostname       | Hypervisor Type | Host IP       | State |
+--------------------------------------+---------------------------+-----------------+---------------+-------+
| 47b28369-052e-4fd8-b2f4-6db3df5cd624 | juju-38abf1-sf00402123-17 | QEMU            | 10.149.90.180 | up    |
| 96112f46-b0f9-41e5-a429-3128e31b73e5 | juju-38abf1-sf00402123-16 | QEMU            | 10.149.90.142 | up    |
+--------------------------------------+---------------------------+-----------------+---------------+-------+

Create two servers on the same hypervisor (e.g. nova-compute/1):
openstack server create --image jammy --key-name testkey --flavor m1.small --network private --availability-zone nova:juju-38abf1-sf00402123-17 server1
openstack server create --image cirros-0.4.0 --key-name testkey --flavor m1.tiny --network private --availability-zone nova:juju-38abf1-sf00402123-17 server2

This is optional, but we can also create a VM on the other hypervisor to simulate workload on both machines:
openstack server create --image cirros-0.4.0 --key-name testkey --flavor m1.tiny --network private --availability-zone nova:juju-38abf1-sf00402123-16 server3

openstack server list
+--------------------------------------+---------+--------+------------------------+--------------+----------+
| ID                                   | Name    | Status | Networks               | Image        | Flavor   |
+--------------------------------------+---------+--------+------------------------+--------------+----------+
| 1f59b676-14ac-4a6b-9678-7c8b810c9e4e | server3 | ACTIVE | private=192.168.21.207 | cirros-0.4.0 | m1.tiny  |
| 1fea5424-2368-4f71-916c-ff5b76107834 | server2 | ACTIVE | private=192.168.21.123 | cirros-0.4.0 | m1.tiny  |
| 957ef38a-0a0b-4dce-aaf7-fd88951429bb | server1 | ACTIVE | private=192.168.21.248 | jammy        | m1.small |
+--------------------------------------+---------+--------+------------------------+--------------+----------+

Determine the metric UUIDs so we know what Watcher is tracking:

openstack metric list  # There are many metrics reported, but I only show the ones that are relevant to this issue:
+--------------------------------------+---------------------+---------------------------------+---------+--------------------------------------+
| id                                   | archive_policy/name | name                            | unit    | resource_id                          |
+--------------------------------------+---------------------+---------------------------------+---------+--------------------------------------+
| 4efc5b1f-a612-4212-a48f-83d551957e59 | ceilometer-low-rate | cpu                             | ns      | 957ef38a-0a0b-4dce-aaf7-fd88951429bb |
| 5269e2a0-4fc7-4ce2-95f5-14de16fae450 | ceilometer-low-rate | cpu                             | ns      | 1fea5424-2368-4f71-916c-ff5b76107834 |
| 9f7f9c8c-1667-46d7-838b-1492fc1efc8a | ceilometer-low-rate | cpu                             | ns      | 1f59b676-14ac-4a6b-9678-7c8b810c9e4e |
| 3c3aaa31-9553-482d-8ce2-b269df87092b | ceilometer-low      | compute.node.cpu.percent        | percent | ae7602e9-96ac-5414-9c11-f140339326f5 |
| 63e8aa43-b3d1-44ee-a473-8f508464b562 | ceilometer-low      | compute.node.cpu.percent        | percent | c328d411-2902-56fe-9642-694b1f8a97e8 |
...
+--------------------------------------+---------------------+---------------------------------+---------+--------------------------------------+

So 4efc5b1f-a612-4212-a48f-83d551957e59 is the uuid that tracks the cpu usage of server1 in ns, 5269e2a0-4fc7-4ce2-95f5-14de16fae450 tracks server2's cpu usage, 9f7f9c8c-1667-46d7-838b-1492fc1efc8a tracks server3's, and the remaining two track the hypervisors.

Wait a few minutes for ceilometer to populate gnocchi with metrics so we can see what an idle load looks like. After ~10 minutes, ssh to the VMs that are on the same host (server1 and server2) and create some busy-waiting load (e.g. stress --cpu 2 or yes > /dev/null).

juju ssh nova-compute/1
sudo ip netns exec ovnmeta-85650bd9-d438-42c2-8630-f4bc5545a97f ssh cirros@192.168.21.123
yes > /dev/null

juju ssh nova-compute/1
sudo ip netns exec ovnmeta-85650bd9-d438-42c2-8630-f4bc5545a97f ssh -i ~/testkey.priv ubuntu@192.168.21.248
yes > /dev/null

Observe the cpu time increase after making the VMs spin

openstack metric measures aggregation --metric 4efc5b1f-a612-4212-a48f-83d551957e59 --aggregation "rate:mean"
# server1, OS: jammy, on nova-compute/1
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2025-09-24T01:20:00+00:00 |       300.0 |   1670000000.0 |
| 2025-09-24T01:25:00+00:00 |       300.0 |   1150000000.0 |
| 2025-09-24T01:30:00+00:00 |       300.0 |   1430000000.0 |
| 2025-09-24T01:35:00+00:00 |       300.0 |  68770000000.0 |  # yes > /dev/null was started at some point during this interval
| 2025-09-24T01:40:00+00:00 |       300.0 | 285100000000.0 |
| 2025-09-24T01:45:00+00:00 |       300.0 | 285060000000.0 |
| 2025-09-24T01:50:00+00:00 |       300.0 | 285280000000.0 |
...

openstack metric measures aggregation --metric 5269e2a0-4fc7-4ce2-95f5-14de16fae450 --aggregation "rate:mean"
# server2, OS: cirros, on nova-compute/1; its idle cpu usage is lower than that of the jammy machine above
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2025-09-24T01:20:00+00:00 |       300.0 |    240000000.0 |
| 2025-09-24T01:25:00+00:00 |       300.0 |    230000000.0 |
| 2025-09-24T01:30:00+00:00 |       300.0 |    260000000.0 |
| 2025-09-24T01:35:00+00:00 |       300.0 | 177160000000.0 |  # yes > /dev/null was started at some point during this interval
| 2025-09-24T01:40:00+00:00 |       300.0 | 285380000000.0 |
| 2025-09-24T01:45:00+00:00 |       300.0 | 286300000000.0 |
...

openstack metric measures aggregation --metric 9f7f9c8c-1667-46d7-838b-1492fc1efc8a --aggregation "rate:mean"
# server3, OS: cirros, completely idle (no busy-waiting load), on nova-compute/0
+---------------------------+-------------+-------------+
| timestamp                 | granularity |       value |
+---------------------------+-------------+-------------+
| 2025-09-24T01:20:00+00:00 |       300.0 | 200000000.0 |
| 2025-09-24T01:25:00+00:00 |       300.0 | 210000000.0 |
| 2025-09-24T01:30:00+00:00 |       300.0 | 210000000.0 |
| 2025-09-24T01:40:00+00:00 |       300.0 | 210000000.0 |
...

So at this point we have two VMs on nova-compute/1 that are spinning
at ~100% usage, and a completely idle VM on nova-compute/0. Watcher
should suggest migrating one of the spinning VMs to the other
hypervisor to better distribute the workload.
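
As a rough sanity check of the "~100% usage" figure, the rate:mean values above can be converted to percentages with the formula from the bug description, cpu_usage = [cpu_time / (period * 10^9 * nvcpus)] * 100%. The following is only a sketch: the function name is made up for illustration, and it assumes 1 vCPU for both m1.small and m1.tiny, which the output above does not show.

```python
def cpu_utilization_percent(cpu_time_ns, period_s, vcpus):
    """Percent CPU utilization from the CPU time (ns) used in one period."""
    return cpu_time_ns / (period_s * 1e9 * vcpus) * 100.0


# rate:mean values copied from the tables above (granularity 300 s).
observed = {
    "server1 (busy)": 285100000000.0,
    "server2 (busy)": 285380000000.0,
    "server3 (idle)": 210000000.0,
}

# Assumption: every flavor used here has a single vCPU.
VCPUS = 1

for name, cpu_time_ns in observed.items():
    pct = cpu_utilization_percent(cpu_time_ns, period_s=300, vcpus=VCPUS)
    print(f"{name}: {pct:.2f}%")
```

Under that assumption the two busy VMs come out at roughly 95% and the idle VM well below 1%. With 2 vCPUs per instance the same readings would halve.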

Create an audit
openstack optimize audit create   -t ONESHOT   -g workload_balancing   -s workload_balance --parameter threshold=50

openstack optimize audit show 9e34bf3d-a083-4fb2-9825-f7b31fc7f48e
+---------------+-----------------------------------------------------------------------------------------+
| Field         | Value                                                                                   |
+---------------+-----------------------------------------------------------------------------------------+
| UUID          | 9e34bf3d-a083-4fb2-9825-f7b31fc7f48e                                                    |
| Name          | workload_balance-2025-09-24T02:50:10.842906                                             |
| Created At    | 2025-09-24T02:50:11+00:00                                                               |
| Updated At    | 2025-09-24T02:50:11+00:00                                                               |
| Deleted At    | None                                                                                    |
| State         | SUCCEEDED                                                                               |
| Audit Type    | ONESHOT                                                                                 |
| Parameters    | {'metrics': 'instance_cpu_usage', 'threshold': 50, 'period': 300, 'granularity': 300}   |
| Interval      | None                                                                                    |
| Goal          | workload_balancing                                                                      |
| Strategy      | workload_balance                                                                        |
| Audit Scope   | []                                                                                      |
| Auto Trigger  | False                                                                                   |
| Next Run Time | None                                                                                    |
| Hostname      | juju-38abf1-sf00402123-24                                                               |
| Start Time    | None                                                                                    |
| End Time      | None                                                                                    |
| Force         | False                                                                                   |
+---------------+-----------------------------------------------------------------------------------------+

Check the generated actionplan and see that Watcher has no
recommendations

openstack optimize actionplan list
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+
| UUID                                 | Audit                                | State       | Updated At                | Global efficacy                |
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+
| e0907a11-aa7b-4f9a-935d-29073bb4672a | 9e34bf3d-a083-4fb2-9825-f7b31fc7f48e | SUCCEEDED   | 2025-09-24T02:50:11+00:00 | Live_migrations_count: 0.00 %  |
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+

Upgrade to -proposed but keep the workload the same
===================================================

Create another audit with the same details
openstack optimize audit create   -t ONESHOT   -g workload_balancing   -s workload_balance --parameter threshold=50

openstack optimize audit show da4e3c86-1978-477a-8a41-2bd650f0e59b
+---------------+---------------------------------------------------------------------------------------+
| Field         | Value                                                                                 |
+---------------+---------------------------------------------------------------------------------------+
| UUID          | da4e3c86-1978-477a-8a41-2bd650f0e59b                                                  |
| Name          | workload_balance-2025-09-24T04:05:42.948316                                           |
| Created At    | 2025-09-24T04:05:43+00:00                                                             |
| Updated At    | 2025-09-24T04:05:46+00:00                                                             |
| Deleted At    | None                                                                                  |
| State         | SUCCEEDED                                                                             |
| Audit Type    | ONESHOT                                                                               |
| Parameters    | {'threshold': 50, 'metrics': 'instance_cpu_usage', 'period': 300, 'granularity': 300} |
| Interval      | None                                                                                  |
| Goal          | workload_balancing                                                                    |
| Strategy      | workload_balance                                                                      |
| Audit Scope   | []                                                                                    |
| Auto Trigger  | False                                                                                 |
| Next Run Time | None                                                                                  |
| Hostname      | juju-38abf1-sf00402123-24                                                             |
| Start Time    | None                                                                                  |
| End Time      | None                                                                                  |
| Force         | False                                                                                 |
+---------------+---------------------------------------------------------------------------------------+

Observe that now Watcher suggests a VM migration

openstack optimize actionplan list
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+
| UUID                                 | Audit                                | State       | Updated At                | Global efficacy                |
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+
| e0907a11-aa7b-4f9a-935d-29073bb4672a | 9e34bf3d-a083-4fb2-9825-f7b31fc7f48e | SUCCEEDED   | 2025-09-24T02:50:11+00:00 | Live_migrations_count: 0.00 %  |
|                                      |                                      |             |                           |                                |
| cb28a2ae-d8ab-4eb2-a0fc-c231587495c2 | da4e3c86-1978-477a-8a41-2bd650f0e59b | RECOMMENDED | None                      | Live_migrations_count: 33.33 % |
|                                      |                                      |             |                           |                                |
+--------------------------------------+--------------------------------------+-------------+---------------------------+--------------------------------+

openstack optimize actionplan show cb28a2ae-d8ab-4eb2-a0fc-c231587495c2
+---------------------+-------------------------------------------------------------------+
| Field               | Value                                                             |
+---------------------+-------------------------------------------------------------------+
| UUID                | cb28a2ae-d8ab-4eb2-a0fc-c231587495c2                              |
| Created At          | 2025-09-24T04:05:46+00:00                                         |
| Updated At          | None                                                              |
| Deleted At          | None                                                              |
| Audit               | da4e3c86-1978-477a-8a41-2bd650f0e59b                              |
| Strategy            | workload_balance                                                  |
| State               | RECOMMENDED                                                       |
| Efficacy indicators | - Description: The number of VM migrations to be performed.       |
|                     |   Name: instance_migrations_count                                 |
|                     |   Unit: null                                                      |
|                     |   Value: 1.0                                                      |
|                     | - Description: The total number of audited instances in strategy. |
|                     |   Name: instances_count                                           |
|                     |   Unit: null                                                      |
|                     |   Value: 3.0                                                      |
|                     |                                                                   |
| Global efficacy     | Live_migrations_count: 33.33 %                                    |
|                     |                                                                   |
| Hostname            | None                                                              |
+---------------------+-------------------------------------------------------------------+
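
The 33.33 % global efficacy is consistent with the two indicators shown above: 1 migration out of 3 audited instances. The snippet below only reproduces that arithmetic; the function name is hypothetical and this is not taken from Watcher's code.

```python
def live_migrations_percent(migrations_count, instances_count):
    """Share of audited instances that the plan proposes to migrate, in percent."""
    if instances_count == 0:
        # Guard against division by zero when nothing was audited.
        return 0.0
    return migrations_count / instances_count * 100.0


# Indicator values copied from the actionplan above.
print(f"{live_migrations_percent(1.0, 3.0):.2f} %")
```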

** Tags removed: verification-needed verification-needed-jammy
** Tags added: verification-done verification-done-jammy

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2088620

Title:
  [SRU] Deprecated usage of cpu_util

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive antelope series:
  Fix Released
Status in Ubuntu Cloud Archive bobcat series:
  Fix Released
Status in Ubuntu Cloud Archive caracal series:
  Fix Released
Status in Ubuntu Cloud Archive dalmatian series:
  Fix Released
Status in Ubuntu Cloud Archive epoxy series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  Won't Fix
Status in watcher package in Ubuntu:
  Fix Released
Status in watcher source package in Focal:
  Confirmed
Status in watcher source package in Jammy:
  Fix Committed
Status in watcher source package in Noble:
  Fix Released
Status in watcher source package in Oracular:
  Fix Released

Bug description:
  [ Impact ]

    * The watcher releases targeted by this SRU are using a deprecated
  ceilometer metric, cpu_util, which previously reported cpu utilization
  as a percentage. This metric was deprecated in OpenStack Rocky in
  favor of "the gnocchi rate calculation equivalent" [1] - essentially
  meaning that the cpu utilization value should be obtained by
  performing a calculation with gnocchi's rates. The ceilometer metric
  cpu_util was then fully removed in Stein.

    * Upstream Watcher continued to use cpu_util until the commit at [2]
  landed on master for 2024.1. Since ceilometer no longer has a
  cpu_util metric, polling this metric returns "None". This means that
  all Watcher strategies that rely on cpu_util, particularly those
  relating to workload balancing and migration of VMs to under-utilized
  hosts, are non-functional from Stein, when the metric was removed,
  until Caracal.

    * This commit consumes the correct metric (cpu) and performs the
  utilization calculation as intended using gnocchi's rates. The
  calculation is summarized in the next bullet point.

    * Gnocchi takes the cumulative cpu time in ns (reported by the
  ceilometer metric "cpu") and consumes it as a rate (essentially it
  computes the difference in cumulative cpu time between consecutive
  samples) to find the total cpu time used during the previous
  sampling period. Dividing that cpu time by the duration of the
  interval in ns multiplied by the number of vcpus gives the cpu
  utilization as a percentage: cpu_usage = [cpu_time / (period * 10^9 *
  nvcpus)] * 100%, where period is in seconds. A sample calculation is
  provided in the original commit message.

    * I cherry-picked to stable/2023.2 [3], but the other branches have
  gone unmaintained.
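
  The rate step described above (turning a cumulative counter into per-interval cpu time) can be sketched as follows. This is an illustration only and not gnocchi's or Watcher's implementation: the helper names are invented, and the cumulative samples are made-up numbers for a hypothetical 2-vcpu instance sampled every 300 s.

```python
def per_interval_cpu_time(cumulative_ns):
    """Difference consecutive cumulative CPU-time samples (ns)."""
    return [later - earlier
            for earlier, later in zip(cumulative_ns, cumulative_ns[1:])]


def utilization_percent(cpu_time_ns, period_s, nvcpus):
    """cpu_usage = [cpu_time / (period * 10^9 * nvcpus)] * 100%."""
    return cpu_time_ns / (period_s * 10**9 * nvcpus) * 100


# Hypothetical cumulative "cpu" samples in ns, one every 300 s.
samples = [0, 150 * 10**9, 450 * 10**9, 1050 * 10**9]

for delta in per_interval_cpu_time(samples):
    # Each delta is the CPU time consumed during one 300 s interval.
    print(delta, utilization_percent(delta, period_s=300, nvcpus=2))
```

  With these made-up samples the three intervals work out to 25%, 50% and 100% utilization, the last one meaning both vcpus were busy for the whole interval.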

  [ Test Plan ]

    * Deploy openstack yoga on jammy with watcher and gnocchi services

    * Launch a server and take note of its resource id. Then find the
  gnocchi cpu metric associated with the instance via openstack metric
  resource list and openstack metric list. Note that there is no
  "cpu_util" metric.

    * Create a watcher audit based on a goal that previously depended on instance cpu utilization (from Watcher's perspective this is called instance_cpu_usage). For example, the workload_balance strategy [4] depends on instance_cpu_usage.
      Ex. openstack optimize audit create -t CONTINUOUS -i 60 -g workload_balancing -s workload_balance --auto-trigger

    * Without the patch, the workload_balance strategy does not work.
  The audit will be created, but it cannot provide any meaningful action
  plan since instance_cpu_usage is None in the audits. With the patch
  Watcher obtains the correct cpu utilization percentage and the
  strategies work as expected and suggest actions/actionplans.

  [ What can go wrong ]

    * While the patch restores functionality by calculating cpu
  utilization using gnocchi's rate metric, if gnocchi is misconfigured
  or the relevant "cpu" metric is missing, the new calculation may not
  work as anticipated.

  [1] https://docs.openstack.org/releasenotes/ceilometer/rocky.html
  [2] https://review.opendev.org/c/openstack/watcher/+/898791
  [3] https://review.opendev.org/c/openstack/watcher/+/934181
  [4] https://docs.openstack.org/watcher/2024.1/strategies/workload_balance.html
