[Bug 2131043] Re: watcher decision engine fails when evaluating previously migrated instance

Alfredo Moralejo 2131043 at bugs.launchpad.net
Thu Dec 4 12:02:28 UTC 2025


Watcher maintains an internal model of your cluster topology, including
instances and compute nodes.

When notifications are not enabled, that model is only updated
periodically, based on your configuration:

[watcher_cluster_data_model_collectors.baremetal]
period = 3600

[watcher_cluster_data_model_collectors.compute]
period = 3600

[watcher_cluster_data_model_collectors.storage]
period = 3600

In your case the model is being updated every hour, while you are
running your audit every minute, so you will very easily hit issues
because the model is out of date. Even if you do not hit this
particular error, you may get unexpected results.

One improvement you can make is to decrease the period to something
close to your audit interval. You could, for example, use

--interval 120 and set:

[watcher_cluster_data_model_collectors.baremetal]
period = 120

[watcher_cluster_data_model_collectors.compute]
period = 120

[watcher_cluster_data_model_collectors.storage]
period = 120
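
With that in place, the audit from the bug description could be
recreated with a matching interval. This is only an illustration; all
parameters other than --interval are taken from your original command:

openstack optimize audit create -a wlbalancetest -t CONTINUOUS \
  --auto-trigger --interval 120 -p threshold=60.0 \
  -p metrics=instance_cpu_usage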


As an alternative to that schedule-based cluster model sync, Watcher
can also synchronize the model based on Nova notification events: when
a VM is created, deleted, etc., Nova sends a notification message to a
queue, and Watcher consumes it and updates the model. For that you need
to enable notifications on both the Nova and the Watcher side. In
Watcher you need to set:

[oslo_messaging_notifications]
driver = messagingv2
transport_url = <url for notifications messaging broker>

I'd also recommend setting, in the [DEFAULT] section:

notification_level =

Note that you also need to configure oslo_messaging_notifications on
the Nova side.
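
As a rough sketch, the Nova side would need something along these
lines in nova.conf (the values here are illustrative, not taken from
your deployment). As far as I know, Watcher consumes Nova's versioned
notifications, so notification_format must include them:

[oslo_messaging_notifications]
driver = messagingv2
transport_url = <url for notifications messaging broker>

[notifications]
# 'both' keeps emitting unversioned notifications for any other
# consumers; 'versioned' alone is enough for Watcher.
notification_format = both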

Usually, the recommendation is to use a separate messaging broker
instance, different from the one used for RPC between services.
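
For example (hostname and credentials below are purely made up), both
Nova and Watcher could point their notification traffic at a dedicated
broker, while RPC keeps using the transport_url from [DEFAULT]:

[oslo_messaging_notifications]
transport_url = rabbit://notifications:secret@rabbit-notify.example.com:5672/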

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2131043

Title:
  watcher decision engine fails when evaluating previously migrated
  instance

Status in watcher:
  New
Status in watcher package in Ubuntu:
  New

Bug description:
  jammy + caracal
  openstack 2024.1 charms

  Using the workload_balance strategy, an audit will fail when an
  instance previously evaluated has been migrated due to an action plan
  and the audit goes back to evaluate that instance. There are no longer
  metrics for that instance, and that causes the decision engine to
  crash.

  More specifically, the machine is no longer in the workload_cache and
  so it causes a key error.  Logs when this issue happens are:

  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base [None req-586ff844-b5be-4d3c-b7be-61f1947aebc7 - - - - - -] '287d4534-b817-42bb-ae65-567c446196ae': KeyError: '287d4534-b817-42bb-ae65-567c446196ae'

  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base Traceback (most recent call last):

  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 145, in execute
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     solution = self.do_execute(audit, request_context)
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/continuous.py", line 85, in do_execute
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     .do_execute(audit, request_context)
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 83, in do_execute
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     solution = self.strategy_context.execute_strategy(
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/base.py", line 43, in execute_strategy
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     solution = self.do_execute_strategy(audit, request_context)
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/default.py", line 70, in do_execute_strategy
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     return selected_strategy.execute(audit=audit)
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/base.py", line 266, in execute
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     self.do_execute(audit=audit)
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 303, in do_execute
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     instance_to_migrate = self.choose_instance_to_migrate(
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base   File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 162, in choose_instance_to_migrate
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base     delta_workload - workload_cache[instance.uuid])
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
  2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base

  Basically, the data in Gnocchi is no longer relevant, or is missing,
  once the instance is on another host.

  The command used to create the audit template is:

  `openstack optimize audittemplate create wlbalancetest
  workload_balancing --strategy workload_balance`

  The command to create the audit is:

  `openstack optimize audit create -a wlbalancetest -t CONTINUOUS
  --auto-trigger --interval 60 -p threshold=60.0 -p
  metrics=instance_cpu_usage`

  With several instances to evaluate, it will eventually put the audit
  in a failed state.

  Attached is a patch that allows the decision engine to continue when
  it finds stale or missing data. Testing with this patch in the same
  environment has resulted in over 150 successful action plans with no
  decision engine failures.

To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2131043/+subscriptions




More information about the Ubuntu-openstack-bugs mailing list