[Bug 2131043] Re: watcher decision engine fails when evaluating previously migrated instance
Alfredo Moralejo
2131043 at bugs.launchpad.net
Thu Dec 4 12:02:28 UTC 2025
Watcher contains an internal model of your cluster topology, including
instances and compute nodes.
When no notifications are enabled, that model is updated periodically
based on your configuration:
[watcher_cluster_data_model_collectors.baremetal]
period = 3600
[watcher_cluster_data_model_collectors.compute]
period = 3600
[watcher_cluster_data_model_collectors.storage]
period = 3600
In your case the model is being refreshed every hour, while you are running your
audit every minute, so you will very easily hit issues because the model
is out of date. Even when you don't get that error, you may get
unexpected results.
One improvement you can make is to decrease the period to something close
to your audit interval. For example, use
--interval 120 and set:
[watcher_cluster_data_model_collectors.baremetal]
period = 120
[watcher_cluster_data_model_collectors.compute]
period = 120
[watcher_cluster_data_model_collectors.storage]
period = 120
As an alternative to that schedule-based cluster model sync, watcher can synchronize the model based on nova notification events: when a VM is created, deleted, etc., nova sends a notification message to a queue, and watcher consumes it and updates the model. For that you need to enable notifications on both the nova and watcher sides. In watcher you need to set:
[oslo_messaging_notifications]
driver = messagingv2
transport_url = <url for notifications messaging broker>
I'd also recommend setting, in the [DEFAULT] section:
notification_level =
Note you also need to configure oslo_messaging_notifications in the nova
service.
The usual recommendation is to use a separate messaging broker instance,
distinct from the one used for RPC between services.
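As a sketch of what the nova side could look like (the broker URL is a placeholder, and the versioned notification format is an assumption to verify against your deployment and watcher version):

```ini
# nova.conf -- sketch, not a definitive configuration
[oslo_messaging_notifications]
driver = messagingv2
transport_url = <url for notifications messaging broker>

[notifications]
# watcher consumes nova's versioned notifications
notification_format = versioned
```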
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2131043
Title:
watcher decision engine fails when evaluating previously migrated
instance
Status in watcher:
New
Status in watcher package in Ubuntu:
New
Bug description:
jammy + caracal
openstack 2024.1 charms
Using the workload_balance strategy, an audit will fail when an instance
previously evaluated has been migrated by an action plan and the
strategy goes back to evaluate that instance: there are no longer
metrics for it, and the decision engine crashes.
More specifically, the instance is no longer in the workload_cache,
which causes a KeyError. Logs from when this issue happens:
2025-11-06 19:30:47.191 3669746 ERROR
watcher.decision_engine.audit.base [None
req-586ff844-b5be-4d3c-b7be-61f1947aebc7 - - - - - -]
'287d4534-b817-42bb-ae65-567c446196ae': KeyError: '287d4534-b817-42bb-
ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR
watcher.decision_engine.audit.base Traceback (most recent call last):
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 145, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/continuous.py", line 85, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base .do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 83, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.strategy_context.execute_strategy(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/base.py", line 43, in execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute_strategy(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/default.py", line 70, in do_execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base return selected_strategy.execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/base.py", line 266, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base self.do_execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 303, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base instance_to_migrate = self.choose_instance_to_migrate(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 162, in choose_instance_to_migrate
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base delta_workload - workload_cache[instance.uuid])
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base
Basically, the data in gnocchi is no longer relevant, or is missing,
once the instance is on another host.
The command used to create the audit template is:
`openstack optimize audittemplate create wlbalancetest
workload_balancing --strategy workload_balance`
The command used to create the audit is:
`openstack optimize audit create -a wlbalancetest -t CONTINUOUS
--auto-trigger --interval 60 -p threshold=60.0 -p
metrics=instance_cpu_usage`
With several instances to evaluate, it will eventually put the audit
in a failed state.
Attached is a patch that allows the decision engine to continue when
it finds stale or missing data. Testing with this patch in the
same environment has resulted in over 150 successful action plans with
no decision engine failures.
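The attached patch isn't reproduced in this thread, but the idea it describes (continue past instances with no cached metrics instead of raising KeyError) can be sketched as follows. The function name and arguments echo choose_instance_to_migrate in workload_balance.py, but this is a simplified illustration, not the actual patch:

```python
def choose_instance_to_migrate(instances, workload_cache, delta_workload):
    """Pick the instance whose cached workload is closest to delta_workload,
    skipping instances with no cached metrics (e.g. recently migrated)."""
    best = None
    best_gap = None
    for uuid in instances:
        if uuid not in workload_cache:
            # Instance was migrated since the model was built; no metrics
            # yet. Skip it instead of letting workload_cache[uuid] raise.
            continue
        gap = abs(delta_workload - workload_cache[uuid])
        if best_gap is None or gap < best_gap:
            best, best_gap = uuid, gap
    return best  # None if no instance had usable metrics

cache = {"vm-1": 30.0, "vm-2": 55.0}
# "vm-3" has been migrated and has no metrics; it is skipped, not fatal.
print(choose_instance_to_migrate(["vm-1", "vm-2", "vm-3"], cache, 50.0))
```

The key design point is that a missing cache entry is treated as "no candidate" rather than an error, which lets the audit complete with the instances it can still evaluate.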
To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2131043/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list