[Bug 2131043] Re: watcher decision engine fails when evaluating previously migrated instance
Billy Olsen
2131043 at bugs.launchpad.net
Fri Nov 21 04:39:29 UTC 2025
** Also affects: watcher (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2131043
Title:
watcher decision engine fails when evaluating previously migrated
instance
Status in watcher:
New
Status in watcher package in Ubuntu:
New
Bug description:
jammy + caracal
openstack 2024.1 charms
Using the workload_balance strategy, an audit fails when an instance
that was previously evaluated has since been migrated by an action
plan and the strategy goes back to evaluate it. The strategy no longer
has metrics for that instance, and this causes the decision engine to
crash.
More specifically, the instance is no longer in the workload_cache, so
the lookup raises a KeyError. Logs when this issue happens are:
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base [None req-586ff844-b5be-4d3c-b7be-61f1947aebc7 - - - - - -] '287d4534-b817-42bb-ae65-567c446196ae': KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base Traceback (most recent call last):
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 145, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/continuous.py", line 85, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base .do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 83, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.strategy_context.execute_strategy(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/base.py", line 43, in execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute_strategy(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/default.py", line 70, in do_execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base return selected_strategy.execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/base.py", line 266, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base self.do_execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 303, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base instance_to_migrate = self.choose_instance_to_migrate(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 162, in choose_instance_to_migrate
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base delta_workload - workload_cache[instance.uuid])
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base
Basically, the data in gnocchi is no longer relevant or is missing once
the instance is on another host.
The command used to create the audit template is:
`openstack optimize audittemplate create wlbalancetest
workload_balancing --strategy workload_balance`
The command to create the audit is:
`openstack optimize audit create -a wlbalancetest -t CONTINUOUS
--auto-trigger --interval 60 -p threshold=60.0 -p
metrics=instance_cpu_usage`
With several instances to evaluate, it will eventually put the audit
in a failed state.
Attached is a patch that allows the decision engine to continue when
it finds the stale or missing data. Testing with this patch in the
same environment has resulted in over 150 successful action plans with
no decision engine failures.
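
For illustration only, here is a minimal sketch of the kind of guard
described above. It is not the attached patch and not the actual
Watcher code: the function below is a simplified, hypothetical
stand-in for choose_instance_to_migrate, and the Instance class,
argument names, and selection rule are assumptions. It only shows how
an instance that is absent from workload_cache can be skipped instead
of raising KeyError.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger(__name__)


@dataclass
class Instance:
    # Hypothetical stand-in for the instance element in the cluster model.
    uuid: str


def choose_instance_to_migrate(instances, workload_cache, delta_workload):
    """Return the instance whose workload is closest to delta_workload.

    Instances with no entry in workload_cache (for example because they
    were migrated and their metrics are stale or missing) are skipped
    rather than aborting the whole evaluation.
    """
    best = None
    min_delta = None
    for instance in instances:
        workload = workload_cache.get(instance.uuid)
        if workload is None:
            # Assumed behaviour: log and move on instead of raising KeyError.
            LOG.warning("No cached workload for instance %s, skipping",
                        instance.uuid)
            continue
        current_delta = delta_workload - workload
        if current_delta >= 0 and (min_delta is None
                                   or current_delta < min_delta):
            best, min_delta = instance, current_delta
    return best
```

With this shape, an audit that encounters a previously migrated
instance simply evaluates the remaining candidates, and returns None
if no candidate has usable metrics.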
To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2131043/+subscriptions