[Bug 2131043] Re: watcher decision engine fails when evaluating previously migrated instance
Marcin Wilk
2131043 at bugs.launchpad.net
Mon Jan 19 14:42:37 UTC 2026
I have run some tests on the Charmed OpenStack Caracal release. When
the 'watcher_cluster_data_model_collectors' periods are shorter than a
continuous audit's interval, subsequent live migrations of the same VM
work fine, and I wasn't able to reproduce the problem in that
scenario.
When testing with the 'watcher_cluster_data_model_collectors' periods
set to 3600s and an audit interval of 60s, subsequent migrations after
the first successful VM migration fail because the Watcher cluster
data model is out of sync: Watcher incorrectly picks a migration
target, which happens to be the node currently hosting the VM. Note
that my test environment has only three compute nodes.
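For clarity, the failing combination corresponds roughly to the
watcher.conf excerpt below (I am assuming the standard
'watcher_cluster_data_model_collectors.compute' option group here; the
storage and baremetal collector groups take the same 'period' option),
while the continuous audit re-runs every 60 seconds:

    [watcher_cluster_data_model_collectors.compute]
    # The compute cluster data model is only rebuilt once per hour, so
    # for up to an hour after a live migration Watcher may still
    # believe the VM is on its old host.
    period = 3600

With the audit interval at 60s, many audit runs can execute against
that stale model before the next refresh.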
I submitted a patch for the Watcher charm [1][2], lowering the default data model refresh interval from 60 to 30 minutes. I also added some extra comments about the implications of using a long refresh interval.
Regarding the upstream Watcher project, I proposed a minor documentation
update [3], and I am looking forward to the comments/reviews.
[1] https://review.opendev.org/c/openstack/charm-watcher/+/973822
[2] https://launchpad.net/bugs/2138626
[3] https://review.opendev.org/c/openstack/watcher/+/973839
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2131043
Title:
watcher decision engine fails when evaluating previously migrated
instance
Status in watcher:
New
Status in watcher package in Ubuntu:
Confirmed
Bug description:
jammy + caracal
openstack 2024.1 charms
Using the workload_balance strategy, an audit will fail when an
instance it previously evaluated has been migrated by an action plan
and the audit later goes back to evaluate that instance again. It no
longer has metrics for that instance, which causes the decision
engine to crash.
More specifically, the instance is no longer in the workload_cache,
which causes a KeyError. The logs when this issue happens are:
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base [None req-586ff844-b5be-4d3c-b7be-61f1947aebc7 - - - - - -] '287d4534-b817-42bb-ae65-567c446196ae': KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base Traceback (most recent call last):
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 145, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/continuous.py", line 85, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base .do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 83, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.strategy_context.execute_strategy(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/base.py", line 43, in execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute_strategy(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/default.py", line 70, in do_execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base return selected_strategy.execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/base.py", line 266, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base self.do_execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 303, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base instance_to_migrate = self.choose_instance_to_migrate(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 162, in choose_instance_to_migrate
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base delta_workload - workload_cache[instance.uuid])
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base
Basically, the data in Gnocchi is no longer relevant, or is missing,
once the instance is on another host.
The command used to create the audit template is:
`openstack optimize audittemplate create wlbalancetest
workload_balancing --strategy workload_balance`
The command to create the audit is:
`openstack optimize audit create -a wlbalancetest -t CONTINUOUS
--auto-trigger --interval 60 -p threshold=60.0 -p
metrics=instance_cpu_usage`
With several instances to evaluate, it will eventually put the audit
in a failed state.
Attached is a patch that allows the decision engine to continue when
it finds stale or missing data. Testing with this patch in the same
environment has resulted in over 150 successful action plans with no
decision engine failures.
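As a rough sketch of the idea (this is not the attached patch; the
helper name pick_instance is made up for illustration, and the real
logic lives in choose_instance_to_migrate() in workload_balance.py),
the guard amounts to skipping instances whose metrics are missing
from workload_cache instead of letting the KeyError abort the audit:

    import logging

    LOG = logging.getLogger(__name__)

    def pick_instance(instances, workload_cache, delta_workload):
        """Pick the instance whose cached workload is closest to
        delta_workload, skipping stale or missing metrics."""
        best = None
        best_diff = None
        for instance in instances:
            workload = workload_cache.get(instance.uuid)
            if workload is None:
                # Metrics are gone (e.g. the instance was just
                # live-migrated); skip it instead of crashing the
                # whole decision engine run.
                LOG.warning("No cached workload for instance %s, "
                            "skipping", instance.uuid)
                continue
            diff = abs(delta_workload - workload)
            if best_diff is None or diff < best_diff:
                best, best_diff = instance, diff
        return best

Whatever the attached patch does exactly, the key point is that a
missing workload_cache entry is treated as "skip this instance"
rather than as a fatal error.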
To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2131043/+subscriptions