[Bug 2131043] Re: watcher decision engine fails when evaluating previously migrated instance
Marcin Wilk
2131043 at bugs.launchpad.net
Mon Jan 19 14:42:37 UTC 2026
I have run some tests on the Charmed OpenStack Caracal release. When
the 'watcher_cluster_data_model_collectors' periods are shorter than a
continuous audit's interval, subsequent live migrations of the same VM
work fine, and I wasn't able to reproduce the problem in that
scenario.
When testing with the 'watcher_cluster_data_model_collectors' periods
set to 3600s and an audit interval of 60s, subsequent migrations after
the first successful VM migration fail because the Watcher cluster
data model is out of sync: Watcher incorrectly picks a migration
target, which happens to be the node currently hosting the VM. Note
that my test environment has only three compute nodes.
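For clarity, the failing combination corresponds roughly to the
watcher.conf excerpt below (I am assuming the standard
'watcher_cluster_data_model_collectors.compute' option group here; the
storage and baremetal collector groups take the same 'period' option),
while the continuous audit re-runs every 60 seconds:

    [watcher_cluster_data_model_collectors.compute]
    # The compute cluster data model is only rebuilt once per hour, so
    # for up to an hour after a live migration Watcher may still
    # believe the VM is on its old host.
    period = 3600

With the audit interval at 60s, many audit runs can execute against
that stale model before the next refresh.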
I submitted a patch for the Watcher charm [1][2], lowering the default data model refresh interval from 60 to 30 minutes. I also added some extra comments about the implications of using a long refresh interval.
Regarding the upstream Watcher project, I proposed a minor documentation
update [3], and I am looking forward to the comments/reviews.
[1] https://review.opendev.org/c/openstack/charm-watcher/+/973822
[2] https://launchpad.net/bugs/2138626
[3] https://review.opendev.org/c/openstack/watcher/+/973839
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2131043
Title:
watcher decision engine fails when evaluating previously migrated
instance
Status in watcher:
New
Status in watcher package in Ubuntu:
Confirmed
Bug description:
jammy + caracal
openstack 2024.1 charms
Using the workload_balance strategy, an audit will fail when an
instance it previously evaluated has been migrated by an action plan
and the audit later goes back to evaluate that instance again. It no
longer has metrics for that instance, which causes the decision
engine to crash.
More specifically, the instance is no longer in the workload_cache,
which causes a KeyError. The logs when this issue happens are:
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base [None req-586ff844-b5be-4d3c-b7be-61f1947aebc7 - - - - - -] '287d4534-b817-42bb-ae65-567c446196ae': KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base Traceback (most recent call last):
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 145, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/continuous.py", line 85, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base .do_execute(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/audit/base.py", line 83, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.strategy_context.execute_strategy(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/base.py", line 43, in execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base solution = self.do_execute_strategy(audit, request_context)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/context/default.py", line 70, in do_execute_strategy
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base return selected_strategy.execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/base.py", line 266, in execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base self.do_execute(audit=audit)
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 303, in do_execute
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base instance_to_migrate = self.choose_instance_to_migrate(
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base File "/usr/lib/python3/dist-packages/watcher/decision_engine/strategy/strategies/workload_balance.py", line 162, in choose_instance_to_migrate
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base delta_workload - workload_cache[instance.uuid])
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base KeyError: '287d4534-b817-42bb-ae65-567c446196ae'
2025-11-06 19:30:47.191 3669746 ERROR watcher.decision_engine.audit.base
Basically, the data in Gnocchi is no longer relevant, or is missing,
once the instance is on another host.
The command used to create the audit template is:
`openstack optimize audittemplate create wlbalancetest
workload_balancing --strategy workload_balance`
The command to create the audit is:
`openstack optimize audit create -a wlbalancetest -t CONTINUOUS
--auto-trigger --interval 60 -p threshold=60.0 -p
metrics=instance_cpu_usage`
With several instances to evaluate, it will eventually put the audit
in a failed state.
Attached is a patch that allows the decision engine to continue when
it finds stale or missing data. Testing with this patch in the same
environment has resulted in over 150 successful action plans with no
decision engine failures.
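As a rough sketch of the idea (this is not the attached patch; the
helper name pick_instance is made up for illustration, and the real
logic lives in choose_instance_to_migrate() in workload_balance.py),
the guard amounts to skipping instances whose metrics are missing
from workload_cache instead of letting the KeyError abort the audit:

    import logging

    LOG = logging.getLogger(__name__)

    def pick_instance(instances, workload_cache, delta_workload):
        """Pick the instance whose cached workload is closest to
        delta_workload, skipping stale or missing metrics."""
        best = None
        best_diff = None
        for instance in instances:
            workload = workload_cache.get(instance.uuid)
            if workload is None:
                # Metrics are gone (e.g. the instance was just
                # live-migrated); skip it instead of crashing the
                # whole decision engine run.
                LOG.warning("No cached workload for instance %s, "
                            "skipping", instance.uuid)
                continue
            diff = abs(delta_workload - workload)
            if best_diff is None or diff < best_diff:
                best, best_diff = instance, diff
        return best

Whatever the attached patch does exactly, the key point is that a
missing workload_cache entry is treated as "skip this instance"
rather than as a fatal error.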
To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2131043/+subscriptions