[Bug 2091947] Re: [SRU] Watcher crashes on creation of multiple audits and gets stuck in PENDING

Fri May 23 16:27:26 UTC 2025

Hi Heitor! Thanks for the first pass review.

You're right that the debian/changelog entries are not consistently formatted. I was following James Page's guidance which he provided in this comment:
https://bugs.launchpad.net/ubuntu/+source/watcher/+bug/2088620/comments/16

"When the most recent changelog entry is UNRELEASED you should add you
changelog update as part of the existing entry rather than creating a
new one"

This is why, for Zed and Caracal, the debdiffs are labelled UNRELEASED and merged with the previous entry.
Current Zed changelog is UNRELEASED: https://git.launchpad.net/~ubuntu-openstack-dev/ubuntu/+source/watcher/tree/debian/changelog?h=stable/zed
Current Caracal changelog is UNRELEASED: https://git.launchpad.net/~ubuntu-openstack-dev/ubuntu/+source/watcher/tree/debian/changelog?h=stable/2024.1

So, I remain unclear on what to do for Zed and Caracal as the previous
entry is UNRELEASED. Should I leave the previous entry unchanged and as
UNRELEASED, then create a new entry with the correct release name
(jammy-zed, jammy-caracal), and increment the version without adding
another ubuntu1?

Admittedly, I see that for Bobcat and Antelope the releases are
jammy-bobcat and jammy-antelope, so I should have specified the release.
I have another in-flight SRU for Watcher that is currently in proposed
(https://bugs.launchpad.net/ubuntu/+source/watcher/+bug/2088620), which
would block this one. In that SRU my changelog entry was merged with the
existing jammy-antelope/jammy-bobcat entry with no change to the
version, e.g. 2:11.0.0-0ubuntu1.1~cloud0 for Bobcat.

The customer behind both SRUs has stopped responding, so I suggest that
we let the in-flight SRU pass through, then I'll regenerate these
debdiffs with your corrections on top of that. When the time comes, I
assume that for bobcat and antelope, since the release is properly
specified, I should approach versioning and merging of changelog entries
in the same way that was done with the SRU that's in proposed (keep the
same version, and add the changelog entry to the existing one)?

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2091947

Title:
  [SRU] Watcher crashes on creation of multiple audits and gets stuck in
  PENDING

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive dalmatian series:
  Fix Released
Status in Ubuntu Cloud Archive epoxy series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Incomplete
Status in Ubuntu Cloud Archive zed series:
  New
Status in watcher package in Ubuntu:
  Fix Released
Status in watcher source package in Focal:
  Confirmed
Status in watcher source package in Jammy:
  Incomplete
Status in watcher source package in Noble:
  Confirmed
Status in watcher source package in Oracular:
  Fix Released
Status in watcher source package in Plucky:
  Fix Released

Bug description:
  [ Impact ]

    * The watcher releases targeted by this SRU are experiencing a bug
  where you can only create one audit of type CONTINUOUS. Any
  subsequently created audits end up getting stuck in a pending state.
  The root cause of this error is the conversion of an improperly typed
  date which causes watcher to crash. The function converting the date
  format, utc_timestamp_to_datetime, expects the timestamp to be of type
  float but Watcher has been passing the date as a decimal object. The
  patch at [1] correctly typecasts to float before converting to a
  datetime object.

    * The commit landed upstream in 2024.2.
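
  The type handling described above can be sketched roughly as follows.
  This is a minimal illustration, not the upstream patch at [1]: the
  helper name to_utc_datetime and its None handling are assumptions made
  for this example.

```python
from datetime import datetime, timezone
from decimal import Decimal
from typing import Optional, Union

Timestamp = Union[int, float, Decimal]


def to_utc_datetime(value: Optional[Timestamp]) -> Optional[datetime]:
    """Hypothetical helper: turn a stored UTC timestamp into an aware datetime."""
    if value is None:
        # Nothing is scheduled; calling float(None) would raise TypeError.
        return None
    # Cast explicitly so a decimal.Decimal read from the database is accepted.
    return datetime.fromtimestamp(float(value), tz=timezone.utc)
```

  Returning None for a missing value is a design choice of this sketch;
  the real fix may handle that case differently.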

  [ Test Plan ]

    * Deploy OpenStack Yoga on Jammy with the watcher and gnocchi services

    * Create two watcher audits of CONTINUOUS type and monitor their status
      openstack optimize audit create --name test_audit_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
      openstack optimize audit create --name test_audit_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger
      openstack optimize audit list

    * Without the patch, the second audit will get stuck in state
  PENDING and systemctl status watcher-decision-engine.service reveals
  that a crash occurred. With the patch, both audits successfully enter a
  state of "ONGOING".
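
  The status check in the test plan could be scripted along these lines.
  This is only a sketch: it assumes the audit list was captured as JSON
  (for example with the client's -f json output formatter) and that the
  rows carry "Name" and "State" keys. Both key names are assumptions, and
  the helper name stuck_audits is hypothetical.

```python
import json
from typing import List


def stuck_audits(audit_list_json: str,
                 name_key: str = "Name",
                 state_key: str = "State") -> List[str]:
    """Return the names of audits whose state is PENDING.

    The key names are assumptions about the JSON output; adjust them to
    match what the client actually prints.
    """
    rows = json.loads(audit_list_json)
    return [row[name_key] for row in rows if row.get(state_key) == "PENDING"]
```

  An empty result after both audits have been created would correspond
  to the patched behaviour described in the test plan.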

  [ What can go wrong ]

    * This commit overrides the apscheduler's implementation of
  get_next_run_time, since the apscheduler's implementation obtains the
  decimal.Decimal object which crashes the engine. This should expand
  compatibility to include SQLAlchemy 2.0 but may otherwise have side
  effects. It shouldn't, since the function it overrides is what
  precipitates the issue, but it may affect legacy software (e.g. older
  SQLAlchemy).

  [1]
  https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c

  --------------------------------------
  Original Description:

  A customer is facing an issue where the watcher-decision-engine
  service crashes when creating an audit plan with the Audit type set to
  CONTINUOUS. Below are the steps to reproduce the issue:

  Environment Details:
  1. Deploy OpenStack Yoga on Jammy with Watcher and Gnocchi as Watcher's storage backend

  2. Create an audit
  openstack optimize audit create --name workload_stabilization_test_1 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger

  3. Check the audit state
  openstack optimize audit list
  Observe it says "CONTINUOUS ONGOING"

  4. Create a second audit
  openstack optimize audit create --name workload_stabilization_test_2 -s workload_stabilization -g workload_balancing --audit_type CONTINUOUS --interval 60 --auto-trigger

  5. Check the audit state
  openstack optimize audit list
  Observe the second audit is stuck in "CONTINUOUS PENDING"

  6. Check watcher's status and observe that it crashed with the following traceback
  systemctl status watcher-decision-engine.service

  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:     self.run()
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:   File "/usr/lib/python3.10/threading.py", line 953, in run
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:     self._target(*self._args, **self._kwargs)
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:   File "/usr/lib/python3/dist-packages/apscheduler/schedulers/blocking.py", line 32, in _main_loop
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:     wait_seconds = self._process_jobs()
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:   File "/usr/lib/python3/dist-packages/apscheduler/schedulers/base.py", line 1006, in _process_jobs
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:     jobstore_next_run_time = jobstore.get_next_run_time()
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:   File "/usr/lib/python3/dist-packages/apscheduler/jobstores/sqlalchemy.py", line 84, in get_next_run_time
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]:     return utc_timestamp_to_datetime(float(next_run_time))
  Nov 27 19:53:54 juju-2752e1-86-lxd-27 watcher-decision-engine[965896]: TypeError: float() argument must be a string or a real number, not 'NoneType'

  This was fixed upstream in 2024.2 at
  https://opendev.org/openstack/watcher/commit/d6f169197efc5b4f6c8a2e6bc38177b0641ca05c
  which properly addresses the type conversion.
