[Bug 2086710] Re: watcher's use of apscheduler is incompatible with python 3.12 and eventlet

Tue May 26 17:03:56 UTC 2026

** Changed in: watcher (Ubuntu Noble)
     Assignee: (unassigned) => Myles Penner (mylesjp)

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2086710

Title:
  watcher's use of apscheduler is incompatible with python 3.12 and
  eventlet

Status in watcher:
  Fix Released
Status in watcher package in Ubuntu:
  Fix Released
Status in watcher source package in Noble:
  Triaged
Status in watcher source package in Oracular:
  Won't Fix
Status in watcher source package in Plucky:
  Fix Released

Bug description:
  SRU Template

  [Impact]
  Watcher's decision-engine accumulates idle SQLAlchemy connections over time
  and eventually exhausts its connection pool (size 2 + 50 overflow), causing
  the service to report FAILED in `openstack optimize service list`. In a
  production Sunbeam 2024.1 deployment this typically takes multiple days to
  manifest. Once the pool is exhausted, all background jobs in the decision
  engine fail with:

    sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
    reached, connection timed out, timeout 30.00

  The only workarounds available to operators today are killing the sleeping
  MySQL connections out from under watcher and restarting the watcher pods.
  Until an operator intervenes, Watcher cannot reconcile audits, schedule
  action plans, or run any continuous audit workload.

  The update contains the following package updates:

    * watcher 2:12.0.0-0ubuntu1.3 (noble / cloud-archive:caracal)

  [Test Case]
  The following SRU process was followed:
  https://documentation.ubuntu.com/sru/en/latest/reference/exception-OpenStack-Updates

  In order to avoid regression of existing consumers, the OpenStack team
  will run their continuous integration tests against the packages that
  are in -proposed. A successful run of all available tests will be
  required before the proposed packages can be let into -updates.

  The OpenStack team will be in charge of attaching the output summary
  of the executed tests. The OpenStack team members will not mark
  ‘verification-done’ until this has happened.

  ------------------------------------------------------------------------
  Check 1 -- No QueuePool TimeoutError in decision-engine logs
  ------------------------------------------------------------------------

  Original symptom from the bug report:

      [watcher-decision-engine] ERROR apscheduler.executors.default
      sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
      reached, connection timed out, timeout 30.00

  Run:

      sudo k8s kubectl logs -n openstack watcher-0 -c watcher-decision-engine \
          --since=60m | grep -iEc "queuepool|TimeoutError"

  Pass criterion: 0 (zero matches in the last 60 minutes of logs).
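
  For logs that have already been saved to a file or captured as text, the
  same count can be approximated in Python. This is a minimal sketch that
  mirrors the grep pattern above; the function name and the sample text are
  illustrative and not part of the SRU tooling.

```python
import re

# Same pattern as the grep above: case-insensitive alternation.
PATTERN = re.compile(r"queuepool|TimeoutError", re.IGNORECASE)


def count_matching_lines(log_text: str) -> int:
    """Count lines that contain the pattern, like `grep -iEc`."""
    return sum(1 for line in log_text.splitlines() if PATTERN.search(line))


# Sample text taken from the original symptom quoted in this check.
sample = (
    "[watcher-decision-engine] ERROR apscheduler.executors.default\n"
    "sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50\n"
    "reached, connection timed out, timeout 30.00\n"
)
print(count_matching_lines(sample))  # 1 (only the middle line matches)
```

  Like grep -c, this counts matching lines rather than individual matches,
  so a line that contains both terms is counted once.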

  ------------------------------------------------------------------------
  Check 2 -- Sleeping MySQL connections are not accumulating
  ------------------------------------------------------------------------

  First discover the watcher DB credentials (auto-generated, unique per
  deployment):

      sudo k8s kubectl exec -n openstack watcher-0 -c watcher-decision-engine -- \
          grep ^connection /etc/watcher/watcher.conf

  The output line has the form:

      connection = mysql+pymysql://<USER>:<PASS>@watcher-mysql-router-service...:6446/watcher_api
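
  If splitting the URL by hand is error-prone, the standard library can do
  it. The sketch below assumes the line has the form shown above and that
  any special characters in the password are percent-encoded; the function
  name, host, and credentials in the example are placeholders.

```python
from urllib.parse import unquote, urlsplit


def parse_connection_line(line: str) -> dict:
    """Split a 'connection = <URL>' line into its parts."""
    # Everything after the first '=' is the database URL.
    _, _, url = line.partition("=")
    parts = urlsplit(url.strip())
    return {
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# Placeholder values only; real credentials are unique per deployment.
example = "connection = mysql+pymysql://someuser:s3cret@db.example:6446/watcher_api"
info = parse_connection_line(example)
print(info["user"], info["port"], info["database"])
```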

  Extract <USER> and <PASS> from the URL. Then run the same query as
  the bug report:

      sudo k8s kubectl exec -n openstack watcher-mysql-router-0 \
          -c mysql-router -- \
          mysql -u <USER> -p<PASS> \
          -h watcher-mysql-router-service.openstack.svc.cluster.local \
          -P 6446 watcher_api \
          -e "SELECT count(*), state FROM information_schema.processlist
              GROUP BY state;"

  Run the query twice, ideally allowing watcher to run overnight between
  samples.

  Pass criterion: the count for the empty-state row (sleeping connections)
  stays bounded (under ~15) across both samples and does not trend upward.

  On the broken package, this count grows by ~4 per minute
  and exceeds the pool ceiling of 52 within ~15 minutes after a pod
  restart, at which point Check 3 fails.
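
  The pass criterion and the growth estimate above can be written down as
  a small calculation. This is only a sketch: the rows are assumed to be
  (count, state) pairs copied from the query output, the sample numbers
  are made up, and TOLERANCE is a hypothetical allowance for normal
  fluctuation that does not come from the bug report.

```python
SLEEPING_LIMIT = 15    # "under ~15" from the pass criterion
POOL_CEILING = 52      # pool size 2 + 50 overflow
LEAK_PER_MINUTE = 4    # approximate growth on the broken package
TOLERANCE = 3          # hypothetical allowance for normal fluctuation


def sleeping_count(rows):
    """Return the count reported for the empty-state row."""
    return sum(count for count, state in rows if state == "")


def check_passes(first_rows, second_rows):
    """True if both samples are bounded and the count is not growing."""
    first = sleeping_count(first_rows)
    second = sleeping_count(second_rows)
    bounded = first < SLEEPING_LIMIT and second < SLEEPING_LIMIT
    not_growing = second - first <= TOLERANCE
    return bounded and not_growing


def minutes_to_ceiling(start=0):
    """Minutes until the pool ceiling is reached at the leak rate."""
    return (POOL_CEILING - start) / LEAK_PER_MINUTE


# Made-up samples for illustration.
evening = [(6, ""), (1, "executing")]
morning = [(7, ""), (1, "executing")]
print(check_passes(evening, morning))  # True
print(minutes_to_ceiling())            # 13.0
```

  At roughly 4 leaked connections per minute, a freshly restarted pod
  reaches the ceiling of 52 after about 13 minutes, which is consistent
  with the ~15 minute figure above.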

  ------------------------------------------------------------------------
  Check 3 -- watcher-decision-engine reports ACTIVE
  ------------------------------------------------------------------------

  Run:

      openstack optimize service list

  Pass criterion: every watcher-decision-engine row shows Status =
  ACTIVE.
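
  When the check is scripted, the same criterion can be applied to JSON
  output. The sketch below assumes the list was produced with the
  openstack client's -f json formatter and that the rows expose "Name"
  and "Status" keys; the key names and the sample hosts are assumptions,
  so adjust them to the actual output.

```python
import json


def decision_engines_active(service_list_json: str) -> bool:
    """True if at least one decision-engine row exists and all are ACTIVE."""
    rows = json.loads(service_list_json)
    # "Name" and "Status" are assumed column names.
    engines = [r for r in rows if r.get("Name") == "watcher-decision-engine"]
    return bool(engines) and all(r.get("Status") == "ACTIVE" for r in engines)


# Hypothetical sample output for illustration.
sample = json.dumps([
    {"ID": 1, "Name": "watcher-applier", "Host": "watcher-0", "Status": "ACTIVE"},
    {"ID": 2, "Name": "watcher-decision-engine", "Host": "watcher-0", "Status": "ACTIVE"},
])
print(decision_engines_active(sample))  # True
```

  Returning False when no decision-engine row is present is a deliberate
  choice here: an empty list should not be read as a pass.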

  [Regression Potential]
  In order to mitigate the regression potential, the results of the
  aforementioned OpenStack CI tests are attached to this bug.

  The bulk of the change is further-database-refactoring.patch, which
  rewrites the SQLAlchemy session lifecycle in watcher/db/sqlalchemy/api.py
  to use the enginefacade reader/writer context managers. It is a direct
  cherry-pick from upstream stable/2025.1 and has been in upstream watcher
  since 14.0.0.0rc1 (February 2025) without revert. The prerequisite patch
  (replace-deprecated-legacy-enginefacade.patch) is 8 lines and only swaps
  a deprecated facade-construction call site.

  Two regression modes are possible, both low risk:
   * Session lifetime is now scoped to the context manager. Out-of-tree
     code that uses ORM objects after the helper returns may raise
     DetachedInstanceError. All in-tree callers were migrated as part
     of the patch.

   * The model_query() helper has been removed. Out-of-tree code that
     imports it will fail at import time.

  [Discussion]
  The leak originates in watcher/db/sqlalchemy/api.py: model_query()
  calls get_session() and returns the query without ever closing the
  session. Every Audit.list() call therefore leaks one connection, and
  ContinuousAuditHandler.launch_audits_periodically calls it twice per
  tick. Upstream addressed this by migrating the DB layer to oslo_db's
  enginefacade reader/writer context-manager API (Change-Id
  Ib5e9aa288232cc1b766bbf2a8ce2113d5a8e2f7d, upstream LP #2067815),
  which auto-closes sessions on context exit. That fix shipped in
  14.0.0.0rc1 (Epoxy / 2025.1) and is cherry-picked
  here, together with its prerequisite (Change-Id
  I5570698262617eae3f48cf29aacf2e23ad541e5f, "Replace deprecated
  LegacyEngineFacade").
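
  The difference between the two session patterns can be illustrated with
  a toy pool. This is not watcher or oslo.db code: ToyPool, leaky_query,
  and reader are hypothetical stand-ins, and the built-in TimeoutError
  stands in for sqlalchemy.exc.TimeoutError. Only the pool limits (2 + 50)
  and the two leaked calls per tick come from the description above.

```python
from contextlib import contextmanager


class ToyPool:
    """Stand-in for a connection pool: 2 pooled + 50 overflow."""

    def __init__(self, size=2, max_overflow=50):
        self.limit = size + max_overflow
        self.checked_out = 0

    def acquire(self):
        if self.checked_out >= self.limit:
            raise TimeoutError("pool limit reached")
        self.checked_out += 1

    def release(self):
        self.checked_out -= 1


def leaky_query(pool):
    """Old pattern: take a session for the query and never close it."""
    pool.acquire()
    return "query"


@contextmanager
def reader(pool):
    """New pattern: the session is released when the block exits."""
    pool.acquire()
    try:
        yield "session"
    finally:
        pool.release()


def ticks_until_exhausted(pool, calls_per_tick=2, max_ticks=1000):
    """Count the periodic ticks that complete before the pool runs dry."""
    for tick in range(max_ticks):
        try:
            for _ in range(calls_per_tick):
                leaky_query(pool)
        except TimeoutError:
            return tick
    return None


print(ticks_until_exhausted(ToyPool()))  # 26

scoped_pool = ToyPool()
for _ in range(1000):
    with reader(scoped_pool):
        pass
print(scoped_pool.checked_out)  # 0
```

  With two leaked sessions per tick, the toy pool of 52 is exhausted after
  26 ticks, while the context-manager pattern returns every session and
  never approaches the limit.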

  --------------------------------------------------------------------------------
  Original Bug Report Content Below
  --------------------------------------------------------------------------------

  In the Newton release, a background job scheduler was added to the
  Decision Engine.

  https://github.com/openstack/watcher/commit/06c6c4691b103bf0b3fd3304a1a45fb22aedad50

  To facilitate this, the apscheduler library was introduced as a
  dependency of watcher. apscheduler has a lot of capabilities but does
  not officially support eventlet.

  Since its introduction to watcher it has mostly worked, partly by
  accident. Over the years, as oslo, apscheduler and eventlet have evolved
  and adapted to newer Python releases, watcher has continued to use
  apscheduler even though that is not technically supported.

  With the move to Python 3.12 it became apparent that the background jobs
  executed on the apscheduler BackgroundScheduler instances were accessing
  shared global state from a non-monkey-patched native thread.

  That results in greenthreads sometimes calling into objects that are
  using un-monkey-patched code.

  For example, oslo.db uses time.sleep to yield execution. When that
  oslo.db function is first imported from a non-patched thread and is
  later invoked in the main thread, it will block.

  This can be seen in the exception "RuntimeError: do not call blocking
  functions from the mainloop" here:
  https://paste.opendev.org/show/bGPgfURx1cZYOsgmtDyw/

  This has been reproduced in CI as part of moving the CI jobs to Ubuntu
  24.04 and Python 3.12.

  https://review.opendev.org/c/openstack/watcher/+/932963/comments/f54005d7_b0f831bb

  To address this issue we need to ensure that the background thread
  used to schedule background tasks is properly monkey patched.
