[Bug 2086710] Re: watcher's use of apscheduler is incompatible with python 3.12 and eventlet
Myles Penner
2086710 at bugs.launchpad.net
Tue May 26 14:32:29 UTC 2026
** Patch added: "watcher_lp2086710.debdiff"
https://bugs.launchpad.net/ubuntu/+source/watcher/+bug/2086710/+attachment/5973699/+files/watcher_lp2086710.debdiff
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to watcher in Ubuntu.
https://bugs.launchpad.net/bugs/2086710
Title:
watcher's use of apscheduler is incompatible with python 3.12 and
eventlet
Status in watcher:
Fix Released
Status in watcher package in Ubuntu:
Fix Released
Status in watcher source package in Noble:
Triaged
Status in watcher source package in Oracular:
Won't Fix
Status in watcher source package in Plucky:
Fix Released
Bug description:
SRU Template
[Impact]
Watcher's decision-engine accumulates idle SQLAlchemy connections over time
and eventually exhausts its connection pool (size 2 + 50 overflow), causing
the service to report FAILED in `openstack optimize service list`. In a
production Sunbeam 2024.1 deployment this typically takes multiple days to
manifest. Once the pool is exhausted, all background jobs in the decision
engine fail with:
sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
reached, connection timed out, timeout 30.00
The only workarounds available to operators today are killing the sleeping
MySQL connections out from under watcher and restarting the watcher pods.
Once this happens, Watcher cannot reconcile audits, schedule action plans,
or run any continuous audit workload until manual intervention.
The update contains the following package updates:
* watcher 2:12.0.0-0ubuntu1.3 (noble / cloud-archive:caracal)
[Test Case]
The following SRU process was followed:
https://documentation.ubuntu.com/sru/en/latest/reference/exception-OpenStack-Updates
In order to avoid regression of existing consumers, the OpenStack team
will run their continuous integration test against the packages that
are in -proposed. A successful run of all available tests will be
required before the proposed packages can be let into -updates.
The OpenStack team will be in charge of attaching the output summary
of the executed tests. The OpenStack team members will not mark
‘verification-done’ until this has happened.
------------------------------------------------------------------------
Check 1 -- No QueuePool TimeoutError in decision-engine logs
------------------------------------------------------------------------
Original symptom from the bug report:
[watcher-decision-engine] ERROR apscheduler.executors.default
sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
reached, connection timed out, timeout 30.00
Run:
sudo k8s kubectl logs -n openstack watcher-0 -c watcher-decision-engine \
--since=60m | grep -iEc "queuepool|TimeoutError"
Pass criterion: 0 (zero matches in the last 60 minutes of logs).
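For anyone post-processing saved logs rather than piping kubectl into grep, the same count can be sketched in Python. This is a minimal illustration, not part of the SRU tooling; the function name count_pool_errors is an assumption made here. Like grep -c, it counts matching lines, not individual matches.

```python
import re

# Same alternation as the grep -iE pattern used in Check 1.
_POOL_ERROR = re.compile(r"queuepool|TimeoutError", re.IGNORECASE)


def count_pool_errors(log_text):
    """Return the number of log lines matching the pattern (like grep -c)."""
    return sum(1 for line in log_text.splitlines() if _POOL_ERROR.search(line))
```

A return value of 0 for the last 60 minutes of decision-engine logs corresponds to the pass criterion above.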
------------------------------------------------------------------------
Check 2 -- Sleeping MySQL connections are not accumulating
------------------------------------------------------------------------
First discover the watcher DB credentials (auto-generated, unique per
deployment):
sudo k8s kubectl exec -n openstack watcher-0 -c watcher-decision-engine -- \
grep ^connection /etc/watcher/watcher.conf
The output line has the form:
connection = mysql+pymysql://<USER>:<PASS>@watcher-mysql-router-service...:6446/watcher_api
Extract <USER> and <PASS> from the URL. Then run the same query as
the bug report:
sudo k8s kubectl exec -n openstack watcher-mysql-router-0 \
-c mysql-router -- \
mysql -u <USER> -p<PASS> \
-h watcher-mysql-router-service.openstack.svc.cluster.local \
-P 6446 watcher_api \
-e "SELECT count(*), state FROM information_schema.processlist
GROUP BY state;"
Run the query twice, ideally allowing watcher to run overnight between
samples.
Pass criterion: the count for the empty-state row (sleeping connections)
stays bounded (under ~15) across both samples and does not trend upward.
On the broken package, this count grows by ~4 per minute
and exceeds the pool ceiling of 52 within ~15 minutes after a pod
restart, at which point Check 3 fails.
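The bound in Check 2 can also be evaluated mechanically. The sketch below assumes the mysql client prints tab-separated batch output (a header row, then one "count<TAB>state" row per state), which is its usual behaviour when not attached to a terminal. The function names and the default bound of 15 are illustrative choices taken from the numbers above. It checks only the bound; whether the count is trending upward is still left to the tester.

```python
def sleeping_count(processlist_output):
    """Return the count on the empty-state row, or 0 if no such row exists."""
    for line in processlist_output.splitlines():
        # Do not strip the line: the empty state is the text after the tab.
        fields = line.split("\t")
        if len(fields) == 2 and fields[0].isdigit() and fields[1] == "":
            return int(fields[0])
    return 0


def check2_bounded(first_sample, second_sample, bound=15):
    """True if the sleeping-connection count is under the bound in both samples."""
    return (sleeping_count(first_sample) < bound
            and sleeping_count(second_sample) < bound)


def minutes_to_exhaust(pool_size=2, overflow=50, leak_per_minute=4):
    """Rough time for a steady leak to consume pool_size + overflow connections."""
    return (pool_size + overflow) / leak_per_minute
```

With the figures quoted above (ceiling of 52, roughly 4 leaked connections per minute), minutes_to_exhaust() gives 13.0, in line with the observed ~15 minutes.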
------------------------------------------------------------------------
Check 3 -- watcher-decision-engine reports ACTIVE
------------------------------------------------------------------------
Run:
openstack optimize service list
Pass criterion: every watcher-decision-engine row shows Status =
ACTIVE.
[Regression Potential]
In order to mitigate the regression potential, the results of the
aforementioned OpenStack CI tests are attached to this bug.
The bulk of the change is further-database-refactoring.patch, which
rewrites the SQLAlchemy session lifecycle in watcher/db/sqlalchemy/api.py
to use the enginefacade reader/writer context managers. It is a direct
cherry-pick from upstream stable/2025.1 and has been in upstream watcher
since 14.0.0.0rc1 (February 2025) without revert. The prerequisite patch
(replace-deprecated-legacy-enginefacade.patch) is 8 lines and only swaps
a deprecated facade-construction call site.
Two regression modes are possible, both low risk:
* Session lifetime is now scoped to the context manager. Out-of-tree
code that uses ORM objects after the helper returns may raise
DetachedInstanceError. All in-tree callers were migrated as part
of the patch.
* The model_query() helper has been removed. Out-of-tree code that
imports it will fail at import time.
[Discussion]
The leak originates in watcher/db/sqlalchemy/api.py: model_query()
calls get_session() and returns the query without ever closing the
session. Every Audit.list() call therefore leaks one connection, and
ContinuousAuditHandler.launch_audits_periodically calls it twice per
tick. Upstream addressed this by migrating the DB layer to oslo_db's
enginefacade reader/writer context-manager API (Change-Id
Ib5e9aa288232cc1b766bbf2a8ce2113d5a8e2f7d, upstream LP #2067815),
which auto-closes sessions on context exit.
That fix shipped in 14.0.0.0rc1 (Epoxy / 2025.1) and is cherry-picked
here, together with its prerequisite (Change-Id
I5570698262617eae3f48cf29aacf2e23ad541e5f, "Replace deprecated
LegacyEngineFacade").
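
The difference between the two patterns can be illustrated with a toy model. The sketch below does not use SQLAlchemy or oslo.db, and ToyPool, leaky_query, reader and scoped_query are hypothetical names invented for this illustration. It only shows why a helper that never releases what it checks out exhausts a bounded pool, while a context manager that releases on exit does not.

```python
from contextlib import contextmanager


class ToyPool:
    """Stand-in for a bounded connection pool (not a real SQLAlchemy pool)."""

    def __init__(self, limit):
        self.limit = limit
        self.checked_out = 0

    def checkout(self):
        if self.checked_out >= self.limit:
            raise TimeoutError("pool limit of %d reached" % self.limit)
        self.checked_out += 1

    def checkin(self):
        self.checked_out -= 1


def leaky_query(pool):
    # Mirrors the described bug: a connection is taken and never returned.
    pool.checkout()
    return "rows"


@contextmanager
def reader(pool):
    # Mirrors the context-manager pattern: release on exit, even on error.
    pool.checkout()
    try:
        yield pool
    finally:
        pool.checkin()


def scoped_query(pool):
    with reader(pool):
        return "rows"
```

Calling leaky_query 52 times against ToyPool(52) leaves every slot checked out and makes the next call raise TimeoutError, whereas any number of scoped_query calls leaves checked_out at 0.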
--------------------------------------------------------------------------------
Original Bug Report Content Below
--------------------------------------------------------------------------------
In the Newton release, a background job scheduler was added to the
Decision Engine:
https://github.com/openstack/watcher/commit/06c6c4691b103bf0b3fd3304a1a45fb22aedad50
To facilitate this, the apscheduler library was introduced as a
dependency of watcher. apscheduler has a lot of capability but does not
officially support eventlet. Since its introduction to watcher it has
mostly worked, partly by accident.
Over the years, as oslo, apscheduler and eventlet have evolved and
adapted to newer Python releases, watcher has continued to use
apscheduler even though that combination is not technically supported.
With the move to Python 3.12 it became apparent that the background jobs
executed on the apscheduler BackgroundScheduler instances were accessing
shared global state from a non-monkey-patched native thread.
That results in greenthreads sometimes calling into objects that are
using un-monkey-patched code.
For example, oslo.db uses time.sleep to yield execution.
When that oslo.db function is first imported from a non-patched thread
and is later invoked in the main thread, it will block.
This can be seen in the exception "RuntimeError: do not call blocking
functions from the mainloop" here:
https://paste.opendev.org/show/bGPgfURx1cZYOsgmtDyw/
This has been reproduced in CI as part of moving the CI jobs to Ubuntu
24.04 and Python 3.12:
https://review.opendev.org/c/openstack/watcher/+/932963/comments/f54005d7_b0f831bb
To address this issue we need to ensure that the background thread
used to schedule background tasks is properly monkey patched.
To manage notifications about this bug go to:
https://bugs.launchpad.net/watcher/+bug/2086710/+subscriptions