[Bug 1961402] Re: Hanging services when connectivity to RabbitMQ lost
OpenStack Infra
1961402 at bugs.launchpad.net
Fri Oct 13 16:27:34 UTC 2023
Reviewed: https://review.opendev.org/c/openstack/oslo.messaging/+/880188
Committed: https://opendev.org/openstack/oslo.messaging/commit/4d15b7c4fe0c14e285484d23c15fe5531e952679
Submitter: "Zuul (22348)"
Branch: stable/zed
commit 4d15b7c4fe0c14e285484d23c15fe5531e952679
Author: Arnaud Morin <arnaud.morin at ovhcloud.com>
Date: Fri Mar 3 11:16:56 2023 +0100
Disable greenthreads for RabbitDriver "listen" connections
When enabling heartbeat_in_pthread, we were restoring the eventlet-patched
"threading" Python library to the original one in RabbitDriver, but we
forgot to do the same in AMQPDriverBase (RabbitDriver is a subclass of
AMQPDriverBase).
We also need to use the original "queue" module so that queues do not
run on greenthreads either.
Related-bug: #1961402
Related-bug: #1934937
Closes-bug: #2009138
Signed-off-by: Arnaud Morin <arnaud.morin at ovhcloud.com>
Change-Id: I34ea0d1381e934297df2f793e0d2594ef8254f00
(cherry picked from commit 3645839d162a989da46ba1c0fcbe7f32b48b9fb1)
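For background on the mechanism the fix relies on: when eventlet monkey-patches a process, stdlib modules such as "threading" and "queue" are replaced with green versions, so anything built on them runs in greenthreads. The sketch below is a minimal illustration of the technique, not the actual oslo.messaging code (names are illustrative): eventlet.patcher.original() recovers the unpatched modules, so the heartbeat and listener machinery can run on real OS threads when heartbeat_in_pthread is enabled.
    # Hedged sketch of the technique behind the fix; not the
    # oslo.messaging implementation itself.
    import eventlet
    eventlet.monkey_patch()  # "threading"/"queue" are now green versions

    from eventlet import patcher

    # patcher.original() returns the stdlib module as it existed
    # before eventlet monkey-patched it.
    real_threading = patcher.original('threading')
    real_queue = patcher.original('queue')

    incoming = real_queue.Queue()  # backed by a real lock, not a green one

    def heartbeat():
        # Runs in a native pthread, so it keeps beating even while
        # the eventlet hub is blocked on a dead TCP connection.
        incoming.put('beat')

    t = real_threading.Thread(target=heartbeat, daemon=True)
    t.start()
    print(incoming.get())
The heartbeat_in_pthread option itself is a standard oslo.messaging setting in the service configuration:
    [oslo_messaging_rabbit]
    heartbeat_in_pthread = true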
** Tags added: in-stable-zed
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1961402
Title:
Hanging services when connectivity to RabbitMQ lost
Status in Ubuntu Cloud Archive:
New
Status in oslo.messaging:
Fix Released
Bug description:
# Versions
- oslo.messaging 12.9.1
- RabbitMQ 3.9.8
- Ubuntu 20.04
Hi,
We are observing that services fail to recover if they lose connectivity to the RabbitMQ cluster. We have seen this across Nova, Neutron, and Cinder services in particular, across all of our deployments. When this occurs, the following greenlet-related traceback always appears in the service logs, after a number of reconnection-related messages (example from nova-compute):
Feb 18 08:42:33 compute102 nova-compute[1402787]: 2022-02-18 08:42:33.514 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:34 compute102 nova-compute[1402787]: 2022-02-18 08:42:34.517 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.050 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.520 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.052 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.521 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:37 compute102 nova-compute[1402787]: 2022-02-18 08:42:37.053 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.055 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.524 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:39 compute102 nova-compute[1402787]: 2022-02-18 08:42:39.526 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.058 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.527 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:41 compute102 nova-compute[1402787]: 2022-02-18 08:42:41.060 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:42 compute102 nova-compute[1402787]: 2022-02-18 08:42:42.062 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:44 compute102 nova-compute[1402787]: 2022-02-18 08:42:44.532 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:45 compute102 nova-compute[1402787]: 2022-02-18 08:42:45.534 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.067 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.536 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:47 compute102 nova-compute[1402787]: 2022-02-18 08:42:47.068 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:48 compute102 nova-compute[1402787]: 2022-02-18 08:42:48.070 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:52 compute102 nova-compute[1402787]: 2022-02-18 08:42:52.543 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:53 compute102 nova-compute[1402787]: 2022-02-18 08:42:53.545 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.077 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.546 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:55 compute102 nova-compute[1402787]: 2022-02-18 08:42:55.079 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:56 compute102 nova-compute[1402787]: 2022-02-18 08:42:56.080 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.700 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.701 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.702 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 101] ENETUNREACH (retrying in 0 seconds): OSError: [Errno 101] ENETUNREACH
Feb 18 08:42:58 compute102 nova-compute[1402787]: Traceback (most recent call last):
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
Feb 18 08:42:58 compute102 nova-compute[1402787]: timer()
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
Feb 18 08:42:58 compute102 nova-compute[1402787]: cb(*args, **kw)
Feb 18 08:42:58 compute102 nova-compute[1402787]: File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
Feb 18 08:42:58 compute102 nova-compute[1402787]: waiter.switch()
Feb 18 08:42:58 compute102 nova-compute[1402787]: greenlet.error: cannot switch to a different thread
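For context, the final "greenlet.error" is a well-known greenlet constraint: a greenlet is bound to the OS thread that created it, and switching to it from any other thread raises exactly this error. A minimal, self-contained reproduction of that class of failure (illustrative only, not the nova-compute code path) is:
    # Reproduce "greenlet.error: cannot switch to a different thread".
    import threading

    import greenlet

    main_glet = greenlet.getcurrent()  # bound to the main thread

    def other_thread():
        try:
            main_glet.switch()  # switching from a different OS thread
        except greenlet.error as exc:
            print(exc)  # -> cannot switch to a different thread

    t = threading.Thread(target=other_thread)
    t.start()
    t.join()
In the traceback above, the eventlet hub fires a timer that tries to switch to a semaphore waiter belonging to a different OS thread, which fails in the same way; the commit above avoids this by keeping the listen-connection machinery on native threads.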
Typically, if the RabbitMQ cluster is taken down, this impacts ~5% of
the services in the deployment, all of which must be restarted in
order to recover. Similar recovery issues have been seen when the
host's network interface is taken down and brought back up (which is
how the traceback above was generated).
As far as we can tell, this started to occur around the same time as
https://bugs.launchpad.net/oslo.messaging/+bug/1949964, i.e. around
the Wallaby OpenStack release. It also coincided with a switch from
TLSv1.0/v1.1 to TLSv1.2 in our RabbitMQ connections, plus a switch to
a full PKI infrastructure with certificate validation rather than
ignoring certificate errors.
Any suggestions for diagnosing this further would be appreciated.
Thanks
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1961402/+subscriptions