[Bug 1961402] Re: Hanging services when connectivity to RabbitMQ lost

OpenStack Infra 1961402 at bugs.launchpad.net
Mon Jan 29 18:56:13 UTC 2024


Reviewed:  https://review.opendev.org/c/openstack/oslo.messaging/+/854904
Committed: https://opendev.org/openstack/oslo.messaging/commit/44b3427bc9efea9f341edfb8ea7aea38f25d1a5a
Submitter: "Zuul (22348)"
Branch:    stable/yoga

commit 44b3427bc9efea9f341edfb8ea7aea38f25d1a5a
Author: Slawek Kaplonski <skaplons at redhat.com>
Date:   Fri Aug 5 12:40:40 2022 +0200

    Change default value of "heartbeat_in_pthread" to False
    
    As was reported in the related bug some time ago, setting that
    option to True for nova-compute can break it, as it is a non-wsgi
    service. We also noticed the same problem of randomly stuck
    non-wsgi services, e.g. neutron agents, and the same issue can
    probably happen with any other non-wsgi service.

    To avoid that, this patch changes the default value of that config
    option to False.
    Together with [1] it effectively reverts the change done in [2]
    some time ago.
    
    [1] https://review.opendev.org/c/openstack/oslo.messaging/+/800621
    [2] https://review.opendev.org/c/openstack/oslo.messaging/+/747395
    
    Related-Bug: #1934937
    Closes-Bug: #1961402
    
    Change-Id: I85f5b9d1b5d15ad61a9fcd6e25925b7eeb8bf6e7
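
On releases that still ship the old default of True, the same effect can be
obtained by setting the option explicitly. A minimal sketch for a service
config file such as nova.conf (the group name is oslo.messaging's standard
rabbit options group; adjust the file to the affected service):

    [oslo_messaging_rabbit]
    # Run the RabbitMQ heartbeat in the service's native event loop
    # instead of a separate pthread (matches the new default).
    heartbeat_in_pthread = false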


** Tags added: in-stable-yoga

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1961402

Title:
  Hanging services when connectivity to RabbitMQ lost

Status in Ubuntu Cloud Archive:
  New
Status in oslo.messaging:
  Fix Released

Bug description:
  # Versions
  - oslo.messaging 12.9.1
  - rabbitmq 3.9.8
  - ubuntu 20.04

  Hi,
  We are observing that services fail to recover after losing connectivity to the RabbitMQ cluster. We have seen this across Nova, Neutron and Cinder services in particular, and across all of our deployments. When this occurs, the following greenlet-related traceback is always seen in the service logs, following a number of reconnection-related messages (example for nova-compute):

  Feb 18 08:42:33 compute102 nova-compute[1402787]: 2022-02-18 08:42:33.514 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
  Feb 18 08:42:34 compute102 nova-compute[1402787]: 2022-02-18 08:42:34.517 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.050 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: . Trying again in 1 seconds.: socket.timeout
  Feb 18 08:42:35 compute102 nova-compute[1402787]: 2022-02-18 08:42:35.520 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.052 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:36 compute102 nova-compute[1402787]: 2022-02-18 08:42:36.521 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:37 compute102 nova-compute[1402787]: 2022-02-18 08:42:37.053 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.055 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 2 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:38 compute102 nova-compute[1402787]: 2022-02-18 08:42:38.524 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:39 compute102 nova-compute[1402787]: 2022-02-18 08:42:39.526 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.058 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:40 compute102 nova-compute[1402787]: 2022-02-18 08:42:40.527 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:41 compute102 nova-compute[1402787]: 2022-02-18 08:42:41.060 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:42 compute102 nova-compute[1402787]: 2022-02-18 08:42:42.062 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 4 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:44 compute102 nova-compute[1402787]: 2022-02-18 08:42:44.532 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:45 compute102 nova-compute[1402787]: 2022-02-18 08:42:45.534 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.067 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:46 compute102 nova-compute[1402787]: 2022-02-18 08:42:46.536 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:47 compute102 nova-compute[1402787]: 2022-02-18 08:42:47.068 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:48 compute102 nova-compute[1402787]: 2022-02-18 08:42:48.070 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 6 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:52 compute102 nova-compute[1402787]: 2022-02-18 08:42:52.543 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:53 compute102 nova-compute[1402787]: 2022-02-18 08:42:53.545 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.077 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.99:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:54 compute102 nova-compute[1402787]: 2022-02-18 08:42:54.546 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] [c21a4649-bc17-4648-82b4-e88743d61fc9] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:55 compute102 nova-compute[1402787]: 2022-02-18 08:42:55.079 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.98:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 1 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:56 compute102 nova-compute[1402787]: 2022-02-18 08:42:56.080 1402787 ERROR oslo.messaging._drivers.impl_rabbit [req-85fc671a-18e5-4b0d-9c4b-27562efafd2a - - - - -] [229e749d-adb2-4375-87ca-2f7235129935] AMQP server on 10.99.99.97:5671 is unreachable: [Errno 101] ENETUNREACH. Trying again in 8 seconds.: OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.700 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
  Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.701 1402787 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 110] Connection timed out
  Feb 18 08:42:58 compute102 nova-compute[1402787]: 2022-02-18 08:42:58.702 1402787 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 101] ENETUNREACH (retrying in 0 seconds): OSError: [Errno 101] ENETUNREACH
  Feb 18 08:42:58 compute102 nova-compute[1402787]: Traceback (most recent call last):
  Feb 18 08:42:58 compute102 nova-compute[1402787]:   File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
  Feb 18 08:42:58 compute102 nova-compute[1402787]:     timer()
  Feb 18 08:42:58 compute102 nova-compute[1402787]:   File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/hubs/timer.py", line 59, in __call__
  Feb 18 08:42:58 compute102 nova-compute[1402787]:     cb(*args, **kw)
  Feb 18 08:42:58 compute102 nova-compute[1402787]:   File "/openstack/venvs/nova-24.0.0.0rc1/lib/python3.8/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
  Feb 18 08:42:58 compute102 nova-compute[1402787]:     waiter.switch()
  Feb 18 08:42:58 compute102 nova-compute[1402787]: greenlet.error: cannot switch to a different thread
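
  The final "greenlet.error: cannot switch to a different thread" points at
  greenlets being bound to the OS thread that created them: a greenlet owned
  by one native thread cannot be switched to from another. A minimal sketch
  of that constraint (not oslo.messaging code, just the greenlet library
  behaviour we believe is hit when the heartbeat runs in a separate pthread):

    import threading
    import greenlet

    # The main thread's current (main) greenlet.
    main_greenlet = greenlet.getcurrent()

    def worker():
        # Switching to a greenlet owned by a different OS thread raises
        # greenlet.error ("cannot switch to a different thread").
        try:
            main_greenlet.switch()
        except greenlet.error as exc:
            print("greenlet.error:", exc)

    t = threading.Thread(target=worker)
    t.start()
    t.join()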


  Typically, if the RabbitMQ cluster is taken down, this will impact
  ~5% of the services in the deployment, all of which need to be
  restarted in order to recover. Similar recovery issues have been
  seen when the host's network interface is taken down and brought
  back up (which is how the above traceback was generated).

  As far as we can tell, this started to occur around the same time as
  https://bugs.launchpad.net/oslo.messaging/+bug/1949964, i.e. around
  the Wallaby OpenStack release. It coincided with a switch from
  TLSv1.0/v1.1 to TLSv1.2 for our RabbitMQ connections, and with a
  switch to a full PKI infrastructure with certificate validation
  rather than ignoring certificate errors.

  Any suggestions for diagnosing this further would be appreciated.

  Thanks

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1961402/+subscriptions
