[Bug 1993149] Re: VMs stay stuck in scheduling when rabbitmq leader unit is down

Fri Oct 28 03:04:36 UTC 2022

I can confirm this fixes the issue I encountered in my lab and on a bigger Focal-Yoga deployment.
I will try to run a quick test on Jammy Yoga and Zed.

One quick note: after installing the package update, many services must be restarted so that they load the updated Python library before you test any RabbitMQ interruption.
This is the list of services I restart to ensure the fix is in effect:
juju run -a nova-compute sudo systemctl restart nova-compute nova-api-metadata ceilometer-agent
juju run -a nova-cloud-controller sudo systemctl restart nova-scheduler nova-conductor apache2
juju run -a neutron-api sudo systemctl restart neutron-server
juju run -a glance sudo systemctl restart glance-api
juju run -a cinder sudo systemctl restart cinder-volume cinder-scheduler apache2
juju run -a octavia sudo systemctl restart octavia-worker
juju run -a masakari sudo systemctl restart masakari-engine apache2
juju run -a heat sudo systemctl restart heat-api heat-engine
juju run -a aodh sudo systemctl restart aodh-listener aodh-notifier
juju run -a designate sudo systemctl restart designate-api designate-mdns designate-worker designate-agent
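
For repeated test runs, the list above can be driven from one small script. This is only a sketch: it reuses the exact `juju run -a <application> sudo systemctl restart ...` form shown above, while the `DRY_RUN` switch and the function names are my own additions for illustration. Applications that are not deployed in a given cloud will simply report a failure.

```shell
#!/bin/sh
# Sketch: restart the services listed above, one Juju application at a time.
# DRY_RUN is an assumption of this sketch; with DRY_RUN=1 (the default)
# the commands are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# One "application:services" entry per line, copied from the list above.
restart_list() {
  cat <<'EOF'
nova-compute:nova-compute nova-api-metadata ceilometer-agent
nova-cloud-controller:nova-scheduler nova-conductor apache2
neutron-api:neutron-server
glance:glance-api
cinder:cinder-volume cinder-scheduler apache2
octavia:octavia-worker
masakari:masakari-engine apache2
heat:heat-api heat-engine
aodh:aodh-listener aodh-notifier
designate:designate-api designate-mdns designate-worker designate-agent
EOF
}

# Print the juju command for a single "application:services" entry.
restart_cmd() {
  app="${1%%:*}"
  services="${1#*:}"
  printf 'juju run -a %s sudo systemctl restart %s\n' "$app" "$services"
}

# Print every command and, unless DRY_RUN=1, execute it as well.
restart_all() {
  restart_list | while IFS= read -r entry; do
    cmd="$(restart_cmd "$entry")"
    echo "$cmd"
    if [ "$DRY_RUN" != "1" ]; then
      # stdin is redirected so the command cannot consume the rest of the list
      $cmd </dev/null || echo "FAILED: $cmd" >&2
    fi
  done
}

restart_all
```

Run it once as is to review the commands, then with `DRY_RUN=0` to perform the restarts.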

-- 
https://bugs.launchpad.net/bugs/1993149

Title:
  VMs stay stuck in scheduling when rabbitmq leader unit is down

Status in OpenStack RabbitMQ Server Charm:
  Triaged
Status in Ubuntu Cloud Archive:
  Fix Committed
Status in Ubuntu Cloud Archive yoga series:
  Fix Committed
Status in Ubuntu Cloud Archive zed series:
  Fix Committed
Status in oslo.messaging:
  New
Status in python-oslo.messaging package in Ubuntu:
  Fix Committed
Status in python-oslo.messaging source package in Jammy:
  Fix Committed
Status in python-oslo.messaging source package in Kinetic:
  Fix Committed

Bug description:
  When testing rabbitmq-server HA in our OpenStack Yoga cloud
  environment (Rabbitmq Server release 3.9/stable) we faced the
  following issues:

  - When the leader unit is down we are unable to launch any VMs and the
  launched ones stay stuck in the 'BUILD' state.

  - While checking the logs we see that several OpenStack services have
  issues communicating with the rabbitmq-server

  - After restarting all the services using rabbitmq (like Nova, Cinder,
  Neutron etc) the issue gets resolved and the VMs can be launched
  successfully

  The corresponding logs are available at:
  https://pastebin.ubuntu.com/p/Bk3yktR8tp/

  We also observed the same behaviour for the rabbitmq-server unit that
  is listed first in the 'nova.conf' file; after restarting that
  rabbitmq unit, scheduling of VMs works fine again.

  This can also be seen in this part of the log:
  "Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922."

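To check which broker a service lists first and which one it actually reconnected to, something like the following may help. It is a sketch under assumptions: it expects the usual oslo.messaging `transport_url` form with comma-separated `user:password@host:port` entries, and the user, password, vhost and the .252/.253 addresses are made up for illustration (only 192.168.34.251:5672 comes from the log line above). The function names are hypothetical.

```shell
#!/bin/sh
# List broker host:port pairs in the order they appear in a transport_url.
transport_hosts() {
  url="${1#*://}"    # drop the scheme, e.g. "rabbit://"
  url="${url%%/*}"   # drop the trailing "/vhost"
  # one entry per line, with any "user:password@" prefix removed
  printf '%s\n' "$url" | tr ',' '\n' | sed 's/^.*@//'
}

# Extract the broker address from a "Reconnected to AMQP server on ..." line.
reconnect_host() {
  printf '%s\n' "$1" |
    sed -n 's/.*Reconnected to AMQP server on \([^ ]*\) via.*/\1/p'
}

# Hypothetical transport_url value, as it might appear in nova.conf.
URL='rabbit://nova:secret@192.168.34.251:5672,nova:secret@192.168.34.252:5672,nova:secret@192.168.34.253:5672/openstack'

transport_hosts "$URL"
reconnect_host 'Reconnected to AMQP server on 192.168.34.251:5672 via [amqp] client with port 41922.'
```

If the address printed by `reconnect_host` matches the first line printed by `transport_hosts`, the service went back to the first broker in its list, which is consistent with the observation above.
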
  ====== Ubuntu SRU Details =======

  [Impact]
  Active/active HA for rabbitmq is broken when a node goes down. 

  [Test Case]
  Deploy OpenStack with 3 units of rabbitmq in active/active HA.

  [Regression Potential]
  Due to the criticality of this issue, I've decided to revert the upstream change that is causing the problem as a stop-gap until a proper fix is in place. The change being reverted came in via https://bugs.launchpad.net/oslo.messaging/+bug/1935864. As a result of the revert, we may see performance degradation in polling as described in that bug.
