[Bug 1927868] Re: vRouter not working after update to 16.3.1

Edward Hope-Morley 1927868 at bugs.launchpad.net
Fri Aug 20 11:24:10 UTC 2021


@christian-rohmann The problem essentially boils down to the exception
at [1] being raised because [2] was called earlier as a result of a
timeout exception that the code does not actually catch. This was
traced to a privileged call being passed as an argument to [3] from [4]
(which is in the patch we reverted).
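To make the failure mode concrete, here is a minimal Python sketch of a
client channel whose single reader thread hands replies to waiting
callers. It is not oslo.privsep's code. `NaiveChannel`, `ReplyTimeout`,
`call` and the `outstanding` dict are hypothetical names, a
`queue.Queue` stands in for the privsep socket, sending the request is
not modeled, and the exception type raised by the reader is arbitrary.
It models only the reader-thread half of the problem: a caller gives up
after a timeout and deregisters, its late reply then arrives for an
unknown message id, the reader raises, and the thread dies.

```python
import queue
import threading


class ReplyTimeout(Exception):
    """Hypothetical stand-in for a privsep-style timeout exception."""


class NaiveChannel:
    """Hypothetical client channel: one reader thread hands each reply
    from a shared inbound queue to the caller waiting on its msgid."""

    def __init__(self, inbound):
        self.inbound = inbound      # stands in for the daemon socket
        self.lock = threading.Lock()
        self.outstanding = {}       # msgid -> (Event, result list)
        self.reader = threading.Thread(target=self._reader_main, daemon=True)
        self.reader.start()

    def _reader_main(self):
        while True:
            msgid, data = self.inbound.get()
            with self.lock:
                if msgid not in self.outstanding:
                    # A reply whose caller already timed out: raising here
                    # terminates the only thread that reads replies.
                    raise RuntimeError(f"unexpected reply for {msgid!r}")
                event, result = self.outstanding[msgid]
                result.append(data)
                event.set()

    def call(self, msgid, timeout):
        # Register a slot for the reply (sending the request is not modeled).
        event, result = threading.Event(), []
        with self.lock:
            self.outstanding[msgid] = (event, result)
        try:
            if not event.wait(timeout):
                raise ReplyTimeout(f"no reply for {msgid!r} in {timeout}s")
            return result[0]
        finally:
            # The caller always deregisters, so a late reply finds no slot.
            with self.lock:
                del self.outstanding[msgid]


if __name__ == "__main__":
    inbound = queue.Queue()
    ch = NaiveChannel(inbound)
    try:
        ch.call("req-1", timeout=0.1)      # nobody answers in time
    except ReplyTimeout as exc:
        print("caller:", exc)
    inbound.put(("req-1", "late reply"))  # reply arrives after the caller gave up
    ch.reader.join(timeout=1)
    print("reader alive:", ch.reader.is_alive())
```

Run as a script, the interpreter prints the reader thread's traceback
and then `reader alive: False`. From that point on every `call` times
out, even if its reply is already queued, because nothing is left to
read it.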

So the *real* problem with the privsep code is that if an unexpected
exception is raised, it is not caught, which kills the reader thread,
leaves the lock unreleased, or both. A separate bug [5] was raised
about the same issue and led to the fix [6] being added to privsep;
crucially, that fix replaces the raised AttributeError with a continue,
which stops it from killing the reader thread. I have not yet tested
whether this actually fixes all the agent issues we have seen, though.
And while we should take that fix, there is still room for improvement
in the privsep code, namely [7], which should have an except clause
that, if nothing else, logs a message saying that the message timed
out.
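A hardened variant of the same kind of sketch applies both ideas from
the paragraph above: the reader skips a reply it cannot match (the
"continue instead of raise" idea behind [6]), and the caller logs that
its message timed out before raising (the kind of logging suggested for
[7]). This is not the actual oslo.privsep patch. `HardenedChannel`,
`ReplyTimeout` and `call` are hypothetical names, and a `queue.Queue`
stands in for the privsep socket.

```python
import logging
import queue
import threading

LOG = logging.getLogger("channel_sketch")


class ReplyTimeout(Exception):
    """Hypothetical stand-in for a privsep-style timeout exception."""


class HardenedChannel:
    """Hypothetical client channel whose reader survives stray replies."""

    def __init__(self, inbound):
        self.inbound = inbound      # stands in for the daemon socket
        self.lock = threading.Lock()
        self.outstanding = {}       # msgid -> (Event, result list)
        self.reader = threading.Thread(target=self._reader_main, daemon=True)
        self.reader.start()

    def _reader_main(self):
        while True:
            msgid, data = self.inbound.get()
            with self.lock:
                slot = self.outstanding.get(msgid)
                if slot is None:
                    # Skip the stray reply instead of raising, so the
                    # reader thread keeps serving later callers.
                    LOG.warning("dropping reply for unknown msgid %r", msgid)
                    continue
                event, result = slot
                result.append(data)
                event.set()

    def call(self, msgid, timeout):
        event, result = threading.Event(), []
        with self.lock:
            self.outstanding[msgid] = (event, result)
        try:
            if not event.wait(timeout):
                # At minimum, leave a trace that this message timed out.
                LOG.warning("message %r timed out after %ss", msgid, timeout)
                raise ReplyTimeout(f"no reply for {msgid!r} in {timeout}s")
            return result[0]
        finally:
            # Deregister on every path; pop() tolerates a missing slot and
            # the with-block releases the lock even if something raises.
            with self.lock:
                self.outstanding.pop(msgid, None)


if __name__ == "__main__":
    inbound = queue.Queue()
    ch = HardenedChannel(inbound)
    try:
        ch.call("req-1", timeout=0.1)
    except ReplyTimeout:
        pass
    inbound.put(("req-1", "late reply"))   # stray reply: logged and skipped
    # Simulate the daemon answering req-2 shortly after it is sent.
    threading.Timer(0.1, inbound.put, args=(("req-2", "ok"),)).start()
    print(ch.call("req-2", timeout=2))
```

Run as a script, this should log two warnings (the timeout and the
dropped reply) and then print `ok`, because the reader thread is still
alive to deliver the second reply.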

[1] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L141
[2] https://github.com/openstack/oslo.privsep/blob/6d41ef9f91b297091aa37721ba10456142fc5107/oslo_privsep/comm.py#L174
[3] https://github.com/openstack/neutron/blob/d4b1b4a0729c187551e1fa2b2855db136456d496/neutron/common/utils.py#L689
[4] https://github.com/openstack/neutron/blob/d8f1f1118d3cde0b5264220836a250f14687893e/neutron/agent/linux/interface.py#L328
[5] https://bugs.launchpad.net/neutron/+bug/1930401
[6] https://github.com/openstack/oslo.privsep/commit/f7f3349d6a4def52f810ab1728879521c12fe2d0
[7] https://github.com/openstack/oslo.privsep/blob/f7f3349d6a4def52f810ab1728879521c12fe2d0/oslo_privsep/comm.py#L189

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/1927868

Title:
  vRouter not working after update to 16.3.1

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in Ubuntu Cloud Archive wallaby series:
  Fix Released
Status in Ubuntu Cloud Archive xena series:
  Fix Released
Status in neutron:
  In Progress
Status in oslo.privsep:
  New
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  Fix Released
Status in neutron source package in Hirsute:
  Fix Released
Status in neutron source package in Impish:
  Fix Released

Bug description:
  We run a Juju-managed OpenStack Ussuri deployment on Bionic. After
  updating the neutron packages from 16.3.0 to 16.3.1, all virtual
  routers stopped working. Most (but not all) namespaces seem to be
  created, but they contain only the lo interface and sometimes the
  ha-XYZ interface in DOWN state. The underlying tap interfaces are also
  down.

  neutron-l3-agent logs many messages similar to the following:
  2021-05-08 15:01:45.286 39411 ERROR neutron.agent.l3.ha_router [-] Gateway interface for router 02945b59-639b-41be-8237-3b7933b4e32d was not set up; router will not work properly

  and the journal reports, at around the same time:
  May 08 15:01:40 lar1615.srv-louros.grnet.gr neutron-keepalived-state-change[18596]: 2021-05-08 15:01:40.765 18596 INFO neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 62.62.62.62 on qg-5a6efe8c-6b in namespace qrouter-02945b59-639b-41be-8237-3b7933b4e32d: Exit code: 2; Stdin: ; Stdout: Interface "qg-5a6efe8c-6b" is down
  May 08 15:01:40 lar1615.srv-louros.grnet.gr neutron-keepalived-state-change[18596]: 2021-05-08 15:01:40.767 18596 INFO neutron.agent.linux.ip_lib [-] Interface qg-5a6efe8c-6b or address 62.62.62.62 in namespace qrouter-02945b59-639b-41be-8237-3b7933b4e32d was deleted concurrently

  The neutron packages installed are:

  ii  neutron-common                         2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - common
  ii  neutron-dhcp-agent                     2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - DHCP agent
  ii  neutron-l3-agent                       2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - l3 agent
  ii  neutron-metadata-agent                 2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - metadata agent
  ii  neutron-metering-agent                 2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - metering agent
  ii  neutron-openvswitch-agent              2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - Open vSwitch plugin agent
  ii  python3-neutron                        2:16.3.1-0ubuntu1~cloud0                                    all          Neutron is a virtual network service for Openstack - Python library
  ii  python3-neutron-lib                    2.3.0-0ubuntu1~cloud0                                       all          Neutron shared routines and utilities - Python 3.x
  ii  python3-neutronclient                  1:7.1.1-0ubuntu1~cloud0                                     all          client API library for Neutron - Python 3.x

  Downgrading to 16.3.0 resolves the issues.

  =================================

  Ubuntu SRU details:

  [Impact]
  See above.

  [Test Case]
  Deploy OpenStack with l3ha and create several HA routers; the number required varies per environment. It is probably best to deploy a known-bad version of the package, confirm that it fails, upgrade to the version in -proposed, and re-test several times to confirm it is fixed.

  After restarting neutron-l3-agent, all HA routers should be restored.

  [Regression Potential]
  This change is fixing a regression by reverting a patch that was introduced in a stable point release of neutron.

More information about the Ubuntu-openstack-bugs mailing list