[Bug 1731595] Re: L3 HA: multiple agents are active at the same time

Xav Paice xav.paice at canonical.com
Wed Dec 13 03:15:22 UTC 2017


We have installed the Ocata -proposed package; however, the situation is
this:

- there are 464 routers configured on 3 Neutron gateway hosts using l3-ha, and each router is scheduled to all 3 hosts.
- we installed the package because we were in the middle of an incident with multiple l3 agents active, hoping the package update would solve the problem.  One of the gateway hosts was also being rebooted at the time, to try to do a King Canute and halt the tidal wave of arp.
- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.
- Removing/re-adding the routers to agents seemed to clean things up.  We saw some routers with multiple agents active, and some with none active (all 3 agents 'standby').
- After a few iterations of that, things cleaned up.
- 15-20 mins later, we saw more routers with multiple agents active (ones which weren't before), and ran through the same cleanup steps.  At this time, there were a large number of keepalived messages in syslog, particularly routers becoming MASTER then BACKUP again. (https://pastebin.canonical.com/205361/)
- after another hour or two, we're still clean.
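
A rough sketch of how the "multiple active" and "none active" routers
described above could be picked out. It assumes the per-agent HA state of
each router has already been collected into a dict; the function name, the
dict layout and the router names are all hypothetical, and nothing here
talks to Neutron itself.

```python
def classify_routers(ha_states):
    """Group routers by how many of their L3 agents report 'active'.

    ha_states is an assumed input format: a dict mapping a router ID to
    the list of HA states ('active' or 'standby') reported by the agents
    hosting that router.
    """
    result = {"multiple_active": [], "none_active": [], "ok": []}
    for router_id, states in sorted(ha_states.items()):
        active = sum(1 for state in states if state == "active")
        if active > 1:
            result["multiple_active"].append(router_id)
        elif active == 0:
            result["none_active"].append(router_id)
        else:
            result["ok"].append(router_id)
    return result


# Hypothetical sample data: three routers, each scheduled to three agents.
sample = {
    "router-a": ["active", "standby", "standby"],
    "router-b": ["active", "active", "standby"],
    "router-c": ["standby", "standby", "standby"],
}
report = classify_routers(sample)
```

Routers in "multiple_active" or "none_active" would be the candidates for
the remove/re-add cleanup.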

I can't tell at this stage whether the fix actually fixed the problem or
not; I need to dig further to find out whether some process could have
been running cleanups.

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1731595

Title:
  L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive newton series:
  Fix Committed
Status in Ubuntu Cloud Archive ocata series:
  Fix Committed
Status in Ubuntu Cloud Archive pike series:
  Fix Committed
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in neutron:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Xenial:
  Triaged
Status in neutron source package in Zesty:
  Fix Committed
Status in neutron source package in Artful:
  Fix Committed
Status in neutron source package in Bionic:
  Fix Released

Bug description:
  OS: Xenial, Ocata from Ubuntu Cloud Archive
  We have three neutron-gateway hosts, with L3 HA enabled and a min of 2, max of 3.  There are approx. 400 routers defined.

  At some point (we weren't monitoring exactly) a number of the routers
  changed from being one active, and 1+ others standby, to >1 active.
  This included each of the 'active' namespaces having the same IP
  addresses allocated, and therefore traffic problems reaching
  instances.

  Removing the routers from all but one agent, and re-adding, resolved
  the issue.  Restarting one l3 agent also appeared to resolve the
  issue, but very slowly, to the point where we needed the system alive
  again faster and reverted to removing/re-adding.

  At the same time, a number of routers were listed without any agents
  active at all.  This situation appears to have been resolved by adding
  routers to agents, after several minutes downtime.

  I'm finding it very difficult to find relevant keepalived messages to
  indicate what's going on, but what I do notice is that all the agents
  have equal priority and are configured as 'backup'.
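
One possible way to narrow down the keepalived noise is to count how often
each router becomes MASTER. The sketch below assumes the state changes have
already been extracted from syslog, in time order, as (router_id, state)
pairs; that input format, the function name and the sample router names are
hypothetical, and no log parsing is attempted.

```python
from collections import defaultdict


def count_master_transitions(events):
    """Count transitions into MASTER per router.

    events is an assumed, already-parsed sequence of (router_id, state)
    pairs in time order, where state is 'MASTER' or 'BACKUP'.
    """
    last_state = {}
    counts = defaultdict(int)
    for router_id, state in events:
        # Only count a change into MASTER, not repeated MASTER reports.
        if state == "MASTER" and last_state.get(router_id) != "MASTER":
            counts[router_id] += 1
        last_state[router_id] = state
    return dict(counts)


# Hypothetical sample: r1 flaps MASTER -> BACKUP -> MASTER, r2 settles once.
sample_events = [
    ("r1", "MASTER"),
    ("r1", "BACKUP"),
    ("r1", "MASTER"),
    ("r2", "BACKUP"),
    ("r2", "MASTER"),
]
flaps = count_master_transitions(sample_events)
```

A router with a count well above 1 over a short period would be one that
is flapping between MASTER and BACKUP.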

  I am trying to figure out a way to get a reproducer of this; it might
  be that we need to have a large number of routers configured on a
  small number of gateways.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1731595/+subscriptions


