[Bug 1744062] Re: L3 HA: multiple agents are active at the same time

Hua Zhang joshua.zhang at canonical.com
Wed Jan 24 09:44:20 UTC 2018


I have some thoughts on this problem, as below:

1. First of all, we need to figure out why, in theory, multiple ACTIVE
master HA nodes can appear at all.

Assume the master dies (at this point its status in the DB is still
ACTIVE); a slave will then be elected as the new master. After the old
master has recovered, self.enable_keepalived() (ha_router.py line 444,
see [4]) is invoked to spawn a keepalived instance again, so multiple
ACTIVE master HA nodes occur. (Related patch:
https://review.openstack.org/#/c/357458/)
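
To make the sequence concrete, here is a purely illustrative Python
sketch (all names invented; this is not neutron code) of how a stale
ACTIVE row plus the recovery path yields two ACTIVE masters:

    PORT_STATUS = {"node-1": "ACTIVE", "node-2": "STANDBY"}

    def master_dies(node):
        # VRRP advertisements stop, but nothing resets the DB row,
        # so node-1's HA port stays ACTIVE.
        pass

    def elect_new_master(node):
        # reported by neutron-keepalived-state-change on the new master
        PORT_STATUS[node] = "ACTIVE"

    def old_master_recovers(node):
        # l3-agent restarts and spawns keepalived again; until keepalived
        # settles into backup, the stale ACTIVE row remains.
        pass

    master_dies("node-1")
    elect_new_master("node-2")
    old_master_recovers("node-1")
    print(PORT_STATUS)  # both nodes show ACTIVE -> two masters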

So the key to solving this problem is to reset the status of all HA
ports to DOWN at an appropriate point in the code; the patch
https://review.openstack.org/#/c/470905/ was made to address this
point. But that patch sets status=DOWN in the code path
'fetch_and_sync_all_routers -> get_router_ids', which leads to a
bigger problem when the load is high.
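
The idea of that patch, as a hedged sketch (illustrative names only,
not the real server-side code):

    ha_ports = [
        {"id": "p1", "host": "node-1", "status": "ACTIVE"},
        {"id": "p2", "host": "node-2", "status": "STANDBY"},
    ]

    def get_router_ids(host):
        # Reset this host's HA ports before returning its routers, so a
        # stale ACTIVE from a dead master cannot survive the resync.
        for port in ha_ports:
            if port["host"] == host:
                port["status"] = "DOWN"
        return ["router-1"]

    get_router_ids("node-1")
    print(ha_ports)  # node-1's port is DOWN until its state is re-reported

The catch, as described next, is where this reset gets invoked.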

2. Why does setting status=DOWN in the code path
'fetch_and_sync_all_routers -> get_router_ids' lead to a bigger
problem when the load is high?

If the l3-agent is deemed not alive by the heartbeat check, it is
marked AGENT_REVIVED [1], and the l3-agent is then triggered to do a
full sync (self.fullsync=True) [2], so the code path
'periodic_sync_routers_task -> fetch_and_sync_all_routers' is invoked
again and again [3].

All these operations aggravate the load on the l3-agent, l2-agent, DB,
MQ, etc. Conversely, high load also makes the AGENT_REVIVED case more
likely.
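
A toy simulation of this feedback loop (not neutron code; the
probabilities are arbitrary):

    import random

    random.seed(1)
    load = 0.3                             # arbitrary starting load
    for tick in range(10):
        if random.random() < load:         # heartbeat missed under load
            load = min(1.0, load + 0.2)    # AGENT_REVIVED -> fullsync -> more load
            print("tick %d: revived, fullsync scheduled, load=%.1f"
                  % (tick, load))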

So it's a vicious circle; the patch
https://review.openstack.org/#/c/522792/ addresses this point. It
moves the reset to the code path '__init__ -> get_service_plugin_list
-> _update_ha_network_port_status', which runs once at agent startup,
instead of the code path 'periodic_sync_routers_task ->
fetch_and_sync_all_routers'.
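
Roughly, the difference between the two code paths, as a sketch with
invented names (not the actual patch):

    class L3AgentSketch:
        def __init__(self, host):
            self.host = host
            # one-off reset at startup, in the spirit of '__init__ ->
            # get_service_plugin_list -> _update_ha_network_port_status'
            self._reset_ha_port_status(host)

        def _reset_ha_port_status(self, host):
            print("reset HA ports on %s to DOWN (single startup call)" % host)

        def periodic_sync_routers_task(self):
            # no status reset in here any more, so repeated resyncs
            # under load no longer hammer the DB with port updates
            pass

    L3AgentSketch("node-1")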

3. As we have seen, a small heartbeat value can cause AGENT_REVIVED
events and thus aggravate the load, and high load can in turn cause
other problems, like the phenomena Xav mentioned before, which I paste
below:

- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.

This is just one example; there may be other circumstances. All of
these can mislead us into thinking the fix doesn't fix the problem.

High load can also cause other similar problems, for example:

a, it can cause the neutron-keepalived-state-change process to exit on
a TERM signal [5] (https://paste.ubuntu.com/26450042/).
neutron-keepalived-state-change monitors the VRRP VIP changes and then
reports the ha_router's status to neutron-server [6]; once it dies,
the l3-agent can no longer update the status of the HA ports, so we
can see multiple ACTIVE ports, multiple STANDBY ports, or other
inconsistent states.
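
For reference, a simplified sketch (invented helpers, not the real
monitor) of what neutron-keepalived-state-change does; if the process
is killed, these transitions are simply never reported:

    VIP = "169.254.0.1/24"    # illustrative VRRP VIP

    def notify_agent(state):
        # the real monitor reports over a unix socket read by the l3-agent
        print("router state -> %s" % state)

    def on_ip_event(event, cidr):
        # the VIP appearing on the HA interface means this node became
        # master; the VIP disappearing means it fell back to backup
        if cidr == VIP:
            notify_agent("master" if event == "added" else "backup")

    on_ip_event("added", VIP)      # -> router state -> master
    on_ip_event("removed", VIP)    # -> router state -> backup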

b, it can cause the RPC message sent from here [6] not to be handled
in time.


So for this problem, my concrete recommendations are:

a, bump up the heartbeat option (agent_down_time); see the config
sketch after this list

b, apply this patch: https://review.openstack.org/#/c/522641/

c, ensure that other components (like MQ, DB, etc.) have no
performance problems
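
For (a), something along these lines in neutron.conf; the exact
numbers are illustrative assumptions, not values taken from this
thread:

    [DEFAULT]
    # server side: how long before an agent is considered dead
    # (default 75; raising it avoids spurious AGENT_REVIVED under load)
    agent_down_time = 150

    [agent]
    # agent side: how often to report state; keep it at half or less
    # of agent_down_time (default 30)
    report_interval = 50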


[1] https://github.com/openstack/neutron/blob/stable/ocata/neutron/db/agents_db.py#L354
[2] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/agent.py#L736
[3] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/agent.py#L583
[4] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/ha_router.py#L444
[5] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/keepalived_state_change.py#L134
[6] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/keepalived_state_change.py#L71

https://bugs.launchpad.net/bugs/1744062

Title:
  L3 HA: multiple agents are active at the same time

Status in Ubuntu Cloud Archive:
  Triaged
Status in Ubuntu Cloud Archive mitaka series:
  Triaged
Status in Ubuntu Cloud Archive newton series:
  Triaged
Status in Ubuntu Cloud Archive ocata series:
  Triaged
Status in Ubuntu Cloud Archive pike series:
  Triaged
Status in Ubuntu Cloud Archive queens series:
  Triaged
Status in neutron:
  New
Status in neutron package in Ubuntu:
  Triaged
Status in neutron source package in Xenial:
  Triaged
Status in neutron source package in Artful:
  Triaged
Status in neutron source package in Bionic:
  Triaged

Bug description:
  This is the same issue reported in
  https://bugs.launchpad.net/neutron/+bug/1731595; however, that bug
  is marked 'Fix Released', the issue is still occurring, and I can't
  change it back to 'New', so it seems best to just open a new bug.

  It seems as if this bug surfaces due to load issues. While the fix
  provided by Venkata (https://review.openstack.org/#/c/522641/) should
  help clean things up at the time of l3 agent restart, issues seem to
  come back later down the line in some circumstances. xavpaice
  mentioned he saw multiple routers active at the same time when they
  had 464 routers configured on 3 neutron gateway hosts using L3HA, and
  each router was scheduled to all 3 hosts. However, jhebden mentions
  that things seem stable at the 400 L3HA router mark, and it's worth
  noting this is the same deployment that xavpaice was referring to.

  It seems to me that something is being pushed to its limit, and
  possibly once that limit is hit, master router advertisements aren't
  being received, causing a new master to be elected. If this is the
  case, it would be great to get to the bottom of what resource is
  getting constrained.



