[Bug 1953591] Re: neutron-ovn-metadata-agent does not respond on network until restarted after SB disconnects
Felipe Alencastro
1953591 at bugs.launchpad.net
Thu Aug 18 19:17:35 UTC 2022
** Also affects: neutron (Ubuntu)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1953591
Title:
neutron-ovn-metadata-agent does not respond on network until restarted
after SB disconnects
Status in Ubuntu Cloud Archive:
New
Status in networking-ovn:
New
Status in neutron package in Ubuntu:
New
Bug description:
Hi,
We are running OpenStack Xena in a highly dynamic environment with a
few hundred tenant networks and projects, and are using OVN set up as a
3-node cluster for the northbound and southbound databases as well as
the northd daemon.
With a few hundred instances, we started to notice that when starting
new instances, the instances get their DHCP information via OVN
OpenFlow rules, but cloud-init does not apply the initial
configuration (easily recognizable by looking at the console log: the
prompt at the end shows that the instance has neither its name
configured nor the ssh public key installed).
After some investigation we noticed that on these
instances the neutron-ovn-metadata-agent was not responding on IP
169.254.169.254. After restarting the agent on the corresponding host
hypervisor and waiting a few seconds, a simple reboot of the instance
would fix the issue.
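
For anyone trying to confirm the same symptom from inside an affected
instance, here is a minimal sketch of a reachability check. It is not
part of our setup: the function name and the metadata path are
assumptions (the path is the usual OpenStack metadata document), and
the opener argument only exists so the check can be exercised without
a network.

```python
import urllib.error
import urllib.request

# Assumed path: the usual OpenStack metadata document on the link-local address.
METADATA_URL = "http://169.254.169.254/openstack/latest/meta_data.json"


def metadata_reachable(url=METADATA_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the metadata service answers with HTTP 200 in time."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Refused connection, no route, timeout, or an HTTP error status:
        # treat all of these as "the agent is not answering".
        return False
```

Calling metadata_reachable() with no arguments from inside the instance
should return False while the agent is in the broken state and True
after the agent restart and instance reboot described above.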
It seems the instance metadata server is not reachable specifically
when we are starting an instance in a new network/subnet.
We then enabled the debug logs on the metadata agent and only noticed
that the agents are being disconnected from the SB DB then reconnected
immediately, but without any additional relevant log messages.
First we looked at our OVN cluster status and noticed that the cluster
was flapping very frequently (changing NB, SB and northd leaders) and
fixed that by adjusting the inactivity probe and election
timers.
Since then the OVN cluster is pretty stable and only changes leader
(and increments the term) when the SB leader voluntarily transfers
leadership to take a snapshot of the database, every few hours
according to the "Last election started" timer.
It seems the neutron-ovn-metadata-agent still breaks when the OVN SB
changes leaders (and SB is disconnected and reconnected again), and
does not respond in new networks until restarted.
My understanding is that the metadata agent usually creates a new
haproxy instance in a dedicated namespace on the host where the
instance is created, but fails to do so as soon as it's disconnected
from the SB DB, even after reconnecting to the new SB leader (almost
instantly).
The real issue here is that there are no logs other than the usual
disconnects/reconnects when this happens.
The "openstack network agent list" command also reports the agent as
down when this occurs, and for now we have had no choice other than
restarting the metadata agent every 5 minutes to make this issue
somewhat invisible to our end users.
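
A narrower alternative to the blanket restart would be to restart only
the agents that are reported down. The following is a rough sketch, not
what we run: it assumes the JSON output of "openstack network agent
list -f json" has "Host", "Binary" and "Alive" keys with a boolean
"Alive" value, and the host names in the sample are made up.

```python
import json

# Hypothetical sample of "openstack network agent list -f json" output.
SAMPLE = json.dumps([
    {"Host": "compute-1", "Binary": "neutron-ovn-metadata-agent", "Alive": False},
    {"Host": "compute-2", "Binary": "neutron-ovn-metadata-agent", "Alive": True},
    {"Host": "compute-3", "Binary": "ovn-controller", "Alive": False},
])


def down_metadata_agents(agent_list_json, binary="neutron-ovn-metadata-agent"):
    """Return the sorted hosts whose metadata agent is reported as not alive."""
    agents = json.loads(agent_list_json)
    return sorted(
        agent["Host"]
        for agent in agents
        if agent.get("Binary") == binary and not agent.get("Alive")
    )


if __name__ == "__main__":
    for host in down_metadata_agents(SAMPLE):
        print("metadata agent down on", host)
```

The restart itself is left out on purpose, since it depends on how the
agent is deployed (a container in our case).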
Has anyone else run into this issue?
We are running OpenStack Xena deployed using kolla-ansible, using container images (CentOS Stream 8 + OpenStack source) built with kolla. The relevant versions are:
Python neutron 19.0.1.dev8
Python ovs 2.13.3
Python ovsdbapp 1.12.0
haproxy 1.8.27-2.el8
openvswitch2.15-2.15.0-41.el8s
neutron_ovn_metadata_agent.ini:
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 10
[ovn]
ovn_nb_connection = tcp:XXX:6641,tcp:YYY:6641,tcp:ZZZ:6641
ovn_sb_connection = tcp:XXX:6642,tcp:YYY:6642,tcp:ZZZ:6642
ovn_metadata_enabled = true
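
The ovn_sb_connection value lists all three cluster members, and these
are the servers the agent cycles through when it reconnects. As an
illustration only, the sketch below reads the file with Python's
configparser and splits that value into (scheme, host, port) tuples.
The helper name is hypothetical, and the simple split on ":" assumes
no bracketed IPv6 addresses.

```python
import configparser

# The configuration shown above, with the same placeholder hosts.
SAMPLE = """\
[ovs]
ovsdb_connection = tcp:127.0.0.1:6640
ovsdb_timeout = 10

[ovn]
ovn_nb_connection = tcp:XXX:6641,tcp:YYY:6641,tcp:ZZZ:6641
ovn_sb_connection = tcp:XXX:6642,tcp:YYY:6642,tcp:ZZZ:6642
ovn_metadata_enabled = true
"""


def sb_endpoints(ini_text):
    """Return the SB DB targets as a list of (scheme, host, port) tuples."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    endpoints = []
    for target in parser["ovn"]["ovn_sb_connection"].split(","):
        # Assumes "scheme:host:port"; a bracketed IPv6 host would break this.
        scheme, host, port = target.strip().split(":")
        endpoints.append((scheme, host, int(port)))
    return endpoints


if __name__ == "__main__":
    for endpoint in sb_endpoints(SAMPLE):
        print(endpoint)
```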
We literally have no other logs than the OVN disconnection and reconnection lines:
2021-12-08 09:11:21.030 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: waiting 2 seconds before reconnect
2021-12-08 09:11:21.033 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
2021-12-08 09:11:21.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
2021-12-08 09:11:22.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:22.035 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
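
To make the timing in these lines easier to read, here is a small
sketch that pairs each "connection closed by client" line with the
next "connected" line from the same process and reports the gap. The
regular expression is an assumption based only on the line layout
shown above, and the sample holds four of those lines. It measures how
long the reconnect took; it cannot detect the agent being stuck
afterwards.

```python
import re
from datetime import datetime

# Layout assumed from the lines above: timestamp, pid, level, logger, [-], target, message.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (?P<pid>\d+) "
    r"\w+ \S+ \[-\] (?P<target>\S+?): (?P<msg>.*)$"
)

SAMPLE = [
    "2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client",
    "2021-12-08 09:11:21.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client",
    "2021-12-08 09:11:22.035 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected",
    "2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected",
]


def reconnect_gaps(lines):
    """Return (pid, new_target, seconds) for each close/connected pair per pid."""
    closed_at = {}
    gaps = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        pid = int(match["pid"])
        if match["msg"] == "connection closed by client":
            closed_at[pid] = stamp
        elif match["msg"] == "connected" and pid in closed_at:
            seconds = (stamp - closed_at.pop(pid)).total_seconds()
            gaps.append((pid, match["target"], seconds))
    return gaps


if __name__ == "__main__":
    for pid, target, seconds in reconnect_gaps(SAMPLE):
        print(f"pid {pid} reconnected to {target} after {seconds:.3f}s")
```

On the lines above, both processes are back on the new server within
about one to two seconds, which matches the point that the reconnect
itself looks healthy in the logs.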