[Bug 1953591] Re: neutron-ovn-metadata-agent does not respond on network until restarted after SB disconnects

Thu Aug 18 19:17:35 UTC 2022

** Also affects: neutron (Ubuntu)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1953591

Title:
  neutron-ovn-metadata-agent does not respond on network until restarted
  after SB disconnects

Status in Ubuntu Cloud Archive:
  New
Status in networking-ovn:
  New
Status in neutron package in Ubuntu:
  New

Bug description:
  Hi,

  We are running OpenStack Xena in a highly dynamic environment with a
  few hundred tenant networks and projects. OVN is set up as a 3-node
  cluster hosting the northbound and southbound databases as well as
  the northd daemon.

  With a few hundred instances, we started to notice that newly
  started instances get their DHCP information via the OVN OpenFlow
  rules, but cloud-init does not apply the initial configuration. This
  is easy to recognize in the console log: the prompt at the end shows
  that the instance has neither its name nor its SSH public key
  configured.

  After some investigation we noticed that, for these instances, the
  neutron-ovn-metadata-agent was not responding on IP 169.254.169.254.
  After restarting the agent on the corresponding hypervisor and
  waiting a few seconds, a simple reboot of the instance would fix the
  issue.

  It seems the instance metadata server is not reachable specifically
  when we are starting an instance in a new network/subnet.

  We then enabled debug logging on the metadata agent and only noticed
  that the agents are disconnected from the SB DB and then reconnect
  immediately, without any additional relevant log messages.

  We first looked at our OVN cluster status and noticed that the
  cluster was flapping very frequently (changing NB, SB and northd
  leaders). We fixed that by adjusting the inactivity probe and
  election timers.

  Since then the OVN cluster has been pretty stable and only changes
  leader (and increments the term) when the SB leader voluntarily
  transfers leadership to take a snapshot of the database, every few
  hours according to the "Last election started" timer.

  It seems the neutron-ovn-metadata-agent still breaks when the OVN SB
  changes leaders (and SB is disconnected and reconnected again), and
  does not respond in new networks until restarted.

  My understanding is that the metadata agent normally creates a new
  haproxy instance in a dedicated namespace on the host where the
  instance is created, but stops doing so once it has been
  disconnected from the SB DB, even after it reconnects (almost
  instantly) to the new SB leader.
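
  One way to confirm this on an affected hypervisor is to compare the
  networks that have ports on the host with the metadata namespaces
  that actually exist. The following is only a minimal sketch: it
  assumes the agent names its namespaces "ovnmeta-<network UUID>" and
  that "ip netns list" is usable from where the script runs (with
  kolla containers this may need to be run on the host as root). The
  helper names are hypothetical.

```python
import subprocess

# Assumption: the agent names its namespaces "ovnmeta-<network UUID>".
NS_PREFIX = "ovnmeta-"


def parse_netns_names(netns_output):
    """Return the namespace names found in `ip netns list` output."""
    names = []
    for line in netns_output.splitlines():
        line = line.strip()
        if line:
            # Lines look like "name" or "name (id: 3)".
            names.append(line.split()[0])
    return names


def missing_metadata_namespaces(network_ids, netns_output):
    """Return the network IDs that have no matching metadata namespace."""
    present = set(parse_netns_names(netns_output))
    return [net for net in network_ids if NS_PREFIX + net not in present]


def list_netns():
    """Run `ip netns list` on the hypervisor (usually needs root)."""
    result = subprocess.run(
        ["ip", "netns", "list"], check=True, capture_output=True, text=True
    )
    return result.stdout
```

  Calling missing_metadata_namespaces(ids, list_netns()) with the
  UUIDs of the networks that have instances on the host would list
  the networks for which no namespace was created.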

  The real issue here is that there are no logs other than the usual
  disconnects/reconnects when this happens.

  "openstack network agent list" also reports the agent as down when
  this occurs. So far we have had no choice but to restart the
  metadata agent every 5 minutes to make this issue mostly invisible
  to our end users.
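
  A less blunt variant of that workaround would be to restart only the
  agents that are actually reported down. The sketch below parses the
  output of "openstack network agent list -f json". It assumes the
  JSON rows carry "Binary", "Host" and "Alive" keys, and that "Alive"
  is either a boolean or the ":-)" / "XXX" marker used in the table
  view; check this against the client version in use. The function
  name is hypothetical.

```python
import json


def down_metadata_agents(agent_list_json,
                         binary="neutron-ovn-metadata-agent"):
    """Return the sorted hosts whose metadata agent is not reported alive.

    agent_list_json is the text printed by
    `openstack network agent list -f json` (key names are assumptions).
    """
    down = []
    for agent in json.loads(agent_list_json):
        if agent.get("Binary") != binary:
            continue
        alive = agent.get("Alive")
        # Accept both the boolean form and the ":-)" table marker.
        if alive is True or alive == ":-)":
            continue
        down.append(agent.get("Host") or "unknown")
    return sorted(down)
```

  The returned host list could then drive a targeted restart instead
  of a blanket restart every 5 minutes.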

  Is anyone else seeing this issue?
  We are running OpenStack Xena deployed with kolla-ansible, using
  container images (CentOS 8 Stream + OpenStack source) built with
  kolla. The relevant versions are:
  Python neutron 19.0.1.dev8
  Python ovs 2.13.3
  Python ovsdbapp 1.12.0
  haproxy 1.8.27-2.el8
  openvswitch2.15-2.15.0-41.el8s

  neutron_ovn_metadata_agent.ini:
  [ovs]
  ovsdb_connection = tcp:127.0.0.1:6640
  ovsdb_timeout = 10

  [ovn]
  ovn_nb_connection = tcp:XXX:6641,tcp:YYY:6641,tcp:ZZZ:6641
  ovn_sb_connection = tcp:XXX:6642,tcp:YYY:6642,tcp:ZZZ:6642
  ovn_metadata_enabled = true

  We literally have no other logs than the OVN disconnection and reconnection lines:
  2021-12-08 09:11:21.030 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
  2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
  2021-12-08 09:11:21.032 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: waiting 2 seconds before reconnect
  2021-12-08 09:11:21.033 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: clustered database server is not cluster leader; trying another server
  2021-12-08 09:11:21.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:XXX:6642: connection closed by client
  2021-12-08 09:11:22.034 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
  2021-12-08 09:11:22.035 7 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
  2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connecting...
  2021-12-08 09:11:23.034 23 INFO ovsdbapp.backend.ovs_idl.vlog [-] tcp:ZZZ:6642: connected
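
  To correlate these reconnects with the moments the agent stops
  serving new networks, the log can be reduced to one record per
  reconnect. This is a rough sketch that assumes the line layout shown
  above (timestamp, process ID, level, logger, request context,
  remote, message); the function name and the returned tuple layout
  are hypothetical.

```python
import re
from datetime import datetime

# Matches the ovsdbapp vlog lines in the layout shown in the excerpt above.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) \w+ \S+ \[[^\]]*\] "
    r"(?P<remote>\w+:\S+:\d+): (?P<msg>.*)$"
)


def reconnect_gaps(lines):
    """Return (pid, old_remote, new_remote, seconds) per reconnect.

    A gap starts at the "not cluster leader" message of a process and
    ends at that same process's next "connected" message.
    """
    lost = {}
    gaps = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        pid = int(match["pid"])
        msg = match["msg"]
        if "not cluster leader" in msg:
            # Keep the first loss event until the process reconnects.
            lost.setdefault(pid, (ts, match["remote"]))
        elif msg == "connected" and pid in lost:
            start, old_remote = lost.pop(pid)
            seconds = (ts - start).total_seconds()
            gaps.append((pid, old_remote, match["remote"], seconds))
    return gaps
```

  Feeding the agent log to reconnect_gaps() gives the timestamps to
  compare against the creation times of instances that never received
  metadata.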

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1953591/+subscriptions