[Bug 1896734] Re: A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack

Edward Hope-Morley 1896734 at bugs.launchpad.net
Fri Aug 4 12:26:47 UTC 2023


The scenario discussed in this bug so far was in the context of
neutron-openvswitch-agent, and a neutron patch resolved that aspect,
but the problem has resurfaced in environments using neutron-l3-agent
with very large numbers of ports. The cause and fix are as follows:

msgpack patch [1] landed in version v1.0.0 and made the following changes:
   * set Unpacker() default max_buffer_size to 1M (was 0)
   * set max_buffer_size to INT_MAX i.e. 2147483647 (~2G) if max_buffer_size is set to 0
   * set individual buffer sizes to max_buffer_size

Prior to this (e.g. in v0.6.2, which ships with OpenStack Ussuri in
Ubuntu Focal) the max_buffer_size default was 0 and the individual
buffers had varying and small defaults (see
https://github.com/msgpack/msgpack-python/blob/997b524f06176aaa6bd255a046a8746e99b4f87d/msgpack/_unpacker.pyx#L368).
They were also set to a default when max_buffer_size was 0, but
*before* max_buffer_size was itself set to INT_MAX if it was also 0.

This meant that oslo.privsep in OpenStack Ussuri, which does not set a
max_buffer_size, ended up with e.g. a max_str_len of 1024 * 1024 (1M),
a limit it easily hits on hosts with large numbers of ports, with the
result that oslo.privsep crashes with a log like:

$ journalctl -D /var/log/journal/ --unit neutron-l3-agent --grep ValueError
-- Boot 8fde93ff89f443f58f2a1d3307dad1d3 --
Jul 22 12:16:58 ubuntu neutron-l3-agent[32426]: ValueError: 1050710 exceeds max_str_len(1048576)

Note that this error is only logged to the journal and not to
neutron-l3-agent.log, since it occurs in the privsep daemon.
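
To see the failure mode outside of neutron, a minimal sketch (sizes
are illustrative) is to feed an oversized string through an Unpacker
left at its defaults, which is effectively what oslo.privsep < 2.8.0
does; on msgpack 0.6.x this reproduces the ValueError above:

import msgpack

# Roughly the payload size from the log above; anything over 1 MiB
# will trip the limit on msgpack 0.6.x.
payload = msgpack.packb("x" * 1_050_710)

# No max_buffer_size given, mirroring oslo.privsep < 2.8.0: on msgpack
# 0.6.x the per-type limits then fall back to 1 MiB (max_str_len=1048576).
unpacker = msgpack.Unpacker(raw=False)
unpacker.feed(payload)
next(unpacker)  # ValueError: 1050710 exceeds max_str_len(1048576)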

Patch [2], which landed in oslo.privsep 2.8.0, sets max_buffer_size to
100M. That version of oslo.privsep is available from OpenStack Zed
onwards.
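
The essence of that fix is to construct msgpack's Unpacker with an
explicit max_buffer_size instead of relying on the library defaults,
since the per-type limits (max_str_len, max_bin_len, ...) then default
to that value. A minimal sketch of the idea (the constant and helper
names are illustrative, not the actual oslo.privsep code):

import msgpack

# Illustrative constant; patch [2] raises the limit to 100M.
_MAX_MESSAGE_SIZE = 100 * 1024 * 1024

def make_unpacker():
    # With max_buffer_size set explicitly, max_str_len and max_bin_len
    # default to the same value, so a multi-megabyte RPC reply no longer
    # trips the 1 MiB string limit seen above.
    return msgpack.Unpacker(raw=False, max_buffer_size=_MAX_MESSAGE_SIZE)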

So in order to fix this for OpenStack Ussuri upwards we need to look at
backporting patch [2] down to python3-oslo.privsep 2.1.1 (in
focal-updates). It is a very small patch, so I think it should be safe
to do.

[1] https://github.com/msgpack/msgpack-python/commit/c356035a576c38db5ca232ede07b291087f1b8b2
[2] https://github.com/openstack/oslo.privsep/commit/c223dbced7d5a8d1920fe764cbce42cf844538e1


** Also affects: neutron (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** Also affects: python-oslo.privsep (Ubuntu Jammy)
   Importance: Undecided
       Status: New

** No longer affects: python-oslo.privsep (Ubuntu Groovy)

** No longer affects: python-oslo.privsep (Ubuntu Hirsute)

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to python-oslo.privsep in Ubuntu.
https://bugs.launchpad.net/bugs/1896734

Title:
  A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
  logging is enabled (large number of registered NICs) - an RPC response
  is too large for msgpack

Status in OpenStack Neutron Open vSwitch Charm:
  Invalid
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive ussuri series:
  Fix Released
Status in Ubuntu Cloud Archive victoria series:
  Fix Released
Status in neutron:
  Fix Released
Status in oslo.privsep:
  Fix Released
Status in neutron package in Ubuntu:
  Fix Released
Status in python-oslo.privsep package in Ubuntu:
  New
Status in neutron source package in Focal:
  Fix Released
Status in python-oslo.privsep source package in Focal:
  New
Status in neutron source package in Groovy:
  Fix Released
Status in neutron source package in Hirsute:
  Fix Released
Status in neutron source package in Jammy:
  New
Status in python-oslo.privsep source package in Jammy:
  New

Bug description:
  [Impact]

  When there is a large number of netdevs registered in the kernel and
  debug logging is enabled, neutron-openvswitch-agent and the privsep
  daemon spawned by it hang, since the RPC call result sent by the
  privsep daemon over a unix socket exceeds the message sizes that the
  msgpack library can handle.

  The impact of this is that enabling debug logging on the cloud
  completely stalls neutron-openvswitch-agents and makes them "dead"
  from the Neutron server perspective.

  The issue is summarized in detail in comment #5
  https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5

  [Test Plan]

    * deploy OpenStack Train/Ussuri/Victoria
    * need at least one compute host
    * enable neutron debug logging
    * create a large number of interfaces on your compute host to produce a large 'ip addr show' output (an optional size check is sketched after this list)
    * for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
    * create a single vm
    * add floating ip
    * ping fip
    * create 20 ports and attach them to the vm
    * for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done
    * attaching ports should not result in errors
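
  As a rough, optional sanity check (not part of the SRU test plan
  itself), the following illustrative snippet packs the host's
  'ip addr show' output with msgpack to see how close it gets to the
  1 MiB pre-fix limit (privsep actually serializes pyroute2 results
  rather than raw command output, so this only gives a feel for the
  scale):

  import subprocess
  import msgpack

  out = subprocess.run(["ip", "addr", "show"],
                       capture_output=True, text=True).stdout
  packed = msgpack.packb(out)
  print(f"{len(packed)} bytes packed (pre-fix max_str_len is {1024 * 1024})")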

  [Where problems could occur]

  No problems are anticipated with this patchset.

  ========================================================================
  Old Description

  While trying to debug a different issue, I encountered a situation
  where privsep hangs in the process of handling a request from neutron-
  openvswitch-agent when debug logging is enabled (juju debug-log
  neutron-openvswitch=true):

  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
  https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12

  The issue gets reproduced reliably on all units in the environment
  where I encountered it. As a result, neutron-openvswitch-agent
  services hang while waiting for a response from the privsep daemon
  and do not progress past basic initialization. They never post any
  state back to the Neutron server and thus are marked dead by it.

  The processes, though, are shown as "active (running)" by systemd,
  which adds to the confusion since they do indeed start from systemd's
  perspective.

  systemctl --no-pager status neutron-openvswitch-agent.service
  ● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
     Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
   Main PID: 247772 (/usr/bin/python)
      Tasks: 4 (limit: 9830)
     CGroup: /system.slice/neutron-openvswitch-agent.service
             ├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
             └─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck

  --------------------------------------------------------

  An strace shows that the privsep daemon tries to receive input from fd
  3, which is the unix socket it uses to communicate with the client.
  However, this is just one thread out of many spawned by the privsep
  daemon, so it is unlikely to be the root cause (there are 65 threads
  there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/)

  # there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
  root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
   860690 100000   /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
   248272 root     /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
   363905 root     grep --color=auto -P privsep.*openvswitch

  root@node2:~# strace -f -p 248453 2>&1
  [pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
  [pid 248453] recvfrom(3,

  root@node2:~# lsof -p 248453  | grep 3u
  privsep-h 248453 root    3u  unix 0xffff8e6d8abdec00      0t0 356522977 type=STREAM

  root@node2:~# ss -pax | grep 356522977
  u_str             ESTAB               0                    0                                                         /tmp/tmp2afa3enn/privsep.sock 356522978
                                    * 356522977                              users:(("/usr/bin/python",pid=247567,fd=16))
  u_str             ESTAB               0                    0                                                                                     * 356522977
                                    * 356522978                              users:(("privsep-helper",pid=248453,fd=3))

  root@node2:~# lsof -p 247567  | grep 16u
  /usr/bin/ 247567 neutron   16u     unix 0xffff8e6d8abdb400      0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions




More information about the Ubuntu-openstack-bugs mailing list