[Bug 1896734] Re: A privsep daemon spawned by neutron-openvswitch-agent hangs when debug logging is enabled (large number of registered NICs) - an RPC response is too large for msgpack
Mohammed Naser
1896734 at bugs.launchpad.net
Wed Dec 1 06:29:35 UTC 2021
The stack trace is the following:
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: ------ Green Thread ------
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/bin/neutron-linuxbridge-agent:8 in <module>
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `sys.exit(main())`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/cmd/eventlet/plugins/linuxbridge_neutron_agent.py:21 in main
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `agent_main.main()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:1055 in main
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `manager = LinuxBridgeManager(bridge_mappings, interface_mappings)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:93 in __init__
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `self.check_vxlan_support()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:745 in check_vxlan_sup
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `if self.vxlan_ucast_supported():`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py:706 in vxlan_ucast_sup
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `ip_lib.vxlan_in_use(seg_id)):`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py:741 in vxlan_in_use
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `interfaces = get_devices_info(namespace)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/neutron/agent/linux/ip_lib.py:1379 in get_devices_info
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `devices = privileged.get_link_devices(namespace, **kwargs)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/priv_context.py:247 in _wrap
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `return self.channel.remote_call(name, args, kwargs)`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/daemon.py:214 in remote_call
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `result = self.send_recv((Message.CALL.value, name, args, kwargs))`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/comm.py:172 in send_recv
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `reply = future.result()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/oslo_privsep/comm.py:109 in result
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `self.condvar.wait()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /usr/lib/python3.6/threading.py:295 in wait
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `waiter.acquire()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/eventlet/semaphore.py:115 in acquire
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `hubs.get_hub().switch()`
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: /openstack/venvs/neutron-21.2.6/lib/python3.6/site-packages/eventlet/hubs/hub.py:298 in switch
Dec 01 07:19:38 tctrko1 neutron-linuxbridge-agent[18892]: `return self.greenlet.switch()`
In this case it looks like `privileged.get_link_devices` is the call that
needs to be optimized, and it was not covered by the fix.
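For reference, the kind of optimization that would help here is narrowing
the netlink query so that the privsep daemon does not have to serialize
every registered device into a single reply. The sketch below is only an
illustration of that idea, not the actual neutron fix; it assumes
pyroute2's IPRoute API (which the privileged ip_lib code is built on), and
the helper name get_single_link_info is made up for this example.

from pyroute2 import IPRoute

# Hypothetical sketch: query one interface by name instead of dumping the
# whole device list, so the reply sent back over the privsep channel stays
# small regardless of how many netdevs are registered.
def get_single_link_info(ifname):
    with IPRoute() as ip:
        # link_lookup() resolves the name to an index without a full dump.
        indexes = ip.link_lookup(ifname=ifname)
        if not indexes:
            return None
        # 'get' returns only the netlink message for that single index.
        (msg,) = ip.link('get', index=indexes[0])
        return {
            'index': msg['index'],
            'name': msg.get_attr('IFLA_IFNAME'),
            'state': msg.get_attr('IFLA_OPERSTATE'),
        }

print(get_single_link_info('lo'))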
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to python-oslo.privsep in Ubuntu.
https://bugs.launchpad.net/bugs/1896734
Title:
A privsep daemon spawned by neutron-openvswitch-agent hangs when debug
logging is enabled (large number of registered NICs) - an RPC response
is too large for msgpack
Status in OpenStack Neutron Open vSwitch Charm:
Invalid
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive ussuri series:
Fix Released
Status in Ubuntu Cloud Archive victoria series:
Fix Released
Status in neutron:
Fix Released
Status in oslo.privsep:
New
Status in neutron package in Ubuntu:
Fix Released
Status in python-oslo.privsep package in Ubuntu:
New
Status in neutron source package in Focal:
Fix Released
Status in python-oslo.privsep source package in Focal:
New
Status in neutron source package in Groovy:
Fix Released
Status in python-oslo.privsep source package in Groovy:
Won't Fix
Status in neutron source package in Hirsute:
Fix Released
Status in python-oslo.privsep source package in Hirsute:
New
Bug description:
[Impact]
When there is a large number of netdevs registered in the kernel and
debug logging is enabled, neutron-openvswitch-agent and the privsep
daemon spawned by it hang, since the RPC call result sent by the
privsep daemon over a unix socket exceeds the message sizes that the
msgpack library can handle.
The impact of this is that enabling debug logging on the cloud
completely stalls neutron-openvswitch-agents and makes them "dead"
from the Neutron server perspective.
The issue is summarized in detail in comment #5
https://bugs.launchpad.net/oslo.privsep/+bug/1896734/comments/5
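As a minimal, self-contained illustration of the failure class (not the
actual oslo.privsep code path; see comment #5 for the real analysis), the
snippet below shows a msgpack Unpacker with a bounded buffer rejecting a
serialized reply that exceeds its limit. The 1 MiB limit and the fake
per-device payload are arbitrary values chosen for this example.

import msgpack

# Simulate a large privsep reply describing many netdevs (values made up).
big_reply = msgpack.packb(
    [{'name': 'tap%d' % i, 'mtu': 1500} for i in range(100000)])
print('serialized reply: %d bytes' % len(big_reply))

# An unpacker whose buffer is smaller than the reply cannot accept it.
unpacker = msgpack.Unpacker(max_buffer_size=1024 * 1024)
try:
    unpacker.feed(big_reply)
    objects = list(unpacker)
except msgpack.exceptions.BufferFull:
    print('reply is larger than the unpacker buffer; the reader cannot proceed')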
[Test Plan]
* deploy Openstack Train/Ussuri/Victoria
* need at least one compute host
* enable neutron debug logging
* create a load of interfaces on your compute host to create a large 'ip addr show' output
* for ((i=0;i<400;i++)); do ip tuntap add mode tap tap-`uuidgen| cut -c1-11`; done
* create a single vm
* add floating ip
* ping fip
* create 20 ports and attach them to the vm
* for ((i=0;i<20;i++)); do id=`uuidgen`; openstack port create --network private --security-group __SG__ X-$id; openstack server add port __VM__ X-$id; done
* attaching ports should not result in errors
[Where problems could occur]
No problems are anticipated with this patchset.
========================================================================
Old Description
While trying to debug a different issue, I encountered a situation
where privsep hangs in the process of handling a request from neutron-
openvswitch-agent when debug logging is enabled (juju debug-log
neutron-openvswitch=true):
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/11
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1895652/comments/12
The issue reproduces reliably on all units in the environment where I
encountered it. As a result, neutron-openvswitch-agent
services hang while waiting for a response from the privsep daemon and
do not progress past basic initialization. They never post any state
back to the Neutron server and thus are marked dead by it.
The processes, though, are shown as "active (running)" by systemd, which
adds to the confusion since they do indeed start from systemd's
perspective.
systemctl --no-pager status neutron-openvswitch-agent.service
● neutron-openvswitch-agent.service - Openstack Neutron Open vSwitch Plugin Agent
Loaded: loaded (/lib/systemd/system/neutron-openvswitch-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2020-09-23 08:28:41 UTC; 25min ago
Main PID: 247772 (/usr/bin/python)
Tasks: 4 (limit: 9830)
CGroup: /system.slice/neutron-openvswitch-agent.service
├─247772 /usr/bin/python3 /usr/bin/neutron-openvswitch-agent --config-file=/etc/neutron/neutron.conf --config-file=/etc/neutron/plugins/ml2/openvswitch_…og
└─248272 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini -…ck
--------------------------------------------------------
An strace shows that the privsep daemon tries to receive input from fd
3, which is the unix socket it uses to communicate with the client.
However, this is just one thread out of many spawned by the privsep
daemon, so it is unlikely to be the root cause (there are 65 threads
there in total, see https://paste.ubuntu.com/p/fbGvN2P8rP/)
# there is one extra neutron-openvswitch-agent running in a LXD container which can be ignored here (there is an octavia unit on the node which has a neutron-openvswitch subordinate)
root@node2:~# ps -eo pid,user,args --sort user | grep -P 'privsep.*openvswitch'
860690 100000 /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmp910qakfk/privsep.sock
248272 root /usr/bin/python3 /usr/bin/privsep-helper --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugins/ml2/openvswitch_agent.ini --privsep_context neutron.privileged.default --privsep_sock_path /tmp/tmpcmwn7vom/privsep.sock
363905 root grep --color=auto -P privsep.*openvswitch
root@node2:~# strace -f -p 248453 2>&1
[pid 248786] futex(0x7f6a6401c1d0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248475] futex(0x7f6a6c024590, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248473] futex(0x7f6a746d9fd0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff <unfinished ...>
[pid 248453] recvfrom(3,
root@node2:~# lsof -p 248453 | grep 3u
privsep-h 248453 root 3u unix 0xffff8e6d8abdec00 0t0 356522977 type=STREAM
root@node2:~# ss -pax | grep 356522977
u_str ESTAB 0 0 /tmp/tmp2afa3enn/privsep.sock 356522978
* 356522977 users:(("/usr/bin/python",pid=247567,fd=16))
u_str ESTAB 0 0 * 356522977
* 356522978 users:(("privsep-helper",pid=248453,fd=3))
root@node2:~# lsof -p 247567 | grep 16u
/usr/bin/ 247567 neutron 16u unix 0xffff8e6d8abdb400 0t0 356522978 /tmp/tmp2afa3enn/privsep.sock type=STREAM
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-neutron-openvswitch/+bug/1896734/+subscriptions