[Bug 2083008] Re: Enabling HW offloading to Mellanox cards causing packets storm and switch overloading
Danilo Egea Gondolfo
2083008 at bugs.launchpad.net
Wed Oct 2 13:18:53 UTC 2024
** Changed in: netplan
Status: New => In Progress
** Also affects: netplan.io (Ubuntu)
Importance: Undecided
Status: New
** Also affects: netplan.io (Ubuntu Jammy)
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to netplan.io in Ubuntu.
Matching subscriptions: foundations-bugs
https://bugs.launchpad.net/bugs/2083008
Title:
Enabling HW offloading to Mellanox cards causing packets storm and
switch overloading
Status in Netplan:
In Progress
Status in netplan.io package in Ubuntu:
New
Status in netplan.io source package in Jammy:
New
Bug description:
Ubuntu 22.04.5 HWE Kernel 6.8.0-45-generic
Mellanox ConnectX-6 Dx
Firmware: 22.41.1000 and 22.39.1002
In an OpenStack environment I want to activate the OVN HW offloading feature.
To do so, the OVN charm creates a netplan file that is supposed to be
activated during system restart. This configuration should also
ensure bonding and VF LAG on the Mellanox card:
sudo cat /etc/netplan/150-charm-ovn.yaml
###############################################################################
# [ WARNING ]
# Configuration file maintained by Juju. Local changes may be overwritten.
# Config managed by ovn-chassis charm
###############################################################################
network:
  version: 2
  ethernets:
    ens4f0np0:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true
    ens4f1np1:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true
After that restart I see severe network performance degradation,
packet loss, and the bond's LACP breaking.
When I check the configuration I see that all VFs are created, but when I
check the Mellanox ports, usually one port is in switchdev mode while the
other is in legacy mode. After a second restart both ports remain in legacy mode:
First restart:
ubuntu@ps7-r1-n3:~$ sudo dmesg | grep E-Switch
[ 29.130352] mlx5_core 0000:a1:00.0: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
[ 29.926778] mlx5_core 0000:a1:00.1: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
[ 41.694789] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
[ 47.569787] mlx5_core 0000:a1:00.1: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
[ 62.082534] mlx5_core 0000:a1:00.0: E-Switch: Disable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
[ 63.706001] mlx5_core 0000:a1:00.0: E-Switch: Supported tc chains and prios offload
[ 67.174450] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(16), necvfs(0), active vports(16)
$ sudo devlink dev eswitch show pci/0000:a1:00.0
pci/0000:a1:00.0: mode switchdev inline-mode none encap-mode basic
$ sudo devlink dev eswitch show pci/0000:a1:00.1
pci/0000:a1:00.1: mode legacy inline-mode none encap-mode basic
Second restart:
ubuntu@ps7-r1-n3:~$ sudo dmesg | grep E-Switch
[ 33.273497] mlx5_core 0000:a1:00.0: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
[ 33.947879] mlx5_core 0000:a1:00.1: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
[ 39.787807] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
[ 45.648998] mlx5_core 0000:a1:00.1: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
$ sudo devlink dev eswitch show pci/0000:a1:00.0
pci/0000:a1:00.0: mode legacy inline-mode none encap-mode basic
$ sudo devlink dev eswitch show pci/0000:a1:00.1
pci/0000:a1:00.1: mode legacy inline-mode none encap-mode basic
Some logs can be found at https://pastebin.canonical.com/p/kBbjYcpBBG/
Based on the manual approach to configuring OVN HW offloading, the
following actions have to be done in sequence:
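The per-port mode in the devlink output above can also be checked
programmatically. The following is only a minimal sketch: the function
names are hypothetical, and it assumes the single-line output format
shown in this report.

```python
import re

# Matches lines such as:
#   pci/0000:a1:00.0: mode switchdev inline-mode none encap-mode basic
ESWITCH_LINE = re.compile(r"^(?P<dev>pci/[0-9a-fA-F:.]+):\s+mode\s+(?P<mode>\S+)")


def parse_eswitch_modes(output):
    """Return {device: mode} parsed from `devlink dev eswitch show` text."""
    modes = {}
    for line in output.splitlines():
        match = ESWITCH_LINE.match(line.strip())
        if match:
            modes[match.group("dev")] = match.group("mode")
    return modes


def ports_not_in_switchdev(output):
    """List devices whose reported mode is anything other than switchdev."""
    modes = parse_eswitch_modes(output)
    return sorted(dev for dev, mode in modes.items() if mode != "switchdev")


# Output copied from the "First restart" case in this report.
sample = """\
pci/0000:a1:00.0: mode switchdev inline-mode none encap-mode basic
pci/0000:a1:00.1: mode legacy inline-mode none encap-mode basic
"""
print(ports_not_in_switchdev(sample))
```

For the "First restart" output this flags only pci/0000:a1:00.1; for the
"Second restart" output it would flag both ports.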
1) bring up the virtual functions
2) unbind mlx5_core driver from VFs
3) switch PFs to switchdev mode
4) apply bond configuration
5) rebind mlx5 driver to the VFs
6) at the end apply regular network configuration
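The ordering above can be sketched as a dry-run plan. This is an
illustration only, not what netplan or the charm actually runs: the
function name is hypothetical, the VF PCI address in the example is made
up, and steps 4 and 6 are left as placeholders because they are
site-specific. Nothing is executed; the function only returns the
commands in the order listed.

```python
def build_offload_plan(pfs, num_vfs, vf_addrs):
    """Return the manual sequence as an ordered list of shell commands.

    pfs      -- mapping of PF interface name to PF PCI address
    num_vfs  -- number of VFs to create on each PF
    vf_addrs -- mapping of PF interface name to its VF PCI addresses
                (assumed to be known; hypothetical values in the example)
    """
    steps = []
    # 1) bring up the virtual functions
    for pf in pfs:
        steps.append(f"echo {num_vfs} > /sys/class/net/{pf}/device/sriov_numvfs")
    # 2) unbind the mlx5_core driver from every VF
    for pf in pfs:
        for vf in vf_addrs[pf]:
            steps.append(f"echo {vf} > /sys/bus/pci/drivers/mlx5_core/unbind")
    # 3) switch the PFs to switchdev mode (only after all VFs are unbound)
    for addr in pfs.values():
        steps.append(f"devlink dev eswitch set pci/{addr} mode switchdev")
    # 4) bond configuration: placeholder, depends on the deployment
    steps.append("# apply bond configuration here")
    # 5) rebind the driver to the VFs
    for pf in pfs:
        for vf in vf_addrs[pf]:
            steps.append(f"echo {vf} > /sys/bus/pci/drivers/mlx5_core/bind")
    # 6) regular network configuration: placeholder
    steps.append("# apply regular network configuration here")
    return steps


# Example with one PF from this report and one made-up VF address.
plan = build_offload_plan(
    {"ens4f0np0": "0000:a1:00.0"},
    16,
    {"ens4f0np0": ["0000:a1:00.2"]},
)
for step in plan:
    print(step)
```

The point of the sketch is the ordering constraint: every unbind in
step 2 comes before the devlink mode change in step 3.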
It looks like there is a problem with the sequential execution, since
the log output shows an error:
Sep 27 07:05:42 ps7-r1-n3 systemd-udevd[4395]: ens4f1np1: Process
'/usr/sbin/netplan apply --sriov-only' failed with exit code 1.
It also looks like the switch to switchdev mode is attempted while
all or some VFs are still bound to the driver. A manual attempt to set
the Mellanox card to switchdev mode also fails when the VFs are bound to
the mlx5 driver:
sudo /usr/sbin/netplan apply --sriov-only
** (process:197812): WARNING **: 09:19:49.289: Permissions for
/etc/netplan/150-charm-ovn.yaml are too open. Netplan configuration
should NOT be accessible by others.
** (process:197812): WARNING **: 09:19:49.289: Permissions for
/etc/netplan/99-juju.yaml are too open. Netplan configuration should
NOT be accessible by others.
** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
** (process:197812): WARNING **: 09:19:49.289: Permissions for
/etc/netplan/150-charm-ovn.yaml are too open. Netplan configuration
should NOT be accessible by others.
** (process:197812): WARNING **: 09:19:49.289: Permissions for
/etc/netplan/99-juju.yaml are too open. Netplan configuration should
NOT be accessible by others.
** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
Error: mlx5_core: Can't change mode, E-Switch is busy.
kernel answers: Device or resource busy
Traceback (most recent call last):
File "/usr/sbin/netplan", line 23, in <module>
netplan.main()
File "/usr/share/netplan/netplan/cli/core.py", line 56, in main
self.run_command()
File "/usr/share/netplan/netplan/cli/utils.py", line 243, in run_command
self.func()
File "/usr/share/netplan/netplan/cli/commands/apply.py", line 63, in run
self.run_command()
File "/usr/share/netplan/netplan/cli/utils.py", line 243, in run_command
self.func()
File "/usr/share/netplan/netplan/cli/commands/apply.py", line 73, in command_apply
NetplanApply.process_sriov_config(config_manager, exit_on_error)
File "/usr/share/netplan/netplan/cli/commands/apply.py", line 402, in process_sriov_config
apply_sriov_config(config_manager)
File "/usr/share/netplan/netplan/cli/sriov.py", line 456, in apply_sriov_config
pcidev.devlink_set('eswitch', 'mode', eswitch_mode)
File "/usr/share/netplan/netplan/cli/sriov.py", line 143, in devlink_set
subprocess.check_call(
File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/sbin/devlink', 'dev', 'eswitch', 'set', 'pci/0000:a1:00.0', 'mode', 'switchdev']' returned non-zero exit status 1.
When I manually unbind all VFs from the mlx5 driver, I am able to
switch to switchdev mode.
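To see which VFs are still bound before attempting the mode change, the
PF's sysfs device directory can be inspected. This is a rough sketch
under the assumption that the standard PCI sysfs layout applies (one
virtfnN entry per VF, each with a `driver` symlink only while a driver
is bound); the function name is hypothetical.

```python
import os


def bound_vfs(pf_device_dir, driver="mlx5_core"):
    """Return the virtfn* entries under a PF device directory that are
    currently bound to `driver`."""
    bound = []
    for entry in sorted(os.listdir(pf_device_dir)):
        if not entry.startswith("virtfn"):
            continue
        driver_link = os.path.join(pf_device_dir, entry, "driver")
        # An unbound device has no `driver` symlink at all.
        if os.path.islink(driver_link):
            if os.path.basename(os.readlink(driver_link)) == driver:
                bound.append(entry)
    return bound


if __name__ == "__main__":
    # Interface name taken from this report; skipped if it does not exist.
    pf_dir = "/sys/class/net/ens4f0np0/device"
    if os.path.isdir(pf_dir):
        print(bound_vfs(pf_dir))
```

An empty list would indicate that no VF of that PF is bound to
mlx5_core, which is the state in which the manual switch to switchdev
succeeded here.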
To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/2083008/+subscriptions