[Bug 2083008] Re: Enabling HW offloading to Mellanox cards causing packets storm and switch overloading

Danilo Egea Gondolfo 2083008 at bugs.launchpad.net
Wed Oct 2 13:18:53 UTC 2024


** Changed in: netplan
       Status: New => In Progress

** Also affects: netplan.io (Ubuntu)
   Importance: Undecided
       Status: New

** Also affects: netplan.io (Ubuntu Jammy)
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to netplan.io in Ubuntu.
Matching subscriptions: foundations-bugs
https://bugs.launchpad.net/bugs/2083008

Title:
  Enabling HW offloading to Mellanox cards causing packets storm and
  switch overloading

Status in Netplan:
  In Progress
Status in netplan.io package in Ubuntu:
  New
Status in netplan.io source package in Jammy:
  New

Bug description:
  Ubuntu 22.04.5 HWE Kernel 6.8.0-45-generic

  Mellanox ConnectX-6 Dx 
  Firmware: 22.41.1000 and 22.39.1002

  In an OpenStack environment I want to enable the OVN HW offloading feature.
  To do this, the OVN charm creates a netplan file that is supposed to be
  applied during system restart. This configuration should also
  enable bonding and VF LAG on the Mellanox card:

  sudo cat /etc/netplan/150-charm-ovn.yaml
  ###############################################################################
  # [ WARNING ]
  # Configuration file maintained by Juju. Local changes may be overwritten.
  # Config managed by ovn-chassis charm
  ###############################################################################
  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

  After the restart I see severe network performance degradation,
  packet loss, and the bond's LACP breaking.

  When I check the configuration I see that all VFs are created, but when
  I check the Mellanox ports, usually one port is in switchdev mode while
  the other is in legacy mode. After a second restart both ports remain in
  legacy mode:

  First restart:
  ubuntu@ps7-r1-n3:~$ sudo dmesg | grep E-Switch
  [   29.130352] mlx5_core 0000:a1:00.0: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
  [   29.926778] mlx5_core 0000:a1:00.1: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
  [   41.694789] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
  [   47.569787] mlx5_core 0000:a1:00.1: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
  [   62.082534] mlx5_core 0000:a1:00.0: E-Switch: Disable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
  [   63.706001] mlx5_core 0000:a1:00.0: E-Switch: Supported tc chains and prios offload
  [   67.174450] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(16), necvfs(0), active vports(16)

  $ sudo devlink dev eswitch show pci/0000:a1:00.0
  pci/0000:a1:00.0: mode switchdev inline-mode none encap-mode basic

  $ sudo devlink dev eswitch show pci/0000:a1:00.1
  pci/0000:a1:00.1: mode legacy inline-mode none encap-mode basic

  Second restart:
  ubuntu@ps7-r1-n3:~$ sudo dmesg | grep E-Switch
  [   33.273497] mlx5_core 0000:a1:00.0: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
  [   33.947879] mlx5_core 0000:a1:00.1: E-Switch: Total vports 66, per vport: max uc(128) max mc(2048)
  [   39.787807] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)
  [   45.648998] mlx5_core 0000:a1:00.1: E-Switch: Enable: mode(LEGACY), nvfs(16), necvfs(0), active vports(17)

  $ sudo devlink dev eswitch show pci/0000:a1:00.0
  pci/0000:a1:00.0: mode legacy inline-mode none encap-mode basic

  $ sudo devlink dev eswitch show pci/0000:a1:00.1
  pci/0000:a1:00.1: mode legacy inline-mode none encap-mode basic
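
  To compare boots more easily, the dmesg lines above can be reduced to
  the last E-Switch mode enabled per PCI device. The following is only a
  rough sketch, not part of netplan or the mlx5 driver; the function name
  last_eswitch_modes and the regular expression are assumptions based
  solely on the log format shown in this report (where OFFLOADS is the
  mode reported for the port that devlink shows as switchdev).

```python
import re

# Matches lines in the format quoted in this report, for example:
# [   67.174450] mlx5_core 0000:a1:00.0: E-Switch: Enable: mode(OFFLOADS), nvfs(16), ...
ESWITCH_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f:.]+): E-Switch: "
    r"(?P<action>Enable|Disable): mode\((?P<mode>\w+)\)"
)


def last_eswitch_modes(dmesg_text):
    """Return {pci_address: mode} taken from the last 'Enable' line per device."""
    modes = {}
    for line in dmesg_text.splitlines():
        match = ESWITCH_RE.search(line)
        # 'Disable' lines are ignored; only the most recent 'Enable' counts.
        if match and match.group("action") == "Enable":
            modes[match.group("pci")] = match.group("mode")
    return modes
```

  Feeding it the output of "dmesg | grep E-Switch" from each boot gives
  one mode per port, which makes a mixed state easy to spot.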

  Some logs can be found at https://pastebin.canonical.com/p/kBbjYcpBBG/

  Based on the manual approach to configuring OVN HW offloading, the
  following actions have to be performed in sequence:

  1) bring up the virtual functions
  2) unbind mlx5_core driver from VFs
  3) switch PFs to switchdev mode
  4) apply bond configuration
  5) rebind mlx5 driver to the VFs
  6) at the end apply regular network configuration
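
  The sequence above can be sketched as a dry-run plan for a single PF.
  This is an illustration under assumptions, not what netplan actually
  runs: the function name switchdev_plan is hypothetical, the VF PCI
  addresses must be supplied by the caller, and steps 4 and 6 are left
  as comments because they are site-specific. The sysfs paths are the
  standard kernel SR-IOV and driver bind/unbind interfaces, and the
  devlink command is the one shown in the traceback below.

```python
def switchdev_plan(pf_netdev, pf_pci, vf_pcis, num_vfs):
    """Return ordered shell commands for one PF (dry run; nothing is executed)."""
    unbind = "/sys/bus/pci/drivers/mlx5_core/unbind"
    bind = "/sys/bus/pci/drivers/mlx5_core/bind"
    plan = []
    # 1) bring up the virtual functions
    plan.append(f"echo {num_vfs} > /sys/class/net/{pf_netdev}/device/sriov_numvfs")
    # 2) unbind the mlx5_core driver from every VF
    plan += [f"echo {vf} > {unbind}" for vf in vf_pcis]
    # 3) switch the PF to switchdev mode (VFs must already be unbound)
    plan.append(f"devlink dev eswitch set pci/{pf_pci} mode switchdev")
    # 4) the bond configuration would be applied here (omitted)
    # 5) rebind the mlx5_core driver to the VFs
    plan += [f"echo {vf} > {bind}" for vf in vf_pcis]
    # 6) the regular network configuration follows (omitted)
    return plan


if __name__ == "__main__":
    # The VF addresses below are hypothetical examples.
    for command in switchdev_plan(
        "ens4f0np0", "0000:a1:00.0", ["0000:a1:00.2", "0000:a1:00.3"], 16
    ):
        print(command)
```

  Printing the plan rather than executing it keeps the ordering visible:
  the devlink call only appears after every VF has been unbound.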

  It looks like there is a problem with the sequential execution, since
  in the logs I see an error:

  Sep 27 07:05:42 ps7-r1-n3 systemd-udevd[4395]: ens4f1np1: Process
  '/usr/sbin/netplan apply --sriov-only' failed with exit code 1.

  It also looks like enabling switchdev mode is attempted while some or
  all VFs are still bound to the driver. A manual attempt to set the
  Mellanox card to switchdev mode also fails when the VFs are bound to
  the mlx5 driver:

  sudo /usr/sbin/netplan apply --sriov-only

  ** (process:197812): WARNING **: 09:19:49.289: Permissions for
  /etc/netplan/150-charm-ovn.yaml are too open. Netplan configuration
  should NOT be accessible by others.

  ** (process:197812): WARNING **: 09:19:49.289: Permissions for
  /etc/netplan/99-juju.yaml are too open. Netplan configuration should
  NOT be accessible by others.

  ** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
  See the 'Default routes' section of the documentation for more details.

  ** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
  See the 'Default routes' section of the documentation for more details.

  ** (process:197812): WARNING **: 09:19:49.289: Permissions for
  /etc/netplan/150-charm-ovn.yaml are too open. Netplan configuration
  should NOT be accessible by others.

  ** (process:197812): WARNING **: 09:19:49.289: Permissions for
  /etc/netplan/99-juju.yaml are too open. Netplan configuration should
  NOT be accessible by others.

  ** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
  See the 'Default routes' section of the documentation for more details.

  ** (process:197812): WARNING **: 09:19:49.289: `gateway4` has been deprecated, use default routes instead.
  See the 'Default routes' section of the documentation for more details.
  Error: mlx5_core: Can't change mode, E-Switch is busy.
  kernel answers: Device or resource busy
  Traceback (most recent call last):
    File "/usr/sbin/netplan", line 23, in <module>
      netplan.main()
    File "/usr/share/netplan/netplan/cli/core.py", line 56, in main
      self.run_command()
    File "/usr/share/netplan/netplan/cli/utils.py", line 243, in run_command
      self.func()
    File "/usr/share/netplan/netplan/cli/commands/apply.py", line 63, in run
      self.run_command()
    File "/usr/share/netplan/netplan/cli/utils.py", line 243, in run_command
      self.func()
    File "/usr/share/netplan/netplan/cli/commands/apply.py", line 73, in command_apply
      NetplanApply.process_sriov_config(config_manager, exit_on_error)
    File "/usr/share/netplan/netplan/cli/commands/apply.py", line 402, in process_sriov_config
      apply_sriov_config(config_manager)
    File "/usr/share/netplan/netplan/cli/sriov.py", line 456, in apply_sriov_config
      pcidev.devlink_set('eswitch', 'mode', eswitch_mode)
    File "/usr/share/netplan/netplan/cli/sriov.py", line 143, in devlink_set
      subprocess.check_call(
    File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/sbin/devlink', 'dev', 'eswitch', 'set', 'pci/0000:a1:00.0', 'mode', 'switchdev']' returned non-zero exit status 1.

  When I manually unbind all VFs from the mlx5 driver, I am able to
  switch to switchdev mode.
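
  Since the "E-Switch is busy" error appears whenever VFs are still
  bound, a pre-check before calling devlink could list them. This is a
  minimal sketch under assumptions, not netplan code: the function name
  bound_vfs and the sysfs_root parameter (added so the function can be
  pointed at a test directory) are hypothetical. It relies on the
  standard sysfs layout, where a PF directory holds virtfnN symlinks to
  its VFs and a device has a "driver" symlink only while a driver is
  bound.

```python
import os


def bound_vfs(pf_pci, sysfs_root="/sys"):
    """List PCI addresses of the PF's VFs that still have a driver bound."""
    pf_dir = os.path.join(sysfs_root, "bus", "pci", "devices", pf_pci)
    bound = []
    for entry in sorted(os.listdir(pf_dir)):
        if not entry.startswith("virtfn"):
            continue
        # Each virtfnN entry is a symlink to the VF's own device directory.
        vf_dir = os.path.realpath(os.path.join(pf_dir, entry))
        # The 'driver' symlink is present only while a driver is bound.
        if os.path.islink(os.path.join(vf_dir, "driver")):
            bound.append(os.path.basename(vf_dir))
    return bound
```

  An empty result for pci address 0000:a1:00.0 would suggest that the
  mode change can be attempted; a non-empty one names the VFs that still
  need to be unbound first.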

To manage notifications about this bug go to:
https://bugs.launchpad.net/netplan/+bug/2083008/+subscriptions



