[Bug 1988018] Re: [SRU][mlx5] Intermittent VF-LAG activation failure

Lukas Märdian 1988018 at bugs.launchpad.net
Tue Jan 7 14:48:17 UTC 2025


I tested netplan.io 0.107.1-3ubuntu0.22.04.2 from jammy-proposed, all looking good!
The intermittent failures reported in comment #12 are resolved.

First of all, the eswitch/switchdev functionality is not available on Jammy's GA 5.15 kernel,
so I upgraded to the HWE kernel and installed Netplan from proposed:

ubuntu@akis:~$ sudo devlink dev eswitch show pci/0000:86:00.0
kernel answers: Operation not supported
ubuntu@akis:~$ sudo apt-get install --install-recommends linux-generic-hwe-22.04
ubuntu@akis:~$ uname -a
Linux akis 6.8.0-51-generic #52~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Dec  9 15:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@akis:~$ sudo apt install -t jammy-proposed netplan.io
ubuntu@akis:~$ apt list *netplan*
Listing... Done
libnetplan-dev/jammy-proposed 0.107.1-3ubuntu0.22.04.2 amd64
libnetplan0/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
netplan-generator/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
netplan.io/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]
python3-netplan/jammy-proposed,now 0.107.1-3ubuntu0.22.04.2 amd64 [installed,automatic]

Next, I identified the Mellanox ConnectX-5 NIC (enp134...0/1) and confirmed that LAG is disabled:
ubuntu@akis:~$ sudo lshw -c network -businfo
Bus info          Device          Class          Description
============================================================
pci@0000:06:00.0  enp6s0          network        I210 Gigabit Network Connection
pci@0000:35:00.0  enp53s0np0      network        MT27800 Family [ConnectX-5]
pci@0000:3a:00.0  enp58s0np0      network        MT27800 Family [ConnectX-5]
pci@0000:58:00.0  enp88s0np0      network        MT27800 Family [ConnectX-5]
pci@0000:5d:00.0  enp93s0np0      network        MT27800 Family [ConnectX-5]
pci@0000:86:00.0  enp134s0f0np0   network        MT27800 Family [ConnectX-5]
pci@0000:86:00.1  enp134s0f1np1   network        MT27800 Family [ConnectX-5]
pci@0000:b8:00.0  enp184s0np0     network        MT27800 Family [ConnectX-5]
pci@0000:bd:00.0  enp189s0np0     network        MT27800 Family [ConnectX-5]
pci@0000:e1:00.0  enp225s0np0     network        MT27800 Family [ConnectX-5]
pci@0000:e6:00.0  enp230s0np0     network        MT27800 Family [ConnectX-5]
ubuntu@akis:~$ sudo cat /sys/kernel/debug/mlx5/0000:86:00.0/lag/state
disabled
ubuntu@akis:~$ sudo cat /sys/kernel/debug/mlx5/0000:86:00.1/lag/state
disabled


I changed the Netplan configuration according to the test plan above, and rebooted the system:
ubuntu@akis:~$ sudo netplan get

** (process:3196): WARNING **: 14:22:14.283: `gateway4` has been deprecated, use default routes instead.
See the 'Default routes' section of the documentation for more details.
network:
  version: 2
  ethernets:
    enp134s0f0np0:
      optional: true
      virtual-function-count: 8
      embedded-switch-mode: "switchdev"
      delay-virtual-functions-rebind: true
    enp134s0f1np1:
      optional: true
      virtual-function-count: 8
      embedded-switch-mode: "switchdev"
      delay-virtual-functions-rebind: true
[...]
  bonds:
    bond0:
      interfaces:
      - enp134s0f0np0
      - enp134s0f1np1
      parameters:
        mode: "active-backup"


After the reboot, link aggregation (LAG) is active and bond0 is up:
ubuntu@akis:~$ sudo cat /sys/kernel/debug/mlx5/0000:86:00.1/lag/state
active
ubuntu@akis:~$ sudo cat /sys/kernel/debug/mlx5/0000:86:00.0/lag/state
active
ubuntu@akis:~$ netplan status bond0
     Online state: online
    DNS Addresses: 127.0.0.53 (stub)
       DNS Search: maas

● 13: bond0 bond UP (networkd: bond0)
      MAC Address: ce:03:e9:7f:f9:9d
        Addresses: fe80::cc03:e9ff:fe7f:f99d/64 (link)
           Routes: fe80::/64 metric 256

12 inactive interfaces hidden. Use "--all" to show all.

** Tags removed: verification-needed verification-needed-jammy
** Tags added: verification-done verification-done-jammy

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to a duplicate bug report
(1977851).
https://bugs.launchpad.net/bugs/1988018

Title:
  [SRU][mlx5] Intermittent VF-LAG activation failure

Status in linux package in Ubuntu:
  Fix Committed
Status in netplan.io package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  Confirmed
Status in netplan.io source package in Jammy:
  Fix Committed
Status in linux source package in Kinetic:
  Won't Fix
Status in netplan.io source package in Kinetic:
  Won't Fix
Status in linux source package in Mantic:
  Won't Fix
Status in netplan.io source package in Mantic:
  Won't Fix
Status in linux source package in Noble:
  Fix Committed
Status in netplan.io source package in Noble:
  Fix Released

Bug description:
  [ Impact ]

  Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
  feature found on Mellanox NICs couldn't be used. Certain configuration steps
  must happen in a very specific order and Netplan fails to perform the setup correctly.

  Netplan must wait until the backend finishes adding interfaces to the Bond
  and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
  the driver.

  See also https://bugs.launchpad.net/netplan/+bug/2083008

  This problem is fixed by introducing a proper ordering in the configuration process
  and monitoring the driver state until it reports as ready (or times out).
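
  The "monitor until ready or timed out" step can be illustrated with a
  small polling loop. This is a minimal sketch, not Netplan's actual
  implementation: the function name wait_for_lag_active and the timeout
  and interval defaults are assumptions made for illustration. Only the
  "active" state value comes from this report.

```python
import time
from pathlib import Path


def wait_for_lag_active(state_file, timeout=30.0, interval=0.5):
    """Poll a LAG state file until it reads "active" or the timeout expires.

    Hypothetical helper: returns True once the file contains "active",
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            state = Path(state_file).read_text().strip()
        except OSError:
            # The file may not exist yet, or debugfs may not be readable.
            state = ""
        if state == "active":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

  With a helper like this, VFs would be rebound to the driver only after
  the call returns True for the physical function's lag/state file.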

  This fix is available on Ubuntu 24.04.

  [ Test Plan ]

  To reproduce the problem addressed by this SRU one needs to
  have access to specialized hardware (SR-IOV-capable Mellanox NICs).

  The fix for the problem described above was already verified on Ubuntu 22.04 and
  solved the problem (more details https://bugs.launchpad.net/netplan/+bug/2083008).

  We will work with Canonical's OpenStack team to verify the fix.

   * detailed instructions how to reproduce the bug

  A configuration file that looks like the one below can be used
  to test the fix.

  After booting the system with this configuration, the Mellanox driver
  should report the LAG state as "active" for all the devices.
  It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state

  network:
    version: 2
    ethernets:
      ens4f0np0:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

      ens4f1np1:
        virtual-function-count: 16
        embedded-switch-mode: switchdev
        delay-virtual-functions-rebind: true

    bonds:
      bond0:
        interfaces:
          - ens4f0np0
          - ens4f1np1
        parameters:
          mode: active-backup
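
  The debugfs check from the test plan can be scripted for several
  devices at once. The sketch below is only an illustration and not part
  of Netplan: the helper names lag_states and all_lag_active and the
  root parameter are hypothetical. Only the path layout
  /sys/kernel/debug/mlx5/{pci_addr}/lag/state is taken from this report.
  Reading debugfs normally requires root.

```python
from pathlib import Path

# Path layout as given in the test plan: <root>/<pci_addr>/lag/state
DEBUGFS_MLX5 = Path("/sys/kernel/debug/mlx5")


def lag_states(pci_addrs, root=DEBUGFS_MLX5):
    """Return {pci_addr: state}; state is None if the file is unreadable."""
    states = {}
    for addr in pci_addrs:
        try:
            states[addr] = (Path(root) / addr / "lag" / "state").read_text().strip()
        except OSError:
            # Missing file or insufficient permissions.
            states[addr] = None
    return states


def all_lag_active(pci_addrs, root=DEBUGFS_MLX5):
    """True only if at least one device was given and all report "active"."""
    states = lag_states(pci_addrs, root)
    return bool(states) and all(s == "active" for s in states.values())
```

  For the two ports used in the verification comment, this would be
  called as all_lag_active(["0000:86:00.0", "0000:86:00.1"]).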

  [ Where problems could occur ]

  These changes should affect only SR-IOV related scenarios.
  Undetected problems could cause Netplan to fail to configure the device
  and Virtual Functions wouldn't be created anymore.

  [ Other Info ]

  Related work:

  https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
  https://github.com/canonical/netplan/pull/439

  A PPA for Ubuntu 22.04 can be found here
  https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru

  ---- Original bug description ----

  During system initialization there is a specific sequence that must be
  followed to enable the use of hardware offload and VF-LAG.

  Intermittently one may see that VF-LAG initialization fails:
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
  [Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                             Make sure all VFs are unbound prior to VF LAG activation or deactivation

  This is caused by rebinding the driver before VF-LAG is ready.
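
  To spot this failure in a captured kernel log, the messages quoted
  above can be matched with a regular expression. This is a rough
  sketch: the pattern is derived only from the log lines in this report
  and may not match other kernel versions, and find_lag_failures is a
  hypothetical helper name.

```python
import re

# Matches the "Failed to create LAG" / "Failed to activate VF LAG" lines
# and captures the PCI address of the reporting mlx5_core device.
LAG_FAILURE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): .*"
    r"Failed to (?:create LAG|activate VF LAG)"
)


def find_lag_failures(lines):
    """Return PCI addresses, in first-seen order, that logged a LAG failure."""
    seen = []
    for line in lines:
        match = LAG_FAILURE.search(line)
        if match and match.group("pci") not in seen:
            seen.append(match.group("pci"))
    return seen
```

  The input could be, for example, the lines of saved dmesg output.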

  A debugfs knob has recently been added to the driver [0] and we should
  monitor it before attempting to rebind the driver:

      $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

  The kernel feature is available in the upcoming Kinetic 5.19 kernel
  and we should probably backport it to the Jammy 5.15 kernel.

  0:
  https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1988018/+subscriptions



