ACK: [SRU][F][PATCH 0/2] Kernel panic with "refcount_t: underflow" in mlx5 driver (LP: 2019011)
Kleber Souza
kleber.sacilotto.de.souza at canonical.com
Thu Jul 6 13:53:32 UTC 2023
On 28.06.23 12:04, frank.heimes at canonical.com wrote:
> BugLink: https://bugs.launchpad.net/bugs/2019011
>
> SRU Justification:
>
> [ Impact ]
>
> * The mlx5 driver causes a kernel panic with
> "refcount_t: underflow".
>
> * This issue occurs during a recovery when the PCI device
> is isolated and thus doesn't respond.
>
> [ Fix ]
>
> * This issue was fixed upstream by
> aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
> "net/mlx5: Fix handling of entry refcount when command
> is not issued to FW" (upstream since 6.1-rc1)
>
> * Applying aaf2e65cac7f requires a backport of b898ce7bccf1
> b898ce7bccf13087719c021d829dab607c175246
> "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
> not accessible" as a prerequisite (upstream since 5.10)
>
> [ Test Plan ]
>
> * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
> is needed that has Mellanox cards (RoCE Express 2.1)
> assigned, configured and enabled and that runs a 5.4
> kernel with mlx5 driver.
>
> * Create some network traffic on one of the RoCE devices
> (interface ens???[d?]) for testing (e.g. with stress-ng).
>
> * Make sure the module/driver mlx5 is loaded and in use.
>
> * Trigger a recovery (via the Support Element)
> that will render the adapter (ports) unresponsive
> for a moment and should provoke a similar situation.
>
> * Alternatively the interface itself can be removed for
> a moment and re-added again (but this may break further
> things on top).
>
> * Due to the lack of RoCE Express 2.1 hardware,
> the verification will be done by IBM.
>
> [ Where problems could occur ]
>
> * The modifications are limited to the Mellanox mlx5 driver
> only - no other network driver is affected.
>
> * The prerequisite commit (b898ce7bccf1) can have a bad
> impact on (re-)claiming pages if FW is not accessible,
> which could cause page leaks if done wrong.
> But this commit is fairly safe, since it has been
> upstream since v5.10.
>
> * The fix itself (aaf2e65cac7f) mainly changes the
> cmd_work_handler and mlx5_cmd_comp_handler functions
> so that they check mlx5_cmd_is_down (introduced by
> b898ce7bccf1) instead of pci_channel_offline.
>
> * b898ce7bccf1 had already started the switch from
> pci_channel_offline to mlx5_cmd_is_down,
> but it looks like a few cases
> (in the area of refcount increase/decrease) were missed;
> these are now covered by aaf2e65cac7f.
>
> * On top of that, refcounts are now always incremented
> and decremented, to achieve a symmetric state
> for all flows.
>
> * These changes may have an impact on all cases where the
> mlx5 device is not responding, which can happen in case
> of an offline channel, interface down, reset or recovery.
>
> [ Other Info ]
>
> * A look at the master-next git trees for jammy, kinetic
> and lunar showed that both fixes are already included,
> hence only focal is affected.
>
> Moshe Shemesh (1):
> net/mlx5: Fix handling of entry refcount when command is not issued to
> FW
>
> Saeed Mahameed (1):
> net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible
>
> drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 23 ++++++++++---------
> .../ethernet/mellanox/mlx5/core/pagealloc.c | 2 +-
> include/linux/mlx5/driver.h | 1 +
> 3 files changed, 14 insertions(+), 12 deletions(-)
>
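For illustration only, the refcount symmetry described under "Where problems could occur" can be sketched as a toy model. This is not driver code: CmdEntry, run_command and the balanced flag are hypothetical names, and the assumption (taken from the cover letter's wording) is that the failure comes from a reference being dropped on the "command not issued to FW" path without a matching increment.

```python
class CmdEntry:
    """Toy stand-in for a command entry; not the real mlx5 structure."""

    def __init__(self):
        self.refcnt = 1  # reference held by the caller

    def get(self):
        self.refcnt += 1

    def put(self):
        if self.refcnt == 0:
            # Models the situation reported as "refcount_t: underflow".
            raise RuntimeError("refcount underflow")
        self.refcnt -= 1


def run_command(cmd_is_down, balanced=True):
    """Return the final refcount after one simulated command.

    balanced=False models the assumed pre-fix behaviour: when the
    command is not issued, no extra reference is taken, but the forced
    completion path still drops one.
    """
    ent = CmdEntry()
    if cmd_is_down:
        if balanced:
            ent.get()  # take the completion reference even if not issued
        ent.put()      # forced completion drops one reference
    else:
        ent.get()      # reference for the in-flight command
        ent.put()      # normal completion drops it
    ent.put()          # caller drops its own reference
    return ent.refcnt
```

With balanced=True every get has a matching put on both paths and the count ends at zero. With balanced=False and cmd_is_down=True the last put raises, which is the asymmetry the fix is described as removing.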
LGTM.
Acked-by: Kleber Sacilotto de Souza <kleber.souza at canonical.com>
Thanks