APPLIED: [SRU][N/P][PATCH 0/1] [UBUNTU 24.04] s390/pci: Fix stale function handles in error handling (LP: #2121149)
Stefan Bader
stefan.bader at canonical.com
Fri Sep 12 10:23:39 UTC 2025
On 29/08/2025 09:49, Massimiliano Pellizzer wrote:
> BugLink: https://bugs.launchpad.net/bugs/2121149
>
> [ Impact ]
>
> s390/pci: Fix stale function handles in error handling
>
> In some error scenarios, multiple error events may be generated for the same PCI
> function before Linux has even started its automatic recovery process. In this
> case Linux may recover successfully based on the first event but then fail
> recovery when handling a subsequent event. This is because events capture the
> function handle at the time they are created. By the time a subsequent event is
> handled, the handle stored with that event is stale, and using it to reset the
> function fails.
>
> Fix this by retrieving a fresh function handle using the CLP List PCI Functions
> command and only processing events whose stored handle matches it. This
> effectively ignores error events that were captured before the latest
> disable/enable cycle. Relatedly, if the current handle is already disabled, do
> not attempt to simply reset the error state, because a re-enable is necessary
> and clearing the error state would fail.
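
For readers following the thread, the handle check described above can be
modelled roughly as follows. This is a toy Python sketch, not the kernel code:
ToyFirmware, ErrorEvent, handle_error_event and the FH_ENABLED bit position are
all hypothetical names and assumptions chosen for illustration, and
current_handle() merely stands in for the fresh-handle lookup.

```python
from dataclasses import dataclass

# Assumption: the "enabled" state is modelled as the top bit of the handle.
FH_ENABLED = 1 << 31


@dataclass
class ErrorEvent:
    fid: int  # function ID the event refers to
    fh: int   # function handle captured when the event was created


class ToyFirmware:
    """Hypothetical stand-in that tracks the current handle per function."""

    def __init__(self):
        self._handles = {}
        self._generation = 0

    def enable(self, fid):
        # Every enable hands out a new handle, so older handles become stale.
        self._generation += 1
        self._handles[fid] = FH_ENABLED | self._generation
        return self._handles[fid]

    def disable(self, fid):
        # A disabled function also gets a new handle, without the enable bit.
        self._generation += 1
        self._handles[fid] = self._generation
        return self._handles[fid]

    def current_handle(self, fid):
        # Plays the role of the fresh-handle lookup; 0 means "unknown function".
        return self._handles.get(fid, 0)


def handle_error_event(fw, event):
    """Return what the modelled handler decided to do with the event."""
    fresh = fw.current_handle(event.fid)
    if fresh == 0 or event.fh != fresh:
        # Event predates the latest disable/enable cycle: ignore it.
        return "ignored-stale"
    if not fresh & FH_ENABLED:
        # Already disabled: clearing the error state would fail, so skip
        # straight to re-enabling the function.
        fw.enable(event.fid)
        return "full-reset"
    # Normal recovery, modelled here as one disable/enable cycle.
    fw.disable(event.fid)
    fw.enable(event.fid)
    return "recovered"
```

In this model, two events captured with the same handle behave as the [ Impact ]
text describes: the first one triggers recovery, which changes the handle, and
the second one is then recognised as stale and dropped instead of failing.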
>
> [ Fix ]
>
> Backport the following commits from upstream:
> - 45537926dd2a s390/pci: Fix stale function handles in error handling
> - b97a7972b1f4 s390/pci: Do not try re-enabling load/store if device is disabled
>
> [ Test Plan ]
>
> Boot the system on an IBM Z mainframe with at least one PCI passthrough device
> available.
> Enable debug logging in order to monitor how error events are processed in real
> time.
> Trigger PCI error conditions, either through firmware error injection or by
> repeatedly disabling and re-enabling the device under load using sysfs
> interfaces.
> While the device is busy handling real traffic, such as network or crypto
> operations, watch the kernel logs to see how error events are processed.
> Verify that events carrying stale function handles are detected and ignored, and
> that recovery attempts against disabled devices escalate properly to a full
> reset.
>
> [ Regression Potential ]
>
> The fix affects how the s390 PCI error handler validates and uses function
> handles during recovery.
> A bug here could cause valid error events to be incorrectly ignored or recovery
> paths to escalate unnecessarily.
> Users may see PCI devices not recovering from transient errors, devices being
> reset or re-enabled more often than required, or even unexpected device removal.
>
>
Applied to noble:linux/master-next. For Plucky this was already applied via:
"Plucky update: upstream stable patchset 2025-09-04". Bug references
were updated to include LP: #2121149. Thanks.
-Stefan