[SRU][N/P][PATCH 0/1] [UBUNTU 24.04] s390/pci: Fix stale function handles in error handling (LP: #2121149)

Massimiliano Pellizzer massimiliano.pellizzer at canonical.com
Fri Aug 29 07:49:57 UTC 2025


BugLink: https://bugs.launchpad.net/bugs/2121149

[ Impact ]

s390/pci: Fix stale function handles in error handling

In some error scenarios, multiple error events may be generated for the same PCI
function before Linux has even started its automatic recovery process. In this
case, Linux may succeed in recovering based on the first event but then fail
recovery when handling a subsequent event. This is because each event captures
the function handle at the time it is created. By the time a subsequent event is
handled, the handle stored with that event is stale, and using it to reset the
function will fail.

Fix this by retrieving a fresh function handle using the CLP List PCI Functions
command and only processing events where the stored handle matches this handle.
This effectively ignores error events which were captured before the latest
disable/enable cycle. Relatedly, if the current handle is already disabled, do
not attempt to simply reset the error state, as a re-enable is necessary and
clearing the error state would fail.
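
The decision described above can be sketched in plain C as follows. This is only
an illustrative userspace model, not the upstream code: the names error_event,
event_action, classify_event and FH_ENABLED_BIT are hypothetical, the fresh
handle is passed in as a parameter rather than obtained via CLP, and the sketch
assumes that the enabled state is encoded in the top bit of the handle.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: the top bit of a handle marks it as enabled. */
#define FH_ENABLED_BIT 0x80000000u

/* Hypothetical outcome of looking at one queued error event. */
enum event_action {
	EVENT_IGNORE_STALE,      /* captured before the latest disable/enable */
	EVENT_FULL_RECOVERY,     /* function is disabled, a re-enable is needed */
	EVENT_RESET_ERROR_STATE, /* handle is current and enabled */
};

/* Hypothetical event record: the handle is stored when the event is created. */
struct error_event {
	uint32_t fid; /* function ID */
	uint32_t fh;  /* function handle as captured with the event */
};

static bool fh_enabled(uint32_t fh)
{
	return (fh & FH_ENABLED_BIT) != 0;
}

/*
 * current_fh stands in for the fresh handle that the fix retrieves via
 * CLP List PCI Functions before processing the event.
 */
enum event_action classify_event(const struct error_event *ev,
				 uint32_t current_fh)
{
	/* Only process events whose stored handle matches the fresh one. */
	if (ev->fh != current_fh)
		return EVENT_IGNORE_STALE;

	/* Clearing the error state would fail on a disabled function. */
	if (!fh_enabled(current_fh))
		return EVENT_FULL_RECOVERY;

	return EVENT_RESET_ERROR_STATE;
}
```

In this model, a second event that still carries the handle from before a
successful recovery no longer matches the fresh handle and is dropped instead
of triggering a failing reset.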

[ Fix ]

Backport the following commits from upstream:
- 45537926dd2a s390/pci: Fix stale function handles in error handling
- b97a7972b1f4 s390/pci: Do not try re-enabling load/store if device is disabled

[ Test Plan ]

- Boot the system on an IBM Z mainframe with at least one PCI passthrough device
  available.
- Enable debug logging in order to monitor how error events are processed in
  real time.
- Trigger PCI error conditions, either through firmware error injection or by
  repeatedly disabling and re-enabling the device under load using sysfs
  interfaces.
- While the device is busy handling real traffic, such as network or crypto
  operations, watch the kernel logs to see how error events are processed.
- Verify that events carrying stale function handles are detected and ignored,
  and that recovery attempts against disabled devices escalate properly to a
  full reset.

[ Regression Potential ]

The fix affects how the s390 PCI error handler validates and uses function
handles during recovery.
A bug here could cause valid error events to be incorrectly ignored or recovery
paths to escalate unnecessarily.
Users may see PCI devices not recovering from transient errors, devices being
reset or re-enabled more often than required, or even unexpected device removal.



