[SRU][P/N][PATCH 0/2] [UBUNTU 24.04] s390/pci: Don't abort recovery for user-space drivers (LP: #2121150)

Massimiliano Pellizzer massimiliano.pellizzer at canonical.com
Fri Aug 29 11:33:38 UTC 2025


BugLink: https://bugs.launchpad.net/bugs/2121150

[ Impact ]

s390/pci: Don't abort recovery for user-space drivers

When a PCI device under the control of a vfio-pci based user-space driver
encounters a PCI error event, the subsequent error recovery flow in the kernel
is aborted because the vfio-pci driver only implements the error_detected PCI
error handler callback. This leaves the PCI device in the error state, so the
driver must be unbound and re-bound to make the device operational again,
instead of only having to re-initialize the user-space driver.

According to the kernel documentation, implementing only the error_detected()
callback from the error handling operations should be enough for minimal
recovery support. Contrary to this, s390 so far also required the slot_reset()
and resume() callbacks to be implemented; otherwise recovery would be aborted.

Remove the requirement for the additional operations, bringing s390 in line
with the AER and EEH error recovery flows.
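
The difference between the two policies can be illustrated with a small
user-space model. This is only a sketch and not the kernel code: the types, the
names supported_strict(), supported_minimal() and minimal_driver, and the
callback signatures are simplified stand-ins invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's PCI error recovery result codes. */
typedef enum {
    MODEL_ERS_NONE,
    MODEL_ERS_CAN_RECOVER,
    MODEL_ERS_NEED_RESET,
    MODEL_ERS_DISCONNECT,
    MODEL_ERS_RECOVERED
} model_ers_t;

/* Simplified stand-in for a driver's error handler table.
 * Real kernel callbacks take device arguments; they are omitted here. */
struct model_err_handlers {
    model_ers_t (*error_detected)(void);
    model_ers_t (*slot_reset)(void);
    void (*resume)(void);
};

/* Stricter policy as described above: all three callbacks must exist. */
bool supported_strict(const struct model_err_handlers *h)
{
    return h != NULL && h->error_detected != NULL &&
           h->slot_reset != NULL && h->resume != NULL;
}

/* Relaxed policy: error_detected() alone is sufficient. */
bool supported_minimal(const struct model_err_handlers *h)
{
    return h != NULL && h->error_detected != NULL;
}

/* Hypothetical driver that, like vfio-pci in the description above,
 * provides only error_detected(). */
static model_ers_t demo_error_detected(void)
{
    return MODEL_ERS_CAN_RECOVER;
}

const struct model_err_handlers minimal_driver = {
    .error_detected = demo_error_detected,
    /* slot_reset and resume are intentionally left NULL */
};
```

In this model, supported_strict(&minimal_driver) is false (recovery would be
aborted), while supported_minimal(&minimal_driver) is true (recovery may
proceed).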

[ Fix ]

Backport the following commit from upstream:
- 62355f1f87b8 s390/pci: Allow automatic recovery with minimal driver support

[ Test Plan ]

Bind a PCI device to vfio-pci.
Start a user-space workload using the device.
Use the s390 PCI error injection interface to trigger a recoverable PCI error.
Observe kernel logs (dmesg) and confirm that the vfio-pci driver’s
error_detected() callback is invoked and recovery proceeds without abort.
After recovery, check that the device is functional again in the guest or
user-space application without requiring manual unbind/rebind.

[ Regression Potential ]

The fix affects how the s390 PCI error handler interprets missing callbacks and
the PCI_ERS_RESULT_NONE return code.
A bug here could cause the recovery flow to proceed when it should have aborted,
or to treat driver abstention as successful recovery even in faulty situations.
Users may see PCI devices reported as recovered but remaining non-functional,
recovery loops that repeatedly attempt to re-enable or reset devices, or devices
silently failing I/O without triggering the expected operator intervention.
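
One way to picture the risk around PCI_ERS_RESULT_NONE is the following
user-space model. It is a sketch under assumptions, not the patched kernel code:
the function names, the enum, and the exact set of result codes that let
recovery continue are hypothetical and chosen only to illustrate the note above.

```c
#include <stdbool.h>

/* Local stand-ins named after the kernel's PCI_ERS_RESULT_* codes. */
typedef enum {
    MODEL_RES_NONE,        /* driver abstains / has no opinion */
    MODEL_RES_CAN_RECOVER,
    MODEL_RES_NEED_RESET,
    MODEL_RES_DISCONNECT,
    MODEL_RES_RECOVERED
} model_result_t;

/* Assumed earlier behaviour: anything but an explicit "go on" aborts,
 * so an abstaining driver (NONE) ends the recovery flow. */
bool model_aborts_before(model_result_t r)
{
    switch (r) {
    case MODEL_RES_CAN_RECOVER:
    case MODEL_RES_NEED_RESET:
    case MODEL_RES_RECOVERED:
        return false;
    default:
        return true;
    }
}

/* Assumed relaxed behaviour: abstention no longer aborts recovery.
 * If this classification were wrong, a device could be reported as
 * recovered although nobody actually restored it. */
bool model_aborts_after(model_result_t r)
{
    return r != MODEL_RES_NONE && model_aborts_before(r);
}
```

In this model only the handling of the abstention case changes: an explicit
disconnect request still aborts under both variants.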



