ACK: [SRU][J][PATCH 0/1] [UBUNTU 22.04] s390/pci: Handle PCI error codes other than 0x3a (LP: #2120344)

Sat Sep 6 00:56:45 UTC 2025

On Fri, Aug 29, 2025 at 02:37:00PM +0200, Massimiliano Pellizzer wrote:
> BugLink: https://bugs.launchpad.net/bugs/2120344
> 
> [ Impact ]
> 
> s390/pci: Handle PCI error codes other than 0x3a
> 
> The Linux implementation of PCI error recovery for s390 was based on the
> understanding that firmware error recovery is a two step process with an
> optional initial error event to indicate the cause of the error if known
> followed by either error event 0x3A (Success) or 0x3B (Failure) to
> indicate whether firmware was able to recover. While this has been the
> case in testing and in the error cases seen in the wild, it turns out this
> is not correct. Instead, firmware only generates 0x3A for some error and
> service scenarios and expects the OS to perform recovery for all PCI
> event codes except for those indicating permanent error (0x3B, 0x40)
> and those indicating errors on the function measurement block (0x2A,
> 0x2B, 0x2C). Align Linux behavior with these expectations.
> 
> [ Fix ]
> 
> Backport the following commit from upstream:
> - 3cd03ea57e8e s390/pci: Handle PCI error codes other than 0x3a
> 
> [ Test Plan ]
> 
> Boot the kernel on an IBM Z system with PCI devices bound to drivers that
> support error recovery and inject different PCI error codes.
> For recoverable errors, check that the device auto-recovers and remains usable.
> For 0x2A–0x2C errors, verify they are ignored with no impact.
> For permanent failure codes (0x3B, 0x40), confirm the device is correctly marked
> as failed.
> Use dmesg to validate that recovery paths are triggered as expected and that
> unaffected devices continue normal operation.
> 
> [ Regression Potential ]
> 
> The fix affects how the s390 PCI error handler interprets and reacts to PCI
> event codes. A bug here could cause error codes to be misclassified, leading to
> incorrect recovery actions.
> Users may see devices being driven into recovery even for irrelevant events,
> devices left permanently failed when they should have recovered, or recovery
> attempts being triggered repeatedly in a loop.
> 
> 
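For reference, the classification described under [ Impact ] can be
sketched roughly as follows. This is only an illustration of the three
buckets of event codes named in the cover letter, not the code from the
upstream commit; classify_pec() and the pec_action enum are hypothetical
names made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical action buckets, named for illustration only. */
enum pec_action {
	PEC_IGNORE,        /* function measurement block errors */
	PEC_PERM_FAILURE,  /* permanent error, no recovery attempted */
	PEC_RECOVER        /* everything else: OS performs recovery */
};

/*
 * Map a PCI event code to an action, following the cover letter:
 * 0x2A, 0x2B, 0x2C are ignored, 0x3B and 0x40 are permanent failures,
 * and all remaining codes (including 0x3A) lead to OS-driven recovery.
 */
static enum pec_action classify_pec(uint16_t pec)
{
	switch (pec) {
	case 0x2a:
	case 0x2b:
	case 0x2c:
		return PEC_IGNORE;
	case 0x3b:
	case 0x40:
		return PEC_PERM_FAILURE;
	default:
		return PEC_RECOVER;
	}
}
```

The point of the default branch is that recovery is no longer tied to
0x3A alone, which matches the three cases exercised in the test plan.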

Acked-by: Tim Whisonant <tim.whisonant at canonical.com>