NAK: [PATCH 0/2][SRU][B/E/F/U/OEM-OSP1-B/OEM-5.6] PCI: Avoid FLR for AMD Matisse/Starship HD Audio & USB 3.0

Mon Jun 1 16:25:00 UTC 2020

On Fri, May 29, 2020 at 06:02:50PM +0800, You-Sheng Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/1865988
> 
> [Impact]
> 
> Devices affected:
> 
> * [1022:148c] USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Starship
>   USB 3.0 Host Controller
> * [1022:149c] USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse
>   USB 3.0 Host Controller
> * [1022:1487] Audio device [0403]: Advanced Micro Devices, Inc. [AMD]
>   Starship/Matisse HD Audio Controller
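
To see quickly whether a machine has any of the three functions listed
above, a sysfs scan along these lines could be used. This is only a
sketch: the function name find_affected and its optional root-directory
argument are made up for illustration and are not part of the patches.
It assumes the usual sysfs layout, where each PCI function directory
has `vendor` and `device` files holding values such as 0x1022.

```shell
#!/bin/sh
# find_affected [ROOT]
# Print "<BDF> <vendor>:<device>" for every PCI function under ROOT
# (default: /sys/bus/pci/devices) whose ID is one of the three listed
# in the cover letter. The function name and ROOT argument are
# hypothetical, used only for this illustration.
find_affected() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries without readable ID files (or an unmatched glob).
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        id="$(cat "$dev/vendor"):$(cat "$dev/device")"
        case "$id" in
            0x1022:0x148c|0x1022:0x149c|0x1022:0x1487)
                echo "${dev##*/} $id"
                ;;
        esac
    done
}
```

Calling find_affected with no argument scans the running system; an
empty result means none of the listed IDs are present.
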
> 
> Despite advertising FLReset device capabilities, performing a function level
> reset of any of these devices causes the system to lock up. This is a
> particular issue where these devices appear in their own IOMMU groups and are
> well suited to VFIO passthrough.
> 
> Issue was introduced in AMD's "AGESA Combo-AM4 1.0.0.4 Patch B" microcode
> update, and affects dozens of motherboard models across various vendors.
> 
> Additional discussion of this issue:
> https://www.reddit.com/r/VFIO/comments/eba5mh/workaround_patch_for_passing_through_usb_and/
> 
> [Fix]
> 
> Two commits currently landed in linux-pci pci/virtualization:
> * 0d14f06cd665 PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
> * 5727043c73fd PCI: Avoid FLR for AMD Starship USB 3.0
> 
> [Test Case]
> 
> Perform the test on an impacted system:
> 
> * B350, B450, X370, X470, X570 motherboards (practically anything with an AM4
>   socket);
> * Ryzen 3000-series CPU (2000-series possibly also affected);
> * BIOS/UEFI firmware that includes "AGESA Combo-AM4 1.0.0.4 Patch B" (check
>   vendor release notes)
> 
> On such a system, where '0000:10:00.3' is the USB controller '1022:149c', issue
> a reset command:
> 
>   $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:10\:00.3/reset
> 
> Impacted systems will not return successfully and become unstable, requiring a
> reboot. `/var/log/syslog` will show something resembling the following:
> 
>   xhci_hcd 0000:10:00.3: not ready 1023ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 2047ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 4095ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 8191ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 16383ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 32767ms after FLR; waiting
>   xhci_hcd 0000:10:00.3: not ready 65535ms after FLR; giving up
>   clocksource: timekeeping watchdog on CPU14: Marking clocksource 'tsc' as unstable because the skew is too large:
>   clocksource: 'hpet' wd_now: f63fcfe wd_last: d468894 mask: ffffffff
>   clocksource: 'tsc' cs_now: 60e67e17758 cs_last: 60d2a81ce24 mask: ffffffffffffffff
>   tsc: Marking TSC unstable due to clocksource watchdog
>   TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
>   sched_clock: Marking unstable (1817664630139, 314261908)<-(1817981099530, -2209419)
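
When verifying the test case, the log can be checked for the final
"giving up" message shown above with a small helper like the one below.
Again a sketch only: flr_gave_up is a hypothetical name, and the default
path assumes the usual Ubuntu syslog location.

```shell
#!/bin/sh
# flr_gave_up [LOGFILE]
# Print any "not ready ...ms after FLR; giving up" lines found in
# LOGFILE (default: /var/log/syslog). Returns 0 if at least one such
# line exists, non-zero otherwise. Hypothetical helper for illustration.
flr_gave_up() {
    grep -E 'not ready [0-9]+ms after FLR; giving up' "${1:-/var/log/syslog}"
}
```

On a kernel carrying the two patches, the reset should no longer use
FLR for these devices, so no such lines would be expected after
running the test.
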
> 
> [Regression Risk]
> Low. These two patches affect only systems with a device that needs the fix.

Could you please update the cherry-picked lines to indicate the full URI
to the repo these patches came from? We allow the shorthand linux-next in
cases where that is well known, but in other cases it is preferable to
have the full URI so there is no ambiguity.

Thanks,
Seth