NAK: [PATCH 0/2][SRU][B/E/F/U/OEM-OSP1-B/OEM-5.6] PCI: Avoid FLR for AMD Matisse/Starship HD Audio & USB 3.0
Seth Forshee
seth.forshee at canonical.com
Mon Jun 1 16:25:00 UTC 2020
On Fri, May 29, 2020 at 06:02:50PM +0800, You-Sheng Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/1865988
>
> [Impact]
>
> Devices affected:
>
> * [1022:148c] USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Starship
> USB 3.0 Host Controller
> * [1022:149c] USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse
> USB 3.0 Host Controller
> * [1022:1487] Audio device [0403]: Advanced Micro Devices, Inc. [AMD]
> Starship/Matisse HD Audio Controller
>
> Despite advertising FLReset device capabilities, performing a function level
> reset of any of these devices causes the system to lock up. This is a
> particular issue where these devices appear in their own IOMMU groups and are
> well suited to VFIO passthrough.
>
> Issue was introduced in AMD's "AGESA Combo-AM4 1.0.0.4 Patch B" microcode
> update, and affects dozens of motherboard models across various vendors.
>
> Additional discussion of this issue:
> https://www.reddit.com/r/VFIO/comments/eba5mh/workaround_patch_for_passing_through_usb_and/
>
> [Fix]
>
> Two commits currently landed in linux-pci pci/virtualization:
> * 0d14f06cd665 PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
> * 5727043c73fd PCI: Avoid FLR for AMD Starship USB 3.0
>
> [Test Case]
>
> Perform the test on an impacted system:
>
> * B350, B450, X370, X470, X570 motherboards (practically anything with an AM4
> socket);
> * Ryzen 3000-series CPU (2000-series possibly also affected);
> * BIOS/UEFI firmware that includes "AGESA Combo-AM4 1.0.0.4 Patch B" (check
> vendor release notes)
>
> In the above case where '0000:10:00.3' is the USB controller '1022:149c', issue
> a reset command:
>
> $ echo 1 | sudo tee /sys/bus/pci/devices/0000\:10\:00.3/reset
>
> Impacted systems will not return successfully and become unstable, requiring a
> reboot. `/var/log/syslog` will show something resembling the following:
>
> xhci_hcd 0000:10:00.3: not ready 1023ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 2047ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 4095ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 8191ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 16383ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 32767ms after FLR; waiting
> xhci_hcd 0000:10:00.3: not ready 65535ms after FLR; giving up
> clocksource: timekeeping watchdog on CPU14: Marking clocksource 'tsc' as unstable because the skew is too large:
> clocksource: 'hpet' wd_now: f63fcfe wd_last: d468894 mask: ffffffff
> clocksource: 'tsc' cs_now: 60e67e17758 cs_last: 60d2a81ce24 mask: ffffffffffffffff
> tsc: Marking TSC unstable due to clocksource watchdog
> TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> sched_clock: Marking unstable (1817664630139, 314261908)<-(1817981099530, -2209419)
>
> [Regression Risk]
> Low. These two patches affect only systems with a device that needs the fix.
Could you please update the cherry-picked lines to indicate the full URI
to the repo these patches came from? We allow the shorthand linux-next in
cases where that is well known, but in other cases it is preferable to
have the full URI so there is no ambiguity.
Thanks,
Seth
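
For the test case quoted above, a small shell helper can list the PCI
addresses whose FLR gave up. This is only a sketch: the function name
flr_timeouts and the idea of working from a saved log file are
hypothetical, and it assumes the kernel message format matches the
excerpt in the quoted report.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patches under review): print the
# PCI address of every device whose function level reset timed out.
# Assumes log lines of the form shown in the quoted report, e.g.
#   xhci_hcd 0000:10:00.3: not ready 65535ms after FLR; giving up
flr_timeouts() {
    # $1: path to a saved log file (for example, captured dmesg output)
    sed -nE 's/.*([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): not ready [0-9]+ms after FLR; giving up.*/\1/p' "$1" |
        sort -u
}
```

On an unpatched, impacted system one would expect the address of the
reset device to be listed; with the two quirks applied the list should
stay empty for the affected devices, since FLR is no longer attempted.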