APPLIED: [SRU][Q:gcp][PATCH 0/5] Fix for attestation failure due to vTPM device error

Tim Whisonant tim.whisonant at canonical.com
Fri Dec 19 21:43:22 UTC 2025


On Wed, Dec 03, 2025 at 04:56:27PM -0800, Tim Whisonant wrote:
> BugLink: https://bugs.launchpad.net/bugs/2133834
> 
> SRU Justification:
> 
> [Impact]
> 
> x86/kvm: Override default caching mode for SEV-SNP and TDX
> 
> AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not
> advertised in CPUID or it cannot be programmed (on TDX, due to #VE on
> CR0.CD clear).
> 
> This results in guests using uncached mappings where they shouldn't, and
> pmd/pud_set_huge() failures due to non-uniform memory type reported by
> mtrr_type_lookup().
> 
> Override MTRR state, making it WB by default as the kernel does for
> Hyper-V guests.
> 
> 
> x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
> 
> Rename the helper to better reflect its function.
> 
> 
> x86/kvm: Make kvm_async_pf_task_wake() a local static helper
> 
> Make kvm_async_pf_task_wake() static and drop its export, as the symbol is
> only referenced from within kvm.c.
> 
> No functional change intended.
> 
> 
> x86/kvm: Force legacy PCI hole to UC when overriding MTRRs for TDX/SNP
> 
> When running as an SNP or TDX guest under KVM, force the legacy PCI hole,
> i.e. memory between Top of Lower Usable DRAM and 4GiB, to be mapped as UC
> via a forced variable MTRR range.
> 
> In most KVM-based setups, legacy devices such as the HPET and TPM are
> enumerated via ACPI.  ACPI enumeration includes a Memory32Fixed entry, and
> optionally a SystemMemory descriptor for an OperationRegion, e.g. if the
> device needs to be accessed via a Control Method.
> 
> If a SystemMemory entry is present, then the kernel's ACPI driver will
> auto-ioremap the region so that it can be accessed at will.  However, the
> ACPI spec doesn't provide a way to enumerate the memory type of
> SystemMemory regions, i.e. there's no way to tell software that a region
> must be mapped as UC vs. WB, etc.  As a result, Linux's ACPI driver always
> maps SystemMemory regions using ioremap_cache(), i.e. as WB on x86.
> 
> The dedicated device drivers however, e.g. the HPET driver and TPM driver,
> want to map their associated memory as UC or WC, as accessing PCI devices
> using WB is unsupported.
> 
> On bare metal and non-CoCo systems, the conflicting requirements "work" as
> configures the PCI hole (and other device memory) to be UC in the MTRRs.
> So even though the ACPI mappings request WB, they are forced to UC- in the
> kernel's tracking due to the kernel properly handling the MTRR overrides,
> and thus are compatible with the drivers' requested WC/UC-.
> 
> With forced WB MTRRs on SNP and TDX guests, the ACPI mappings get their
> requested WB if they are established before the dedicated driver code
> attempts to initialize the device.  E.g. if acpi_init() runs before the
> corresponding device driver is probed, ACPI's WB mapping will "win", and
> the driver's ioremap() will fail because the existing WB mapping isn't
> compatible with the requested WC/UC-.
> 
> E.g. when a TPM is emulated by the hypervisor (ignoring the security
> implications of relying on what is allegedly an untrusted entity to store
> measurements), the TPM driver will request UC and fail:
> 
>   [  1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
>   [  1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12
> 
> Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
> memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).
> E.g. tracing mapping requests for TPM TIS yields:
> 
>  Mapping TPM TIS with req_type = 0
>  WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460
>  Modules linked in:
>  CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.16.0-rc7+ #2 VOLUNTARY
>  Tainted: [W]=WARN
>  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025
>  RIP: 0010:memtype_reserve+0x2ab/0x460
>   __ioremap_caller+0x16d/0x3d0
>   ioremap_cache+0x17/0x30
>   x86_acpi_os_ioremap+0xe/0x20
>   acpi_os_map_iomem+0x1f3/0x240
>   acpi_os_map_memory+0xe/0x20
>   acpi_ex_system_memory_space_handler+0x273/0x440
>   acpi_ev_address_space_dispatch+0x176/0x4c0
>   acpi_ex_access_region+0x2ad/0x530
>   acpi_ex_field_datum_io+0xa2/0x4f0
>   acpi_ex_extract_from_field+0x296/0x3e0
>   acpi_ex_read_data_from_field+0xd1/0x460
>   acpi_ex_resolve_node_to_value+0x2ee/0x530
>   acpi_ex_resolve_to_value+0x1f2/0x540
>   acpi_ds_evaluate_name_path+0x11b/0x190
>   acpi_ds_exec_end_op+0x456/0x960
>   acpi_ps_parse_loop+0x27a/0xa50
>   acpi_ps_parse_aml+0x226/0x600
>   acpi_ps_execute_method+0x172/0x3e0
>   acpi_ns_evaluate+0x175/0x5f0
>   acpi_evaluate_object+0x213/0x490
>   acpi_evaluate_integer+0x6d/0x140
>   acpi_bus_get_status+0x93/0x150
>   acpi_add_single_object+0x43a/0x7c0
>   acpi_bus_check_add+0x149/0x3a0
>   acpi_bus_check_add_1+0x16/0x30
>   acpi_ns_walk_namespace+0x22c/0x360
>   acpi_walk_namespace+0x15c/0x170
>   acpi_bus_scan+0x1dd/0x200
>   acpi_scan_init+0xe5/0x2b0
>   acpi_init+0x264/0x5b0
>   do_one_initcall+0x5a/0x310
>   kernel_init_freeable+0x34f/0x4f0
>   kernel_init+0x1b/0x200
>   ret_from_fork+0x186/0x1b0
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
> 
> The above traces are from a Google-VMM based VM, but the same behavior
> happens with a QEMU based VM that is modified to add a SystemMemory range
> for the TPM TIS address space.
> 
> The only reason this doesn't cause problems for HPET, which appears to
> require a SystemMemory region, is that HPET gets special treatment via
> x86_init.timers.timer_init(), and so gets a chance to create its UC-
> mapping before acpi_init() clobbers things.  Disabling the early call to
> hpet_time_init() yields the same behavior for HPET:
> 
>   [  0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0
> 
> Hack around the ACPI gap by forcing the legacy PCI hole to UC when
> overriding the (virtual) MTRRs for CoCo guests, so that ioremap handling
> of MTRRs naturally kicks in and forces the ACPI mappings to be UC.
> 
> Note, the requested/mapped memtype doesn't actually matter in terms of
> accessing the device.  In practically every setup, legacy PCI devices are
> emulated by the hypervisor, and accesses are intercepted and handled as
> emulated MMIO, i.e. never access physical memory and thus don't have an
> effective memtype.
> 
> Even in a theoretical setup where such devices are passed through by the
> host, i.e. point at real MMIO memory, it is KVM's (as the hypervisor)
> responsibility to force the memory to be WC/UC, e.g. via EPT memtype
> under TDX or real hardware MTRRs under SNP.  Not doing so cannot work,
> and the hypervisor is highly motivated to do the right thing as letting
> the guest access hardware MMIO with WB would likely result in a variety
> of fatal #MCs.
> 
> In other words, forcing the range to be UC is all about coercing the
> kernel's tracking into thinking that it has established UC mappings, so
> that the ioremap code doesn't reject mappings from e.g. the TPM driver and
> thus prevent the driver from loading and the device from functioning.
> 
> Note #2, relying on guest firmware to handle this scenario, e.g. by setting
> virtual MTRRs and then consuming them in Linux, is not a viable option, as
> the virtual MTRR state is managed by the untrusted hypervisor, and because
> OVMF at least has stopped programming virtual MTRRs when running as a TDX
> guest.
> 
> 
> x86/kvm: Prefer native qspinlock for dedicated vCPUs irrespective of PV_UNHALT
> 
> The commit b2798ba0b876 ("KVM: X86: Choose qspinlock when dedicated
> physical CPUs are available") states that when PV_DEDICATED=1
> (vCPU has dedicated pCPU), qspinlock should be preferred regardless of
> PV_UNHALT.  However, the current implementation doesn't reflect this: when
> PV_UNHALT=0, we still use virt_spin_lock() even with dedicated pCPUs.
> 
> This is suboptimal because:
> 1. Native qspinlocks should outperform virt_spin_lock() for dedicated
>    vCPUs irrespective of HALT exiting
> 2. virt_spin_lock() should only be preferred when vCPUs may be preempted
>    (non-dedicated case)
> 
> So reorder the PV spinlock checks to:
> 1. First handle dedicated pCPU case (disable virt_spin_lock_key)
> 2. Second check single CPU, and nopvspin configuration
> 3. Only then check PV_UNHALT support
> 
> This ensures we always use native qspinlock for dedicated vCPUs, delivering
> solid performance gains at high contention levels.
> 
> [Fix]
> 
> Questing: 5 clean cherry-picks from upstream main.
> 
> [Test Plan]
> 
> Compile and boot tested, then manually confirmed that syslog
> did not contain the error message "TPM fail - No TPM chip found."
> 
> [Where problems could occur]
> 
> The likelihood of regressions is considered low, as each patch was
> a clean cherry-pick from upstream. Should problems arise from the
> patches, they would likely manifest as vTPM failures or system hangs.
> 
> Kirill A. Shutemov (2):
>   x86/kvm: Override default caching mode for SEV-SNP and TDX
>   x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
> 
> Li RongQing (1):
>   x86/kvm: Prefer native qspinlock for dedicated vCPUs irrespective of
>     PV_UNHALT
> 
> Sean Christopherson (2):
>   x86/kvm: Make kvm_async_pf_task_wake() a local static helper
>   x86/kvm: Force legacy PCI hole to UC when overriding MTRRs for TDX/SNP
> 
>  arch/x86/hyperv/ivm.c              |  2 +-
>  arch/x86/include/asm/kvm_para.h    |  2 --
>  arch/x86/include/asm/mtrr.h        | 10 +++----
>  arch/x86/kernel/cpu/mtrr/generic.c |  6 ++--
>  arch/x86/kernel/cpu/mtrr/mtrr.c    |  2 +-
>  arch/x86/kernel/kvm.c              | 44 ++++++++++++++++++++++--------
>  arch/x86/xen/enlighten_pv.c        |  4 +--
>  7 files changed, 44 insertions(+), 26 deletions(-)
> 
> -- 
> 2.43.0
> 

Applied to questing:linux-gcp master-next branch.


