[SRU][P][PATCH 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
Mitchell Augustin
mitchell.augustin at canonical.com
Wed May 28 22:10:45 UTC 2025
BugLink: https://bugs.launchpad.net/bugs/2111861
SRU Justification:
[ Impact ]
Due to an inefficiency in the way older host kernels manage pfnmaps for guest VM memory ranges[1], guests with large-BAR GPUs passed through experience very long (multiple-minute) initialization times when the MMIO window advertised by OVMF is sized sufficiently for the passed-through BARs (i.e., the correct OVMF behavior).
We have already integrated a partial efficiency improvement [2], transparent to the user, into 6.8+ kernels, as well as an OVMF-based approach [3] that lets the user force Jammy-like, faster boot speeds via fw_ctl. The patch series outlined in this report, however, is the full fix for the underlying cause of the issue on kernels that support huge pfnmaps.
With this series [0] applied to both the host and guest of an impacted system, BAR initialization times are reduced substantially. In the commonly achieved optimal case, the number of pfn lookups drops by a factor of 256K. On a local test system, the ~1s overhead of DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M page-sized operations reduced to 32 pud-sized operations).
[ Test Plan ]
On a machine with GPUs that have sufficiently large BARs:
1. Create a virtual machine with 4 GPUs passed through and CPU host-passthrough enabled. (We use DGX H100 or A100, typically)
2. Observe that, on an unaltered 6.14 kernel, the VM boot time exceeds 5 minutes.
3. After applying this series to both the host and guest kernels (applied in ppa:mitchellaugustin/pcihugepfnmapfixes-plucky-kernel [4]), boot the guest and observe that the VM boot time is under 30 seconds, with the BAR initialization steps occurring significantly faster in dmesg output.
I have verified this with the series applied to both the plucky kernel and the linux-nvidia-6.14 kernel on a DGX H100.
[ Fix ]
This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12. When we insert pfnmaps using pud and pmd
mappings, we can later take advantage of the knowledge of the mapping
level page mask to iterate on the relevant mapping stride.
[ Where problems could occur ]
I do not expect any regressions. The only callers of ABIs changed by this series are also adjusted within this series.
[ Additional Context ]
[0]: https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/
[1]: https://lore.kernel.org/all/CAHTA-uYp07FgM6T1OZQKqAdSA5JrZo0ReNEyZgQZub4mDRrV5w@mail.gmail.com/
[2]: https://bugs.launchpad.net/bugs/2097389
[3]: https://bugs.launchpad.net/bugs/2101903
[4]: https://launchpad.net/~mitchellaugustin/+archive/ubuntu/pcihugepfnmapfixes-plucky-kernel/
Alex Williamson (6):
mm: Provide address mask in struct follow_pfnmap_args
vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
vfio/type1: Catch zero from pin_user_pages_remote()
vfio/type1: Use vfio_batch for vaddr_get_pfns()
vfio/type1: Use consistent types for page counts
vfio/type1: Use mapping page mask for pfnmaps
drivers/vfio/vfio_iommu_type1.c | 123 ++++++++++++++++++++------------
include/linux/mm.h | 2 +
mm/memory.c | 1 +
3 files changed, 80 insertions(+), 46 deletions(-)
--
2.43.0