[PATCH 0/2][SRU][P] PyTorch reports incorrect GPU memory causing "HIP Out of Memory" errors

Mon Aug 18 10:28:26 UTC 2025

BugLink: https://bugs.launchpad.net/bugs/2120454

[ Impact ]

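PyTorch running on an APU may report "HIP out of memory" errors when
half of system memory is reserved as dedicated VRAM, because the compute
memory pool is still sized from GTT instead of the larger VRAM.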

This can be seen in the `rocminfo` output on an AMD Strix Halo
development board with 64 GB of RAM:

1. With the default setup, 512 MB of dedicated VRAM:

```
$ sudo journalctl -b -1 | grep -B1 GTT
Aug 15 18:47:46 test kernel: [drm] amdgpu: 512M of VRAM memory ready
Aug 15 18:47:46 test kernel: [drm] amdgpu: 31822M of GTT memory ready.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi       2.1Gi       3.5Gi        44Mi        57Gi        60Gi
Swap:             0B          0B          0B

*******
Agent 2
*******
 Name: gfx1151
 Uuid: GPU-XX
 Marketing Name: AMD Radeon Graphics
...
  Pool Info:
   Pool 1
     Segment: GLOBAL; FLAGS: COARSE GRAINED
     Size: 32586284(0x1f13a2c) KB
     Allocatable: TRUE
     Alloc Granule: 4KB
     Alloc Recommended Granule:2048KB
     Alloc Alignment: 4KB
     Accessible by all: FALSE
```
Pool 1 is 32586284 KB, or about 31822 MB, which matches the GTT size.

2. With dedicated VRAM set to 32 GB:

```
$ sudo dmesg | grep -B1 GTT
[ 3.640984] [drm] amdgpu: 32768M of VRAM memory ready
[ 3.640986] [drm] amdgpu: 15970M of GTT memory ready.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       1.6Gi        28Gi        44Mi       1.3Gi        29Gi
Swap:             0B          0B          0B

*******
Agent 2
*******
 Name: gfx1151
 Uuid: GPU-XX
 Marketing Name: AMD Radeon Graphics
...
  Pool Info:
   Pool 1
     Segment: GLOBAL; FLAGS: COARSE GRAINED
     Size: 16353840(0xf98a30) KB
     Allocatable: TRUE
     Alloc Granule: 4KB
     Alloc Recommended Granule:2048KB
     Alloc Alignment: 4KB
     Accessible by all: FALSE
```
In this case, although 32768 MB = 33554432 KB of VRAM is available,
pool 1 is still sized from GTT: 16353840 KB, or about 15970 MB.
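
The comparison above between the `rocminfo` pool size (in KB) and the
kernel log figures (in MB) can be reproduced with a small helper. This is
only a sketch: the function names are hypothetical, it expects the
`rocminfo` excerpt for the GPU agent as input, and it assumes the kernel
log truncates sizes to whole megabytes of 1024 KB.

```python
import re


def pool_size_kb(rocminfo_text: str) -> int:
    """Return the first 'Size: N(0x...) KB' value in a rocminfo excerpt."""
    match = re.search(r"Size:\s*(\d+)\(0x([0-9a-fA-F]+)\)\s*KB", rocminfo_text)
    if match is None:
        raise ValueError("no pool size line found")
    size_kb = int(match.group(1))
    # rocminfo prints the same value in decimal and hex; they should agree.
    if size_kb != int(match.group(2), 16):
        raise ValueError("decimal and hex sizes disagree")
    return size_kb


def kb_to_mb(size_kb: int) -> int:
    """Convert KB to whole MB (1 MB = 1024 KB), rounding down."""
    return size_kb // 1024


# Pool 1 line from the 32 GB VRAM case above.
sample = "     Size: 16353840(0xf98a30) KB"
print(kb_to_mb(pool_size_kb(sample)))  # 15970, the GTT size from the kernel log
```

Feeding it the `Size:` line from the default setup gives 31822, again the
GTT figure rather than the VRAM figure.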

[ Test Plan ]

1. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
   to install and set up the necessary host environment for ROCm.

2. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-docker-with-pytorch-pre-installed
   to use the prebuilt PyTorch image for easy verification. Alternatively,
   since only `rocminfo` is needed to observe the behavior, install the
   rocminfo snap instead to minimize the effort.

3. Assign more memory to VRAM. On the AMD Strix Halo development board,
   the setting is under: Device Manager => AMD CBS => NBIO Common
   Options => GFX Configuration => Dedicated Graphics Memory. The
   development board has 64 GB of RAM, and the available options are
   "High (32 GB)", "Medium (16 GB)", and "Minimum (0.5 GB)".

4. Use `rocminfo` to check whether the pool is now sized from VRAM:

```
*******
Agent 2
*******
 Name: gfx1151
 Uuid: GPU-XX
 Marketing Name: AMD Radeon Graphics
...
  Pool Info:
   Pool 1
     Segment: GLOBAL; FLAGS: COARSE GRAINED
     Size: 33554432(0x2000000) KB
     Allocatable: TRUE
     Alloc Granule: 4KB
     Alloc Recommended Granule:2048KB
     Alloc Alignment: 4KB
     Accessible by all: FALSE
```
With the patched kernel, the pool size is now 33554432 KB
= 32768 MB, matching the dedicated VRAM.
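
The expected pool size for each BIOS setting can be reasoned about with a
toy model of the selection rule. This is a sketch based only on the
observations above and on the patch title "use GTT for VRAM on APUs only
if GTT is larger"; `kfd_pool_mb` is a hypothetical name and this is not
the kernel implementation.

```python
def kfd_pool_mb(vram_mb: int, gtt_mb: int, patched: bool) -> int:
    """Hypothetical model of the Pool 1 size an APU reports, in MB."""
    if not patched:
        # Observed before the fix: the pool tracks GTT regardless of VRAM size.
        return gtt_mb
    # Per the patch title: use GTT only if it is larger than VRAM.
    return gtt_mb if gtt_mb > vram_mb else vram_mb


# (VRAM, GTT) pairs in MB, taken from the two kernel logs above.
for vram_mb, gtt_mb in [(512, 31822), (32768, 15970)]:
    before = kfd_pool_mb(vram_mb, gtt_mb, patched=False)
    after = kfd_pool_mb(vram_mb, gtt_mb, patched=True)
    print(f"VRAM={vram_mb} GTT={gtt_mb} unpatched={before} patched={after}")
```

Under this model the default 512 MB setup is unchanged by the fix, while
the 32 GB setup moves from 15970 MB to 32768 MB, which agrees with the
`rocminfo` output in step 4.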

[ Where problems could occur ]

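This corrects the allocation source as expected. The change affects
only APUs: VRAM allocations through amdkfd now come from GTT only when
GTT is larger than the dedicated VRAM. A regression would most likely
show up as an unexpected pool size in `rocminfo` or as allocation
failures in ROCm workloads on APUs.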

[ Other Info ]

Nominate for Plucky (6.14) and Noble (oem-6.14).

Alex Deucher (2):
  drm/amdkfd: add a new flag to manage where VRAM allocations go
  drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger

 drivers/gpu/drm/amd/amdgpu/amdgpu.h              |  5 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       |  4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 ++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  5 +++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c         |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c             |  4 ++--
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h             |  2 +-
 7 files changed, 24 insertions(+), 14 deletions(-)

-- 
2.50.0