[PATCH 0/2][SRU][P] Pytorch reports incorrect GPU memory causing "HIP Out of Memory" errors
You-Sheng Yang
vicamo.yang at canonical.com
Mon Aug 18 10:28:26 UTC 2025
BugLink: https://bugs.launchpad.net/bugs/2120454
[ Impact ]
PyTorch running on an APU may report out-of-memory errors even when half
of the system memory is reserved as dedicated VRAM, because VRAM
allocations are still placed in the GTT pool, which is limited to roughly
half of the remaining system memory.
This can be identified with `rocminfo` output:
1. For the default setup, with 512 MB dedicated to VRAM on a 64 GB RAM
AMD Strix Halo development board:
```
$ sudo journalctl -b -1 | grep -B1 GTT
Aug 15 18:47:46 test kernel: [drm] amdgpu: 512M of VRAM memory ready
Aug 15 18:47:46 test kernel: [drm] amdgpu: 31822M of GTT memory ready.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            62Gi       2.1Gi       3.5Gi        44Mi        57Gi        60Gi
Swap:             0B          0B          0B
*******
Agent 2
*******
Name: gfx1151
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32586284(0x1f13a2c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
```
Pool 1 reports 32586284 KB (≈31822 MB), i.e. the size of the GTT pool
rather than the 512 MB of dedicated VRAM.
2. With dedicated VRAM set to 32 GB:
```
$ sudo dmesg | grep -B1 GTT
[ 3.640984] [drm] amdgpu: 32768M of VRAM memory ready
[ 3.640986] [drm] amdgpu: 15970M of GTT memory ready.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            31Gi       1.6Gi        28Gi        44Mi       1.3Gi        29Gi
Swap:             0B          0B          0B
*******
Agent 2
*******
Name: gfx1151
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16353840(0xf98a30) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
```
In this case, although 33554432 KB (32768 MB) of VRAM is available, Pool 1
is still allocated from GTT and reports only 16353840 KB (≈15970 MB).
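The same mismatch can be cross-checked from PyTorch itself. The following
is only a minimal sketch (assuming a ROCm build of PyTorch and that the
APU is device 0); on an affected setup the reported total tracks the GTT
pool size rather than the dedicated VRAM:
```
import torch

# Memory figures as seen by PyTorch; on an affected APU setup these
# track the GTT pool (roughly half of system RAM) rather than the
# dedicated VRAM carve-out.
props = torch.cuda.get_device_properties(0)
free_b, total_b = torch.cuda.mem_get_info(0)
print(f"device       : {props.name}")
print(f"total_memory : {props.total_memory / 2**20:.0f} MiB")
print(f"mem_get_info : free {free_b / 2**20:.0f} MiB / total {total_b / 2**20:.0f} MiB")
```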
[ Test Plan ]
1. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
to install and set up the necessary host environment for ROCm.
2. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-docker-with-pytorch-pre-installed
to use the prebuilt PyTorch image for easy verification. Alternatively,
since only `rocminfo` is needed to observe the behavior, one may install
the rocminfo snap instead to minimize the effort.
3. Assign more memory to VRAM. On the AMD Strix Halo development board,
the setting is under: Device Manager => AMD CBS => NBIO Common Options
=> GFX Configuration => Dedicated Graphics Memory. On the 64 GB RAM
development board, the available options are "High (32 GB)",
"Medium (16 GB)" and "Minimum (0.5 GB)".
4. Use `rocminfo` to check whether the allocation has now switched to VRAM:
```
*******
Agent 2
*******
Name: gfx1151
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
...
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33554432(0x2000000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
```
With the kernel patched, the reported pool size is now 33554432 KB
= 32768 MB, matching the dedicated VRAM. A PyTorch-level check is
sketched below as well.
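Beyond `rocminfo`, the fix can be sanity-checked from PyTorch by
allocating a tensor larger than the old GTT-backed pool but within the
dedicated VRAM. This is only a minimal sketch, assuming the 32 GB VRAM
configuration from step 3 and a ROCm build of PyTorch; the 20 GiB size is
an arbitrary value above the unpatched ~16 GB limit:
```
import torch

free_b, total_b = torch.cuda.mem_get_info(0)
print(f"before: free {free_b / 2**30:.1f} / total {total_b / 2**30:.1f} GiB")

# 20 GiB of float32 elements: arbitrary size above the unpatched
# ~16 GB GTT limit but below the 32 GB dedicated VRAM.
n = (20 * 2**30) // 4
try:
    t = torch.empty(n, dtype=torch.float32, device="cuda")
    print(f"allocated {t.numel() * t.element_size() / 2**30:.1f} GiB on {t.device}")
except torch.cuda.OutOfMemoryError as exc:
    print(f"allocation failed: {exc}")
```
On an unpatched kernel the allocation is expected to fail with the "HIP
out of memory" error from the bug title; on a patched kernel it should
succeed.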
[ Where problems could occur ]
These patches only change where amdkfd sources VRAM allocations on APUs,
so potential regressions would be confined to amdgpu/amdkfd memory
management on AMD APU platforms.
[ Other Info ]
Nominated for Plucky (6.14) and for Noble (oem-6.14).
Alex Deucher (2):
drm/amdkfd: add a new flag to manage where VRAM allocations go
drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 5 +++++
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 4 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 ++++++++--------
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 5 +++++
drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 2 +-
drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 4 ++--
drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 +-
7 files changed, 24 insertions(+), 14 deletions(-)
--
2.50.0