ACK: [PATCH 0/2][SRU][P] Pytorch reports incorrect GPU memory causing "HIP Out of Memory" errors

Wen-chien Jesse Sung jesse.sung at canonical.com
Tue Aug 19 03:59:31 UTC 2025


You-Sheng Yang <vicamo.yang at canonical.com> writes:

> BugLink: https://bugs.launchpad.net/bugs/2120454
>
> [ Impact ]
>
> PyTorch running on an APU may report "HIP out of memory" errors when
> half of the system memory is reserved as dedicated VRAM.
>
> This can be identified with `rocminfo` output:
>
> With the default setup, 512 MB of VRAM on an AMD Strix Halo development
> board with 64 GB of RAM:
>
> ```
> $ sudo journalctl -b -1 | grep -B1 GTT
> Aug 15 18:47:46 test kernel: [drm] amdgpu: 512M of VRAM memory ready
> Aug 15 18:47:46 test kernel: [drm] amdgpu: 31822M of GTT memory ready.
> $ free -h
>               total        used        free      shared  buff/cache   available
> Mem:           62Gi       2.1Gi       3.5Gi        44Mi        57Gi        60Gi
> Swap:            0B          0B          0B
>
> *******
> Agent 2
> *******
>  Name: gfx1151
>  Uuid: GPU-XX
>  Marketing Name: AMD Radeon Graphics
> ...
>   Pool Info:
>    Pool 1
>      Segment: GLOBAL; FLAGS: COARSE GRAINED
>      Size: 32586284(0x1f13a2c) KB
>      Allocatable: TRUE
>      Alloc Granule: 4KB
>      Alloc Recommended Granule:2048KB
>      Alloc Alignment: 4KB
>      Accessible by all: FALSE
> ```
> Pool 1 is 32586284 KB (about 31822 MB), which matches the GTT size.
>
> With dedicated VRAM set to 32GB:
> ```
> $ sudo dmesg | grep -B1 GTT
> [ 3.640984] [drm] amdgpu: 32768M of VRAM memory ready
> [ 3.640986] [drm] amdgpu: 15970M of GTT memory ready.
> $ free -h
>               total        used        free      shared  buff/cache   available
> Mem:           31Gi       1.6Gi        28Gi        44Mi       1.3Gi        29Gi
> Swap:            0B          0B          0B
>
> *******
> Agent 2
> *******
>  Name: gfx1151
>  Uuid: GPU-XX
>  Marketing Name: AMD Radeon Graphics
> ...
>   Pool Info:
>    Pool 1
>      Segment: GLOBAL; FLAGS: COARSE GRAINED
>      Size: 16353840(0xf98a30) KB
>      Allocatable: TRUE
>      Alloc Granule: 4KB
>      Alloc Recommended Granule:2048KB
>      Alloc Alignment: 4KB
>      Accessible by all: FALSE
> ```
> In this case, although 32768 MB = 33554432 KB of VRAM is available, pool 1
> is still sized from GTT: 16353840 KB, about 15970 MB.
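
The comparison above can be automated. The following is a minimal Python
sketch, not part of the patch set: it assumes the "M" in the amdgpu log
lines means MiB, that 1 MB = 1024 KB as in the figures above, and that
only the GPU agent's excerpt of the rocminfo output is passed in. The
function names are hypothetical. The KB figure is rounded down to whole
MB before comparing, since the two values differ by a few hundred KB.

```python
import re

KIB_PER_MIB = 1024


def pool_size_mib(rocminfo_excerpt: str) -> int:
    """Return the first pool "Size:" value in the excerpt, in whole MiB."""
    match = re.search(r"Size:\s*(\d+)\(0x[0-9a-fA-F]+\)\s*KB", rocminfo_excerpt)
    if match is None:
        raise ValueError("no pool size found in rocminfo excerpt")
    return int(match.group(1)) // KIB_PER_MIB


def ready_mib(kernel_log: str, kind: str) -> int:
    """Return the "<N>M of <kind> memory ready" value from the amdgpu log."""
    match = re.search(rf"amdgpu: (\d+)M of {kind} memory ready", kernel_log)
    if match is None:
        raise ValueError(f"no {kind} size found in kernel log")
    return int(match.group(1))


def pool_source(rocminfo_excerpt: str, kernel_log: str) -> str:
    """Name the memory type whose size matches the pool, or "unknown"."""
    pool = pool_size_mib(rocminfo_excerpt)
    for kind in ("VRAM", "GTT"):
        if pool == ready_mib(kernel_log, kind):
            return kind
    return "unknown"


if __name__ == "__main__":
    # Values copied from the 32 GB dedicated VRAM case above.
    log = (
        "[drm] amdgpu: 32768M of VRAM memory ready\n"
        "[drm] amdgpu: 15970M of GTT memory ready.\n"
    )
    print(pool_source("Size: 16353840(0xf98a30) KB", log))
```

On an unpatched kernel with 32 GB of dedicated VRAM this should report
GTT. A pool size of 33554432 KB against the same log would report VRAM.
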
>
> [ Test Plan ]
>
> 1. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
>    to install and set up the necessary host environment for ROCm.
>
> 2. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-docker-with-pytorch-pre-installed
>    to use a prebuilt PyTorch image for easy verification. Alternatively,
>    since only `rocminfo` is needed to observe the behavior, install the
>    rocminfo snap instead to minimize the effort.
>
> 3. Assign more memory to VRAM. On the AMD Strix Halo development board,
>    the setting is under: Device Manager => AMD CBS => NBIO Common Options
>    => GFX Configuration => Dedicated Graphics Memory. The development
>    board has 64 GB of RAM, and the available options are "High (32 GB)",
>    "Medium (16 GB)", and "Minimum (0.5 GB)".
>
> 4. Use `rocminfo` to check whether the allocation has now switched to VRAM:
>
> ```
> *******
> Agent 2
> *******
>  Name: gfx1151
>  Uuid: GPU-XX
>  Marketing Name: AMD Radeon Graphics
> ...
>   Pool Info:
>    Pool 1
>      Segment: GLOBAL; FLAGS: COARSE GRAINED
>      Size: 33554432(0x2000000) KB
>      Allocatable: TRUE
>      Alloc Granule: 4KB
>      Alloc Recommended Granule:2048KB
>      Alloc Alignment: 4KB
>      Accessible by all: FALSE
> ```
> With the patched kernel, the pool 1 size is now 33554432 KB = 32768 MB,
> matching the dedicated VRAM.
>
> [ Where problems could occur ]
>
> This corrects the allocation source as expected.
>
> [ Other Info ]
>
> Nominate for Plucky for 6.14, and Noble for oem-6.14.
>
> Alex Deucher (2):
>   drm/amdkfd: add a new flag to manage where VRAM allocations go
>   drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h              |  5 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       |  4 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 ++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  5 +++++
>  drivers/gpu/drm/amd/amdkfd/kfd_migrate.c         |  2 +-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c             |  4 ++--
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.h             |  2 +-
>  7 files changed, 24 insertions(+), 14 deletions(-)
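
Going only by the two patch subjects and the rocminfo output quoted
above, the change in policy can be modeled roughly as below. This is a
Python sketch of the decision, not the kernel code: the function and
parameter names are hypothetical, and the "before" behavior is inferred
from the unpatched 32 GB case, where pool 1 was still sized from GTT.

```python
def pool_backing_before(is_apu: bool, vram_mib: int, gtt_mib: int) -> str:
    """Assumed pre-patch behavior: an APU always uses GTT for the pool."""
    return "GTT" if is_apu else "VRAM"


def pool_backing_after(is_apu: bool, vram_mib: int, gtt_mib: int) -> str:
    """Patched behavior per the subject line: GTT only if it is larger."""
    return "GTT" if is_apu and gtt_mib > vram_mib else "VRAM"


if __name__ == "__main__":
    # (VRAM, GTT) sizes in MiB, taken from the kernel logs quoted above.
    for vram, gtt in ((512, 31822), (32768, 15970)):
        print(vram, gtt,
              pool_backing_before(True, vram, gtt),
              pool_backing_after(True, vram, gtt))
```

With the default 512 MB carve-out both versions pick GTT, so only
configurations where dedicated VRAM exceeds GTT should see a change.
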
>
> -- 
> 2.50.0
>
>
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team

Acked-by: Wen-chien Jesse Sung <jesse.sung at canonical.com>


