APPLIED: [PATCH 0/2][SRU][P] Pytorch reports incorrect GPU memory causing "HIP Out of Memory" errors

Stefan Bader stefan.bader at canonical.com
Wed Aug 20 12:58:14 UTC 2025


On 18.08.25 12:28, You-Sheng Yang wrote:
> BugLink: https://bugs.launchpad.net/bugs/2120454
> 
> [ Impact ]
> 
> PyTorch running on an APU may report out of memory if half of the system
> memory is reserved for VRAM.
> 
> This can be identified with `rocminfo` output:
> 
> 1. For the default setup, with 512 MB of VRAM on an AMD Strix Halo
> development board with 64 GB of RAM:
> 
> ```
> $ sudo journalctl -b -1 | grep -B1 GTT
> Aug 15 18:47:46 test kernel: [drm] amdgpu: 512M of VRAM memory ready
> Aug 15 18:47:46 test kernel: [drm] amdgpu: 31822M of GTT memory ready.
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:            62Gi       2.1Gi       3.5Gi        44Mi        57Gi        60Gi
> Swap:             0B          0B          0B
> 
> *******
> Agent 2
> *******
>   Name: gfx1151
>   Uuid: GPU-XX
>   Marketing Name: AMD Radeon Graphics
> ...
>    Pool Info:
>     Pool 1
>       Segment: GLOBAL; FLAGS: COARSE GRAINED
>       Size: 32586284(0x1f13a2c) KB
>       Allocatable: TRUE
>       Alloc Granule: 4KB
>       Alloc Recommended Granule:2048KB
>       Alloc Alignment: 4KB
>       Accessible by all: FALSE
> ```
> Pool 1 is 32586284 KB, about 31822 MB, which matches the GTT size.
> 
> 2. With dedicated VRAM set to 32 GB:
> ```
> $ sudo dmesg | grep -B1 GTT
> [ 3.640984] [drm] amdgpu: 32768M of VRAM memory ready
> [ 3.640986] [drm] amdgpu: 15970M of GTT memory ready.
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:            31Gi       1.6Gi        28Gi        44Mi       1.3Gi        29Gi
> Swap:             0B          0B          0B
> 
> *******
> Agent 2
> *******
>   Name: gfx1151
>   Uuid: GPU-XX
>   Marketing Name: AMD Radeon Graphics
> ...
>    Pool Info:
>     Pool 1
>       Segment: GLOBAL; FLAGS: COARSE GRAINED
>       Size: 16353840(0xf98a30) KB
>       Allocatable: TRUE
>       Alloc Granule: 4KB
>       Alloc Recommended Granule:2048KB
>       Alloc Alignment: 4KB
>       Accessible by all: FALSE
> ```
> In this case, although we have 32768 MB = 33554432 KB of VRAM, pool 1 is
> still sized from GTT: 16353840 KB, about 15970 MB.
> 
> [ Test Plan ]
> 
> 1. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html
>     to install & setup necessary host environment for ROCm.
> 
> 2. Follow https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#using-docker-with-pytorch-pre-installed
>     to use the prebuilt PyTorch image for easy verification. Alternatively,
>     since only `rocminfo` is needed to identify the behavior, one may
>     install the rocminfo snap instead to minimize the effort.
> 
> 3. Assign more memory to VRAM. On AMD Strix Halo development board, it's
>     in: Device Manager => AMD CBS => NBIO Common Options => GFX
>     Configuration => Dedicated Graphics Memory. On the development board
>     we have 64 GB RAM, and the options available are "High (32 GB)",
>     "Medium (16 GB)", "Minimum (0.5 GB)".
> 
> 4. Use `rocminfo` to identify if the allocation is now switched to VRAM:
> 
> ```
> *******
> Agent 2
> *******
>   Name: gfx1151
>   Uuid: GPU-XX
>   Marketing Name: AMD Radeon Graphics
> ...
>    Pool Info:
>     Pool 1
>       Segment: GLOBAL; FLAGS: COARSE GRAINED
>       Size: 33554432(0x2000000) KB
>       Allocatable: TRUE
>       Alloc Granule: 4KB
>       Alloc Recommended Granule:2048KB
>       Alloc Alignment: 4KB
>       Accessible by all: FALSE
> ```
> With the kernel patched, the pool size is now 33554432 KB = 32768 MB,
> matching the dedicated VRAM.
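
The check in step 4 could be scripted roughly as follows. This is only a sketch: the helper names `pool_sizes_kb` and `reports_vram` are hypothetical, the regular expression assumes the `Size: <decimal>(0x<hex>) KB` line format shown in the quoted output, and it does not distinguish between agents, so pools from CPU agents would be included when run against full `rocminfo` output.

```python
import re

# Matches lines such as "      Size: 33554432(0x2000000) KB" in rocminfo output.
POOL_SIZE_RE = re.compile(
    r"^\s*Size:\s*(\d+)\(0x[0-9a-fA-F]+\)\s*KB\s*$", re.MULTILINE
)


def pool_sizes_kb(rocminfo_text: str) -> list[int]:
    """Return every pool size (in KB) found in the given rocminfo output."""
    return [int(m.group(1)) for m in POOL_SIZE_RE.finditer(rocminfo_text)]


def reports_vram(rocminfo_text: str, vram_mb: int) -> bool:
    """True if some pool size equals the dedicated VRAM size (MB converted to KB)."""
    return vram_mb * 1024 in pool_sizes_kb(rocminfo_text)


# Excerpt copied from the patched-kernel output quoted above.
SAMPLE = """\
   Pool Info:
    Pool 1
      Segment: GLOBAL; FLAGS: COARSE GRAINED
      Size: 33554432(0x2000000) KB
      Allocatable: TRUE
"""

print(pool_sizes_kb(SAMPLE))
print(reports_vram(SAMPLE, 32768))
```

On a real system the text would come from running `rocminfo` and capturing its standard output; the sample string only stands in for that here.
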
> 
> [ Where problems could occur ]
> 
> This corrects the allocation source as expected.
> 
> [ Other Info ]
> 
> Nominate for Plucky for 6.14, and Noble for oem-6.14.
> 
> Alex Deucher (2):
>    drm/amdkfd: add a new flag to manage where VRAM allocations go
>    drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger
> 
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h              |  5 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c       |  4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 ++++++++--------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c          |  5 +++++
>   drivers/gpu/drm/amd/amdkfd/kfd_migrate.c         |  2 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c             |  4 ++--
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.h             |  2 +-
>   7 files changed, 24 insertions(+), 14 deletions(-)
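
Going only by the title of the second patch ("use GTT for VRAM on APUs only if GTT is larger"), the intended selection can be illustrated with the sizes from the two logs above. This is a sketch of the rule as the title states it, not the kernel code; `kfd_pool_source` and its parameters are hypothetical names.

```python
def kfd_pool_source(vram_mb: int, gtt_mb: int, is_apu: bool = True) -> str:
    """Illustrative only: choose the memory backing "VRAM" allocations.

    Follows the patch title: on APUs, use GTT only if it is larger than VRAM.
    """
    if is_apu and gtt_mb > vram_mb:
        return "GTT"
    return "VRAM"


# Default setup from the first log: 512M VRAM, 31822M GTT.
print(kfd_pool_source(512, 31822))
# Dedicated VRAM set to 32 GB, from the second log: 32768M VRAM, 15970M GTT.
print(kfd_pool_source(32768, 15970))
```

With these inputs the first case selects GTT and the second selects VRAM, which is consistent with the 33554432 KB pool size reported in step 4 of the test plan.
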
> 


Applied to plucky:linux/master-next. Thanks.

-Stefan

