[SRU][N/O/P][PATCH 0/1] MGLRU: page allocation failure on NUMA-enabled systems

Heitor Alves de Siqueira halves at canonical.com
Wed Feb 12 12:14:27 UTC 2025


Hi Koichiro,

thanks for looking into this! Yes, I've used the attached scripts to
reproduce the issue successfully, although only on aarch64 systems
(specifically, I've used Grace-Grace for my tests).
I've not been able to reproduce this reliably on x86 or other
architectures, and using 64k page sizes also makes this much faster and
easier to reproduce.

On Wed, Feb 12, 2025 at 1:37 AM Koichiro Den <koichiro.den at canonical.com>
wrote:

> On Sun, Feb 02, 2025 at 12:21:50PM GMT, Heitor Alves de Siqueira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/2097214
> >
> > [Impact]
> >  * On MGLRU-enabled systems, high memory pressure on NUMA nodes will
> >    cause page allocation failures
> >  * This happens due to page reclaim not waking up flusher threads
> >  * OOM can be triggered even if the system has enough available memory
> >
> > [Test Plan]
> >  * For the bug to properly trigger, we should uninstall apport and use
> >    the attached alloc_and_crash.c reproducer
> >  * alloc_and_crash will mmap a huge range of memory, memset it and
> >    forcibly SEGFAULT
> >  * The attached bash script will membind alloc_and_crash to NUMA node 0,
> >    so we can see the allocation failures in dmesg
> >    $ sudo apt remove --purge apport
> >    $ sudo dmesg -c; ./lp2097214-repro.sh; sleep 2; sudo dmesg
>
> I looked over the attached files (alloc_and_crash.c and
> lp2097214-repro.sh).
>
> Question:
> Did you use them to reproduce the issue that you want to resolve here?
> Also, did you confirm that the issue was resolved after applying the patch
> for Noble/Oracular/Plucky? It seems to me that it's just stressing lru
> list for ANON, not FILE.
>
> >
> > [Fix]
> >  * The upstream patch wakes up flusher threads if there are too many
> >    dirty entries in the coldest LRU generation
> >  * This happens when trying to shrink lruvecs, so the flushers only get
> >    woken up during high memory pressure
> >  * Fix was introduced by commit:
> >      1bc542c6a0d1 mm/vmscan: wake up flushers conditionally to avoid
> >      cgroup OOM
> >
> > [Regression Potential]
> >  * This commit fixes the memory reclaim path, so regressions would
> >    likely show up during increased system memory pressure
> >  * According to the upstream patch, increased SSD/disk wearing is
> >    possible due to waking up flusher threads, although this has not
> >    been noted in testing
> >
> > Zeng Jingxiang (1):
> >   mm/vmscan: wake up flushers conditionally to avoid cgroup OOM
> >
> >  mm/vmscan.c | 25 ++++++++++++++++++++++---
> >  1 file changed, 22 insertions(+), 3 deletions(-)
> >
> > --
> > 2.48.1
> >
> >
> > --
> > kernel-team mailing list
> > kernel-team at lists.ubuntu.com
> > https://lists.ubuntu.com/mailman/listinfo/kernel-team
>
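
For reference, the shape of the upstream fix (commit 1bc542c6a0d1) is
roughly the following pseudocode. This is a sketch of the idea, not the
literal diff; apart from wakeup_flusher_threads(), which is a real kernel
helper, all names and the threshold are illustrative:

```
/* While evicting folios from the coldest MGLRU generation, count those
 * that are dirty but not yet queued for writeback; if too many
 * accumulate, wake the flushers so writeback can make the pages
 * reclaimable instead of letting reclaim fail and trigger OOM. */
evict_folios(lruvec, sc):
    scanned, dirty_unqueued = scan_coldest_generation(lruvec)
    if dirty_unqueued > threshold(scanned):
        wakeup_flusher_threads(WB_REASON_VMSCAN)
```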
