ACK: [SRU][Q][PATCH 0/1] System hangs during stress-ng stack test
Masahiro Yamada
masahiro.yamada at canonical.com
Thu Apr 9 05:27:39 UTC 2026
On 4/7/26 15:34, AceLan Kao wrote:
> From: "Chia-Lin Kao (AceLan)" <acelan.kao at canonical.com>
>
> BugLink: https://bugs.launchpad.net/bugs/2137755
>
> [Impact]
> stress-ng memory stress test fails with stack stressor timeout on Dell
> systems (CID: 202511-38062) running kernel 6.17.0-1007-oem. The stack
> stressor, which creates heavy memory pressure and swap activity,
> consistently times out after running for the expected duration.
>
> The issue occurs because the swap allocator uses an incorrect index when
> retrying swap cache reclaim after encountering a race condition. During
> heavy memory pressure (such as generated by the stack stressor), the
> allocator reclaims cached swap slots while scanning. If it finds a folio
> that's already removed from the swap cache due to a race, it retries - but
> the retry uses the wrong index, which can lead to:
> 1. Reclaiming irrelevant swap folios instead of the intended ones
> 2. Inefficient swap reclaim behavior under memory pressure
> 3. Performance degradation that causes stress tests to time out
>
> Affected hardware: Dell systems (CID: 202511-38062) with high core count
> and memory configurations
> Failure rate: 100% (2/2 test runs failed)
>
> [Fix]
> Upstream commit a733d8de7f1cc ("mm, swap: fix swap cache index error when
> retrying reclaim") fixes the swap cache index handling.
>
> The fix makes two key changes:
> 1. Makes the `entry` variable const to prevent incorrect reassignment
> 2. Uses `folio->swap` directly when updating the offset after retrying,
> instead of using the stale `entry` variable
>
> This ensures that when the allocator retries after a race condition, it
> uses the correct swap cache index from the locked folio, preventing reclaim
> of irrelevant folios.
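
The retry behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the kernel code: `build_cache`, `find_folio`, the `raced` argument and the `max_tries` cap are all hypothetical names invented for illustration, and a folio is reduced to a `(first swap offset, page count)` pair standing in for `folio->swap` and the folio size. The first lookup is forced to return a folio that no longer covers the requested offset, which simulates the race.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a folio: (first swap offset, number of pages).
Folio = Tuple[int, int]


def build_cache(folios: List[Folio]) -> Dict[int, Folio]:
    """Map every swap offset covered by a folio to that folio."""
    cache: Dict[int, Folio] = {}
    for start, nr_pages in folios:
        for off in range(start, start + nr_pages):
            cache[off] = (start, nr_pages)
    return cache


def find_folio(cache: Dict[int, Folio], offset: int, raced: Folio,
               entry_is_const: bool,
               max_tries: int = 4) -> Tuple[Optional[Folio], List[int]]:
    """Return (folio covering `offset` or None, index used for each lookup).

    `raced` is what the first lookup returns, simulating a folio whose
    swap entry changed before it could be locked (assumption of the model).
    """
    entry = offset
    indices: List[int] = []
    for attempt in range(max_tries):
        indices.append(entry)
        folio = raced if attempt == 0 else cache.get(entry)
        if folio is None:
            return None, indices
        start, nr_pages = folio
        if not entry_is_const:
            # Pre-fix behaviour in this model: the lookup index is
            # overwritten with the swap entry of whatever folio was found.
            entry = start
        if start <= offset < start + nr_pages:
            return folio, indices
        # Folio does not cover the offset: retry the lookup.
    return None, indices
```

With a cache built from `[(8, 4), (40, 4)]`, offset 10 and a raced folio `(40, 4)`, the non-const variant keeps looking up index 40 after the first attempt and never reaches the folio that covers offset 10, while the const variant retries with index 10 and finds `(8, 4)`. The bounded retry count is an artifact of the model and says nothing about how the kernel loop terminates.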
>
> The patch is upstream in mainline kernel v6.18 and was reviewed by multiple
> memory management maintainers.
>
> Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com
> Fixes: fae859550531 ("mm, swap: avoid reclaiming irrelevant swap cache")
>
> [Test Plan]
> On affected Dell systems (CID: 202511-38062) or similar systems with high
> core count and memory:
>
> 1. Install kernel with the fix
>
> 2. Run the stress test:
> ```
> # Run stress-ng with stack stressor
> stress-ng --aggressive --verify --oom-avoid-bytes 10% --timeout 920 --stack 8
> ```
>
> 3. Monitor the test execution:
> - The test should complete within the expected 920-second timeout
> - Check that stress-ng reports "successful run completed" for the stack
> stressor
>
> Without the patch:
> - stress-ng stack stressor times out and is forcefully terminated
> - System may hang if the stress-ng process fails to be killed
>
> With the patch:
> - stress-ng stack stressor completes within the timeout period
>
> 4. Optionally verify swap activity during the test:
> ```
> # Monitor swap usage
> watch -n 1 'free -h && cat /proc/swaps'
> ```
> Swap should be actively used and reclaimed without unusual delays.
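
For a log that can be attached to the bug, the optional swap check in step 4 could also be scripted. The following is a minimal sketch, assuming the standard `SwapTotal`, `SwapFree` and `SwapCached` fields of `/proc/meminfo`; the function names `parse_swap_kib`, `swap_used_kib` and `monitor`, and the sampling defaults, are hypothetical and not part of the test plan.

```python
import time
from typing import Dict

SWAP_KEYS = ("SwapTotal", "SwapFree", "SwapCached")


def parse_swap_kib(meminfo_text: str) -> Dict[str, int]:
    """Extract the swap-related fields (in kB) from /proc/meminfo text."""
    fields: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in SWAP_KEYS:
            # Lines look like "SwapTotal:       8388604 kB".
            fields[key] = int(rest.split()[0])
    return fields


def swap_used_kib(fields: Dict[str, int]) -> int:
    """Swap currently in use, in kB."""
    return fields["SwapTotal"] - fields["SwapFree"]


def monitor(samples: int = 60, interval_s: float = 1.0,
            path: str = "/proc/meminfo") -> None:
    """Print one line per sample with used and cached swap."""
    for _ in range(samples):
        with open(path) as f:
            fields = parse_swap_kib(f.read())
        print(f"used={swap_used_kib(fields)} kB "
              f"cached={fields.get('SwapCached', 0)} kB")
        time.sleep(interval_s)
```

Calling `monitor(samples=920)` in a second terminal while stress-ng runs would cover the whole test window at one sample per second.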
>
> [Where problems could occur]
> The changes affect the swap file subsystem's reclaim logic in mm/swapfile.c,
> specifically the __try_to_reclaim_swap() function.
>
> If the fix introduces incorrect behavior:
>
> 1. **Incorrect folio identification**: If `folio->swap` doesn't properly
> reflect the current state after locking, the code might still reclaim the
> wrong folio. However, this is unlikely since the folio is locked and the
> swap entry is validated before use.
>
> 2. **Performance regression**: The change from using a cached `entry` value
> to dereferencing `folio->swap` multiple times could theoretically impact
> performance. However, this should be negligible as the additional
> dereferences only occur in the retry path (race condition case) which is not
> the common case.
>
> 3. **Const qualifier issues**: Making `entry` const prevents reassignment.
> If there were other code paths that relied on reassigning `entry` (not
> visible in the upstream patch), compilation would fail. However, the
> upstream kernel has this change merged and tested.
>
> 4. **Backport conflicts**: The backport required manual resolution because
> the target branch still has an `address_space` variable that was removed
> upstream. If the resolution was incorrect, swap cache lookups could fail.
> However, the resolution preserves the `address_space` variable while
> applying the const qualifier and folio->swap usage as intended.
>
> The impact is limited to swap reclaim behavior under memory pressure. The
> fix makes the code more correct by ensuring the right swap slots are
> reclaimed during races, which should improve rather than degrade stability.
>
>
> Kairui Song (1):
> mm, swap: fix swap cache index error when retrying reclaim
>
> mm/swapfile.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
Acked-by: Masahiro Yamada <masahiro.yamada at canonical.com>