[SRU][Q][PATCH 0/1] System hangs during stress-ng stack test
AceLan Kao
acelan.kao at canonical.com
Tue Apr 7 06:34:58 UTC 2026
From: "Chia-Lin Kao (AceLan)" <acelan.kao at canonical.com>
BugLink: https://bugs.launchpad.net/bugs/2137755
[Impact]
stress-ng memory stress test fails with stack stressor timeout on Dell
systems (CID: 202511-38062) running kernel 6.17.0-1007-oem. The stack
stressor, which creates heavy memory pressure and swap activity,
consistently times out after running for the expected duration.
The issue occurs because the swap allocator uses an incorrect index when
retrying swap cache reclaim after encountering a race condition. During
heavy memory pressure (such as generated by the stack stressor), the
allocator reclaims cached swap slots while scanning. If it finds a folio
that has already been removed from the swap cache due to a race, it retries,
but the retry uses the wrong index, which can lead to:
1. Reclaiming irrelevant swap folios instead of the intended ones
2. Inefficient swap reclaim behavior under memory pressure
3. Performance degradation that causes stress tests to time out
Affected hardware: Dell systems (CID: 202511-38062) with high core count
and memory configurations
Failure rate: 100% (2/2 test runs failed)
[Fix]
Upstream commit a733d8de7f1cc ("mm, swap: fix swap cache index error when
retrying reclaim") fixes the swap cache index handling.
The fix makes two key changes:
1. Makes the `entry` variable const to prevent incorrect reassignment
2. Uses `folio->swap` directly when updating the offset after retrying,
instead of using the stale `entry` variable
This ensures that when the allocator retries after a race condition, it
uses the correct swap cache index from the locked folio, preventing reclaim
of irrelevant folios.
The patch is upstream in mainline kernel v6.18 and was reviewed by multiple
memory management maintainers.
Link: https://lkml.kernel.org/r/20250916160100.31545-4-ryncsn@gmail.com
Fixes: fae859550531 ("mm, swap: avoid reclaiming irrelevant swap cache")
[Test Plan]
On affected Dell systems (CID: 202511-38062) or similar systems with high
core count and memory:
1. Install kernel with the fix
2. Run the stress test:
```
# Run stress-ng with stack stressor
stress-ng --aggressive --verify --oom-avoid-bytes 10% --timeout 920 --stack 8
```
3. Monitor the test execution:
- The test should complete within the expected 920-second timeout
- Check that stress-ng reports "successful run completed" for the stack
stressor
Without the patch:
- The stress-ng stack stressor times out and is forcefully terminated
- The system may hang if the stress-ng process cannot be killed
With the patch:
- The stress-ng stack stressor completes within the timeout period
4. Optionally verify swap activity during the test:
```
# Monitor swap usage
watch -n 1 'free -h && cat /proc/swaps'
```
Swap should be actively used and reclaimed without unusual delays.
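The checks in steps 3 and 4 can also be scripted. The following is a minimal
sketch, not part of the original test plan: the helper names `stress_log_ok`
and `swap_used_kib` and the log file name `stack.log` are hypothetical, and it
assumes the stress-ng run was started with `--log-file` so that the
"successful run completed" message ends up in that file.
```shell
#!/bin/sh
# Hypothetical helpers for the test plan above; names are illustrative only.

# Succeed if the given stress-ng log contains the success message from step 3.
stress_log_ok() {
    grep -q 'successful run completed' "$1"
}

# Print the total swap currently in use, in KiB, from a file in /proc/swaps
# format (columns: Filename Type Size Used Priority; first line is a header).
# Defaults to /proc/swaps when no file is given.
swap_used_kib() {
    awk 'NR > 1 { used += $4 } END { print used + 0 }' "${1:-/proc/swaps}"
}

# Possible usage, assuming the run from step 2 was logged to stack.log:
#   stress-ng --aggressive --verify --oom-avoid-bytes 10% --timeout 920 \
#       --stack 8 --log-file stack.log
#   stress_log_ok stack.log && echo PASS || echo FAIL
#   swap_used_kib
```
Sampling `swap_used_kib` periodically during the run gives a rough, loggable
alternative to watching `free -h` by hand.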
[Where problems could occur]
The changes affect the swap file subsystem's reclaim logic in mm/swapfile.c,
specifically the __try_to_reclaim_swap() function.
If the fix introduces incorrect behavior:
1. **Incorrect folio identification**: If `folio->swap` doesn't properly
reflect the current state after locking, the code might still reclaim the
wrong folio. However, this is unlikely since the folio is locked and the
swap entry is validated before use.
2. **Performance regression**: The change from using a cached `entry` value
to dereferencing `folio->swap` multiple times could theoretically impact
performance. However, this should be negligible as the additional
dereferences only occur in the retry path (race condition case) which is not
the common case.
3. **Const qualifier issues**: Making `entry` const prevents reassignment.
If there were other code paths that relied on reassigning `entry` (not
visible in the upstream patch), compilation would fail. However, the
upstream kernel has this change merged and tested.
4. **Backport conflicts**: The backport required manual resolution because
the target branch still has an `address_space` variable that was removed
upstream. If the resolution was incorrect, swap cache lookups could fail.
However, the resolution preserves the `address_space` variable while
applying the const qualifier and folio->swap usage as intended.
The impact is limited to swap reclaim behavior under memory pressure. The
fix makes the code more correct by ensuring the right swap slots are
reclaimed during races, which should improve rather than degrade stability.
Kairui Song (1):
mm, swap: fix swap cache index error when retrying reclaim
mm/swapfile.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--
2.53.0