[SRU][Jammy][PATCH 0/1] isolcpus are ignored when using cgroups V2, causing processes to have wrong affinity
Matthew Ruffell
matthew.ruffell at canonical.com
Wed Aug 14 03:41:41 UTC 2024
BugLink: https://bugs.launchpad.net/bugs/2076957
[Impact]
In latency sensitive environments, it is very common to use isolcpus to reserve
a set of cpus on which no other processes are placed, and to run only dpdk in
poll mode on those cpus.
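For illustration (not part of the original report; dpdk normally takes its core
list via its own -l option), pinning a workload to a fixed cpu set looks like
the following sketch, which uses cpu 0 so it runs on any machine — on the
server described below you would use the reserved cpus 4-7,32-35:

```shell
# Hypothetical example: confine a process to cpu 0 and report the affinity
# actually applied to it.
taskset -c 0 sleep 1 &
pid=$!
aff=$(taskset -cp "$pid")   # e.g. "pid 1234's current affinity list: 0"
echo "$aff"
wait "$pid"
```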
There is a bug in the jammy kernel, where if cgroups V2 are enabled, after
several minutes the kernel will place other processes onto these reserved
isolcpus at random. This disturbs dpdk and introduces latency.
The issue does not occur with cgroups V1, so a workaround is to use cgroups V1
instead of V2 for the moment.
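To confirm which hierarchy a machine is actually running (a quick check, not
from the report itself), inspect the filesystem type mounted at /sys/fs/cgroup:

```shell
# cgroup2fs means the unified v2 hierarchy; tmpfs (with per-controller
# cgroup mounts underneath) indicates v1 or a hybrid setup.
fstype=$(stat -fc %T /sys/fs/cgroup)
if [ "$fstype" = "cgroup2fs" ]; then
    echo "cgroups v2 (unified hierarchy)"
else
    echo "cgroups v1 or hybrid ($fstype)"
fi
```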
[Fix]
After a full git bisect, I arrived at the following commit, which fixes the
issue. It landed in 6.2-rc1:
commit 7fd4da9c1584be97ffbc40e600a19cb469fd4e78
Author: Waiman Long <longman at redhat.com>
Date: Sat Nov 12 17:19:39 2022 -0500
Subject: cgroup/cpuset: Optimize cpuset_attach() on v2
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fd4da9c1584be97ffbc40e600a19cb469fd4e78
Only the 5.15 Jammy kernel needs this fix. Focal works correctly as is.
The commit skips calls to cpuset_attach() if the underlying cpusets or memory
have not changed in a cgroup, and it seems to fix the issue.
[Testcase]
Deploy a bare metal server, ideally with a large number of cores; 56 is plenty.
Use Jammy, with the 5.15 GA kernel.
1) Edit /etc/default/grub and set GRUB_CMDLINE_LINUX_DEFAULT to have
"isolcpus=4-7,32-35 rcu_nocb_poll rcu_nocbs=4-7,32-35 systemd.unified_cgroup_hierarchy=1"
2) sudo reboot
3) sudo cat /sys/devices/system/cpu/isolated
4-7,32-35
4) sudo apt install s-tui stress
5) sudo s-tui
6) htop
7) $ while true; do
       sudo ps -eLF | head -n 1
       for cpu in 4 5 6 7 32 33 34 35; do
           sudo ps -eLF | grep stress | awk -v a="$cpu" '$9 == a {print;}'
       done
       sleep 5
   done
isolcpus is set up to split off 4-7 and 32-35, so each NUMA node has a set of
isolated cpus.
s-tui is a convenient frontend for stress, and it starts the stress processes.
All stress processes should initially run on non-isolated cpus; confirm with
htop that cpus 4-7 and 32-35 sit at 0% while every other cpu is at 100%.
After about 3 minutes, though it can sometimes take up to 10, a stress process
or the s-tui process will be incorrectly placed onto an isolated cpu, causing
that cpu's usage to rise in htop. The while loop checking cpu affinities with
ps will also likely print the incorrectly placed process.
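The manual loop above can also be wrapped into a small script (a sketch,
assuming the sysfs path from step 3; expand_range is a hypothetical helper)
that flags any thread whose last-run cpu is in the isolated set:

```shell
# Expand a cpu list like "4-7,32-35" into "4 5 6 7 32 33 34 35".
expand_range() {
    [ -n "$1" ] || return 0
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done | tr '\n' ' '
}

# PSR (column 9 of 'ps -eLF') is the cpu each thread last ran on; any thread
# reported here has been placed onto a cpu that isolcpus reserved.
isolated=$(expand_range "$(cat /sys/devices/system/cpu/isolated 2>/dev/null)")
echo "isolated cpus: $isolated"
for cpu in $isolated; do
    ps -eLF | awk -v a="$cpu" '$9 == a {print "on isolated cpu " a ": " $0}'
done
```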
A test kernel is available in the following ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf391137-test
If you install it, the processes will not be placed onto the isolated cpus.
[Where problems could occur]
The patch changes how the cgroup code determines when cpuset_attach() should
be called. cpuset_attach() is currently called very frequently in the 5.15
Jammy kernel, but most calls are effectively no-ops, since neither the cpusets
nor the memory of the cgroup the process is attached to have changed. We are
changing it to skip calling cpuset_attach() entirely when nothing has changed,
which should offer a small performance increase as well as fix this isolcpus
bug.
If a regression were to occur, it would affect cgroups V2 only, and it could
cause resource limits to be applied incorrectly in the worst case.
Waiman Long (1):
cgroup/cpuset: Optimize cpuset_attach() on v2
kernel/cgroup/cpuset.c | 24 +++++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
--
2.45.2