ACK: [SRU][N][PATCH 0/4] rcu: Eliminate deadlocks involving do_exit() and RCU tasks
Cengiz Can
cengiz.can at canonical.com
Thu Aug 7 09:05:23 UTC 2025
On 21-07-25 15:44:09, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2117123
>
> [Impact]
>
> Tracing tools, such as eBPF fentry programs, can be attached to tasks all the
> way to very late in do_exit(), and because of this, synchronize_rcu_tasks()
> needs to wait for the dying task to finish and the tracer to be removed, even
> though the task is no longer on the task list. This is explained on:
>
> 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
>
> > Once a task has passed exit_notify() in the do_exit() code path, it is no
> > longer on the task lists, and is therefore no longer visible to
> > rcu_tasks_kthread().
>
> An SRCU domain (tasks_rcu_exit_srcu) was introduced to handle this issue, to
> wait for tasks that could still be in a critical section but are no longer on
> the RCU tasks list. Unfortunately, there has been a class of deadlocks in
> do_exit() for years that has been largely ignored, but that was recently
> reproduced by a syzkaller script:
>
> https://github.com/xupengfe/syzkaller_logs/blob/main/221115_105658_synchronize_rcu/repro.c
>
> Frederic Weisbecker provides the following analysis:
>
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
> that every subsequent child of TASK A will belong to. But TASK A doesn't
> itself belong to that new PID namespace.
>
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
> thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
> and TASK B is the first task belonging to the new PID namespace created by
> unshare() (let's call it PID_NS2).
>
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
> child reaper.
>
> 4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
> Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
> TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
>
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
> kill all other tasks attached to PID_NS2, and wait for all of them to die
> before reaping itself (zap_pid_ns_processes()). Note it seems to make a
> misleading assumption here, trusting that all tasks in PID_NS2 either
> get reaped by a parent belonging to the same namespace or by TASK B.
> And it is confident that since it deactivated SIGCHLD handler, all
> the remaining tasks ultimately autoreap. And it waits for that to happen.
> However TASK C escapes that rule because it will get reaped by its parent
> TASK A belonging to PID_NS1.
>
> 6) TASK A calls synchronize_rcu_tasks() which leads to
> synchronize_srcu(&tasks_rcu_exit_srcu).
>
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps)
> But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
> (exit_notify() is between exit_tasks_rcu_start() and
> exit_tasks_rcu_finish()), blocking TASK A.
>
> 8) TASK C exits and since TASK A is its parent, it waits for TASK A to reap it,
> but TASK A can't because it waits for TASK B, which waits for TASK C.
>
> So there is a circular dependency:
>
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
> section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
>
> An example stack trace is:
>
> kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.
> kernel: Not tainted 6.8.0-63-generic #66-Ubuntu
> kernel: task:rcu_tasks_trace state:D stack:0 pid:15 tgid:15 ppid:2 flags:0x00004000
> kernel: Call Trace:
> kernel: <TASK>
> kernel: __schedule+0x27c/0x6b0
> kernel: schedule+0x33/0x110
> kernel: schedule_timeout+0x157/0x170
> kernel: wait_for_completion+0x88/0x150
> kernel: __wait_rcu_gp+0x17e/0x190
> kernel: synchronize_rcu+0x12d/0x140
> kernel: ? __pfx_call_rcu_hurry+0x10/0x10
> kernel: ? __pfx_wakeme_after_rcu+0x10/0x10
> kernel: rcu_tasks_trace_postscan+0xe/0x20
> kernel: rcu_tasks_wait_gp+0x119/0x310
> kernel: ? _raw_spin_lock_irqsave+0xe/0x20
> kernel: ? rcu_tasks_need_gpcb+0x1f7/0x350
> kernel: ? __pfx_rcu_tasks_kthread+0x10/0x10
> kernel: rcu_tasks_one_gp+0x122/0x150
> kernel: rcu_tasks_kthread+0xa4/0xd0
> kernel: kthread+0xef/0x120
> kernel: ? __pfx_kthread+0x10/0x10
> kernel: ret_from_fork+0x44/0x70
> kernel: ? __pfx_kthread+0x10/0x10
> kernel: ret_from_fork_asm+0x1b/0x30
> kernel: </TASK>
> kernel: task:system-probe state:D stack:0 pid:1989 tgid:1931 ppid:1926 flags:0x00000002
> kernel: Call Trace:
> kernel: <TASK>
> kernel: __schedule+0x27c/0x6b0
> kernel: schedule+0x33/0x110
> kernel: schedule_timeout+0x157/0x170
> kernel: wait_for_completion+0x88/0x150
> kernel: __wait_rcu_gp+0x17e/0x190
> kernel: synchronize_rcu_tasks_generic+0x64/0xe0
> kernel: ? __pfx_call_rcu_tasks_trace+0x10/0x10
> kernel: ? __pfx_wakeme_after_rcu+0x10/0x10
> kernel: synchronize_rcu_tasks_trace+0x15/0x20
> kernel: perf_event_detach_bpf_prog+0x7d/0xe0
> kernel: _free_event+0x20e/0x2a0
> kernel: perf_event_release_kernel+0x281/0x2e0
> kernel: perf_release+0x15/0x30
> kernel: __fput+0xa0/0x2e0
> kernel: __fput_sync+0x1c/0x30
> kernel: __x64_sys_close+0x3e/0x90
> kernel: x64_sys_call+0x1fec/0x25a0
> kernel: do_syscall_64+0x7f/0x180
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? filp_flush+0x57/0x90
> kernel: ? syscall_exit_to_user_mode+0x86/0x260
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? restore_fpregs_from_fpstate+0x3d/0xd0
> kernel: ? switch_fpu_return+0x55/0xf0
> kernel: ? filp_flush+0x57/0x90
> kernel: ? syscall_exit_to_user_mode+0x86/0x260
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? filp_flush+0x57/0x90
> kernel: ? syscall_exit_to_user_mode+0x86/0x260
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? do_syscall_64+0x8c/0x180
> kernel: ? irqentry_exit_to_user_mode+0x7b/0x260
> kernel: ? irqentry_exit+0x43/0x50
> kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80
>
> [Fix]
>
> The entire patchset is listed below. 3 out of the 7 have already been applied to
> ubuntu-noble due to being a dependency of another commit. We only need the 4
> missing commits.
>
> This was mainlined in 6.9-rc1 by the following commits:
>
> commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Mon Dec 4 09:33:29 2023 -0800
> Subject: rcu-tasks: Repair RCU Tasks Trace quiescence check
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb52fa8900e642b3b5054c4bf9776089d2a935f
> Applied: Yes. ubuntu-noble 7e16c7d2a1ee
>
> commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Mon Feb 5 13:08:22 2024 -0800
> Subject: rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe93930ea1ea3c6c115a7d44af6e4fea609067e
> Applied: Yes. ubuntu-noble b9014deb33e6
>
> commit 30ef09635b9ed3ebca4f677495332a2e444a5cda
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Thu Feb 22 12:29:54 2024 -0800
> Subject: rcu-tasks: Initialize callback lists at rcu_init() time
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30ef09635b9ed3ebca4f677495332a2e444a5cda
> Applied: No. Needed.
>
> commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Mon Feb 5 13:10:19 2024 -0800
> Subject: rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
> Applied: Yes. ubuntu-noble c8da4b0160db
>
> commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Fri Feb 2 11:28:45 2024 -0800
> Subject: rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6b70399f9ef3809f6e308fd99dd78b072c1bd05c
> Applied: No. Needed.
>
> commit 1612160b91272f5b1596f499584d6064bf5be794
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Fri Feb 2 11:49:06 2024 -0800
> Subject: rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1612160b91272f5b1596f499584d6064bf5be794
> Applied: No. Needed.
>
> commit 0bb11a372fc8d7006b4d0f42a2882939747bdbff
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date: Thu Feb 1 06:10:26 2024 -0800
> Subject: rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bb11a372fc8d7006b4d0f42a2882939747bdbff
> Applied: No. Needed.
>
> The 4 needed commits are all clean cherry picks.
>
> [Testcase]
>
> To reproduce the do_exit() deadlock using the syzkaller repro:
>
> $ sudo apt install build-essential
> $ wget https://raw.githubusercontent.com/xupengfe/syzkaller_logs/refs/heads/main/221115_105658_synchronize_rcu/repro.c
> $ gcc -o repro repro.c
> $ sudo ./repro
> $ journalctl -f -t kernel
>
> Due to the high regression risk of this patchset, we should run rcutorture, the
> rcu test suite, over a patched kernel to ensure there are no deadlocks.
>
> To run rcutorture on the kernel build:
>
> Documentation:
> https://docs.kernel.org/RCU/torture.html
>
> 1) Clone the kernel source code
> 2) Save the following patch, which enables CONFIG_RCU_TORTURE_TEST, as
> 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
> https://launchpadlibrarian.net/805611005/0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
> 3) $ git am 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
> 4) Build a new kernel with the patch applied, boot into it.
> 5) $ modprobe rcutorture
> 6) Follow dmesg.
> $ journalctl -f -t kernel
> kernel: rcu-torture: rcu_torture_read_exit: Start of episode
> kernel: rcu-torture: rcu_torture_read_exit: End of episode
> kernel: rcu_torture_fwd_prog_nr: 0 Duration 50060 cver 1081 gps 1490
> kernel: rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x80() 0
> kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 rta: 62342 rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 1396993 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=1000) barrier: 0/0:0 read-exits: 1792 nocb-toggles: 0:0
> kernel: rcu-torture: Reader Pipe: 2350715188 99444 0 0 0 0 0 0 0 0 0
> kernel: rcu-torture: Reader Batch: 2350551525 263107 0 0 0 0 0 0 0 0 0
> kernel: rcu-torture: Free-Block Circulation: 62341 62340 62339 62338 62336 62335 62334 62333 62332 62331 0
>
> Read the documentation and ensure you see "Success" and no "FAILURE" messages.
> Ensure all the values that should be 0 are indeed 0.
>
> Leave rcutorture running for several hours / days.
>
> There is a test kernel available in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf411904-config
>
> If you install it, it should not deadlock on the reproducer anymore, and you can
> also load the rcutorture kernel module for regression testing.
>
> [Where problems could occur]
>
> We are changing what happens to tasks that are late in do_exit(), and are now
> adding them to a new list to keep track of them while they could be in an RCU
> critical section.
>
> These are some large changes to the RCU subsystem, and they affect nearly every
> other subsystem of the kernel, as RCU is used everywhere.
>
> If a regression were to occur, it would involve RCU grace periods getting stuck,
> leading to deadlocks and hung task timeouts with no real workarounds.
>
> We need to ensure we test this change with rcutorture for the whole duration the
> kernel is in -proposed.
>
> [Other info]
>
> Upstream mailing list report:
> https://lore.kernel.org/lkml/Y3sOgrOmMQqPMItu@xpf.sh.intel.com/T/#u
>
> Paul E. McKenney's architecture document:
> https://docs.google.com/document/d/1hJxgiZ5TMZ4YJkdJPLAkRvq7sYQ-A7svgA8no6i-v8k/edit?usp=sharing
>
> syzkaller scripts, C reproducer, dmesg logs:
> https://github.com/xupengfe/syzkaller_logs/tree/main/221115_105658_synchronize_rcu
>
> Upstream mailing list submission:
> https://lore.kernel.org/lkml/20240217012745.3446231-1-boqun.feng@gmail.com/T/#u
>
> Paul E. McKenney (4):
> rcu-tasks: Initialize callback lists at rcu_init() time
> rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
> rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
> rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
>
Acked-by: Cengiz Can <cengiz.can at canonical.com>
> kernel/rcu/rcu.h | 6 +++
> kernel/rcu/tasks.h | 131 ++++++++++++++++++++++++++++++++++-----------
> kernel/rcu/tiny.c | 1 +
> kernel/rcu/tree.c | 2 +
> 4 files changed, 108 insertions(+), 32 deletions(-)
>
> --
> 2.50.0
>
>
> --
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team