ACK: [SRU][N][PATCH 0/4] rcu: Eliminate deadlocks involving do_exit() and RCU tasks

Thu Aug 7 09:05:23 UTC 2025

On 21-07-25 15:44:09, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2117123
> 
> [Impact]
> 
> Tracing tools, such as eBPF fentry programs, can be attached to tasks all the
> way to very late in do_exit(), and because of this, synchronize_rcu_tasks()
> needs to wait for the dying task to finish and the tracer to be removed, even
> though the task is no longer on the task list. This is explained in:
> 
> 3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")
> 
> > Once a task has passed exit_notify() in the do_exit() code path, it is no 
> > longer on the task lists, and is therefore no longer visible to 
> > rcu_tasks_kthread().
> 
> SRCU was used to handle this issue, waiting for tasks that could still be in
> a critical section but are no longer on the RCU tasks list. Unfortunately,
> there has been a class of deadlocks in do_exit() for years that has been
> largely ignored, but was recently reproduced by a syzkaller script:
> 
> https://github.com/xupengfe/syzkaller_logs/blob/main/221115_105658_synchronize_rcu/repro.c
> 
> Frederic Weisbecker provides the following analysis:
> 
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
>    that every subsequent child of TASK A will belong to. But TASK A doesn't
>    itself belong to that new PID namespace.
> 
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
>    thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
>    and TASK B is the first task belonging to the new PID namespace created by
>    unshare()  (let's call it PID_NS2).
> 
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
>    child reaper.
> 
> 4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
>    Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
>    TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
> 
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
>    kill all other tasks attached to PID_NS2, and wait for all of them to die
>    before reaping itself (zap_pid_ns_processes()). Note it seems to make a
>    misleading assumption here, trusting that all tasks in PID_NS2 either
>    get reaped by a parent belonging to the same namespace or by TASK B.
>    And it is confident that since it deactivated the SIGCHLD handler, all
>    the remaining tasks ultimately autoreap. And it waits for that to happen.
>    However TASK C escapes that rule because it will get reaped by its parent
>    TASK A belonging to PID_NS1.
> 
> 6) TASK A calls synchronize_rcu_tasks() which leads to
>    synchronize_srcu(&tasks_rcu_exit_srcu).
> 
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
>    But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
>    (exit_notify() is between exit_tasks_rcu_start() and
>    exit_tasks_rcu_finish()), blocking TASK A.
> 
> 8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C,
>    but it can't because TASK A waits for TASK B that waits for TASK C.
> 
> So there is a circular dependency:
> 
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
> section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
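
For anyone who wants to see the cycle mechanically, here is a minimal Python
sketch that models the three waits above as a wait-for graph and follows the
edges until a task repeats. The WAITS_FOR dictionary and the find_cycle()
helper are illustrative names of mine, not kernel identifiers.

```python
# Wait-for graph for the scenario above: each task maps to the task it is
# blocked on. The keys are labels from the analysis, not kernel identifiers.
WAITS_FOR = {
    "TASK A": "TASK B",  # A waits for B to leave the tasks_rcu_exit_srcu section
    "TASK B": "TASK C",  # B waits for C to get reaped
    "TASK C": "TASK A",  # C waits for A to reap it
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = {}
    task = start
    while task is not None:
        if task in seen:
            # Revisited a task: everything from its first visit on is the cycle.
            return path[seen[task]:]
        seen[task] = len(path)
        path.append(task)
        task = waits_for.get(task)
    return None


cycle = find_cycle(WAITS_FOR, "TASK A")
print(" -> ".join(cycle + cycle[:1]) if cycle else "no cycle")
```

Dropping any one of the three edges from the dictionary makes find_cycle()
return None.
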
> 
> An example stack trace is:
> 
> kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.
> kernel:       Not tainted 6.8.0-63-generic #66-Ubuntu
> kernel: task:rcu_tasks_trace state:D stack:0     pid:15    tgid:15    ppid:2      flags:0x00004000
> kernel: Call Trace:
> kernel:  <TASK>
> kernel:  __schedule+0x27c/0x6b0
> kernel:  schedule+0x33/0x110
> kernel:  schedule_timeout+0x157/0x170
> kernel:  wait_for_completion+0x88/0x150
> kernel:  __wait_rcu_gp+0x17e/0x190
> kernel:  synchronize_rcu+0x12d/0x140
> kernel:  ? __pfx_call_rcu_hurry+0x10/0x10
> kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
> kernel:  rcu_tasks_trace_postscan+0xe/0x20
> kernel:  rcu_tasks_wait_gp+0x119/0x310
> kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
> kernel:  ? rcu_tasks_need_gpcb+0x1f7/0x350
> kernel:  ? __pfx_rcu_tasks_kthread+0x10/0x10
> kernel:  rcu_tasks_one_gp+0x122/0x150
> kernel:  rcu_tasks_kthread+0xa4/0xd0
> kernel:  kthread+0xef/0x120
> kernel:  ? __pfx_kthread+0x10/0x10
> kernel:  ret_from_fork+0x44/0x70
> kernel:  ? __pfx_kthread+0x10/0x10
> kernel:  ret_from_fork_asm+0x1b/0x30
> kernel:  </TASK>
> kernel: task:system-probe    state:D stack:0     pid:1989  tgid:1931  ppid:1926   flags:0x00000002
> kernel: Call Trace:
> kernel:  <TASK>
> kernel:  __schedule+0x27c/0x6b0
> kernel:  schedule+0x33/0x110
> kernel:  schedule_timeout+0x157/0x170
> kernel:  wait_for_completion+0x88/0x150
> kernel:  __wait_rcu_gp+0x17e/0x190
> kernel:  synchronize_rcu_tasks_generic+0x64/0xe0
> kernel:  ? __pfx_call_rcu_tasks_trace+0x10/0x10
> kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
> kernel:  synchronize_rcu_tasks_trace+0x15/0x20
> kernel:  perf_event_detach_bpf_prog+0x7d/0xe0
> kernel:  _free_event+0x20e/0x2a0
> kernel:  perf_event_release_kernel+0x281/0x2e0
> kernel:  perf_release+0x15/0x30
> kernel:  __fput+0xa0/0x2e0
> kernel:  __fput_sync+0x1c/0x30
> kernel:  __x64_sys_close+0x3e/0x90
> kernel:  x64_sys_call+0x1fec/0x25a0
> kernel:  do_syscall_64+0x7f/0x180
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? filp_flush+0x57/0x90
> kernel:  ? syscall_exit_to_user_mode+0x86/0x260
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? restore_fpregs_from_fpstate+0x3d/0xd0
> kernel:  ? switch_fpu_return+0x55/0xf0
> kernel:  ? filp_flush+0x57/0x90
> kernel:  ? syscall_exit_to_user_mode+0x86/0x260
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? filp_flush+0x57/0x90
> kernel:  ? syscall_exit_to_user_mode+0x86/0x260
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? do_syscall_64+0x8c/0x180
> kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
> kernel:  ? irqentry_exit+0x43/0x50
> kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80
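
When going through logs from a test machine, a small filter along these lines
can pull out the hung-task reports. It is only a sketch: it assumes the
"INFO: task NAME:PID blocked for more than N seconds." header format shown in
the trace above, and the regular expression and the hung_tasks() name are mine.

```python
import re

# Matches the hung-task header as it appears in the trace above, e.g.
# "kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def hung_tasks(lines):
    """Return (comm, pid, seconds) for every hung-task header in lines."""
    found = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            found.append((match["comm"], int(match["pid"]), int(match["secs"])))
    return found


sample = [
    "kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.",
    "kernel:       Not tainted 6.8.0-63-generic #66-Ubuntu",
]
print(hung_tasks(sample))
```

Passing sys.stdin instead of the sample list lets the function read piped
journal output.
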
> 
> [Fix]
> 
> The entire patchset is listed below. 3 out of the 7 have already been applied to
> ubuntu-noble due to being a dependency of another commit. We only need the 4
> missing commits.
> 
> This was mainlined in 6.9-rc1 by the following commits:
> 
> commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Mon Dec 4 09:33:29 2023 -0800
> Subject: rcu-tasks: Repair RCU Tasks Trace quiescence check
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb52fa8900e642b3b5054c4bf9776089d2a935f
> Applied: Yes. ubuntu-noble 7e16c7d2a1ee
> 
> commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Mon Feb 5 13:08:22 2024 -0800
> Subject: rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe93930ea1ea3c6c115a7d44af6e4fea609067e
> Applied: Yes. ubuntu-noble b9014deb33e6
> 
> commit 30ef09635b9ed3ebca4f677495332a2e444a5cda
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Thu Feb 22 12:29:54 2024 -0800
> Subject: rcu-tasks: Initialize callback lists at rcu_init() time
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30ef09635b9ed3ebca4f677495332a2e444a5cda
> Applied: No. Needed.
> 
> commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Mon Feb 5 13:10:19 2024 -0800
> Subject: rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
> Applied: Yes. ubuntu-noble c8da4b0160db
> 
> commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Fri Feb 2 11:28:45 2024 -0800
> Subject: rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6b70399f9ef3809f6e308fd99dd78b072c1bd05c
> Applied: No. Needed.
> 
> commit 1612160b91272f5b1596f499584d6064bf5be794
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Fri Feb 2 11:49:06 2024 -0800
> Subject: rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1612160b91272f5b1596f499584d6064bf5be794
> Applied: No. Needed.
> 
> commit 0bb11a372fc8d7006b4d0f42a2882939747bdbff
> Author: Paul E. McKenney <paulmck at kernel.org>
> Date:   Thu Feb 1 06:10:26 2024 -0800
> Subject: rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bb11a372fc8d7006b4d0f42a2882939747bdbff
> Applied: No. Needed.
> 
> The 4 needed commits are all clean cherry picks.
> 
> [Testcase]
> 
> To reproduce the do_exit() deadlock using the syzkaller repro:
> 
> $ sudo apt install build-essential
> $ wget https://raw.githubusercontent.com/xupengfe/syzkaller_logs/refs/heads/main/221115_105658_synchronize_rcu/repro.c
> $ gcc -o repro repro.c
> $ sudo ./repro
> $ journalctl -f -t kernel
> 
> Due to the high regression risk of this patchset, we should run rcutorture, the
> rcu test suite, over a patched kernel to ensure there are no deadlocks.
> 
> To run rcutorture on the kernel build:
> 
> Documentation:
> https://docs.kernel.org/RCU/torture.html
> 
> 1) Clone the kernel source code
> 2) Save the following patch, which enables CONFIG_RCU_TORTURE_TEST, as
> 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch:
> https://launchpadlibrarian.net/805611005/0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
> 3) $ git am 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
> 4) Build a new kernel with the patch applied, boot into it.
> 5) $ sudo modprobe rcutorture
> 6) Follow dmesg.
> $ journalctl -f -t kernel
> kernel: rcu-torture: rcu_torture_read_exit: Start of episode
> kernel: rcu-torture: rcu_torture_read_exit: End of episode
> kernel: rcu_torture_fwd_prog_nr: 0 Duration 50060 cver 1081 gps 1490
> kernel: rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x80() 0
> kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 rta: 62342 rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 1396993 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=1000) barrier: 0/0:0 read-exits: 1792 nocb-toggles: 0:0
> kernel: rcu-torture: Reader Pipe:  2350715188 99444 0 0 0 0 0 0 0 0 0
> kernel: rcu-torture: Reader Batch:  2350551525 263107 0 0 0 0 0 0 0 0 0
> kernel: rcu-torture: Free-Block Circulation:  62341 62340 62339 62338 62336 62335 62334 62333 62332 62331 0
> 
> Read the documentation and ensure you see "Success" and no "FAILURE" messages.
> Ensure all the values that should be 0 are indeed 0.
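
To make the "should be 0" check less error-prone over a multi-day run, a
sketch like the following could be used. The set of counters in MUST_BE_ZERO
and the "buckets after the first two must be zero" rule for the Reader Pipe
line are my reading of the rcutorture documentation linked above, so treat
them as assumptions and confirm them there; the function names are mine.

```python
import re

# Counters that, as I read the rcutorture documentation, should stay at zero.
# This list is an assumption; check it against the docs before relying on it.
MUST_BE_ZERO = ("rtmbe", "rtbe", "rtbke", "rtbf")


def check_stats(line):
    """Return the names of must-be-zero counters that are non-zero in line."""
    bad = []
    for name in MUST_BE_ZERO:
        match = re.search(rf"\b{name}: (\d+)", line)
        if match and int(match.group(1)) != 0:
            bad.append(name)
    return bad


def check_reader_pipe(line):
    """Return True if every Reader Pipe bucket after the first two is zero."""
    buckets = [int(n) for n in line.split("Reader Pipe:")[1].split()]
    return all(n == 0 for n in buckets[2:])


# Lines copied (the first one shortened) from the sample output above.
stats = ("kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 "
         "rta: 62342 rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 "
         "rtbke: 0 rtbf: 0 rtb: 0 nt: 1396993")
pipe = "kernel: rcu-torture: Reader Pipe:  2350715188 99444 0 0 0 0 0 0 0 0 0"
print(check_stats(stats), check_reader_pipe(pipe))
```
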
> 
> Leave rcutorture running for several hours / days.
> 
> There is a test kernel available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf411904-config
> 
> If you install it, it should not deadlock on the reproducer anymore, and you can
> also load the rcutorture kernel module for regression testing.
> 
> [Where problems could occur]
> 
> We are changing what happens to tasks that are late in do_exit(), and are now
> adding them to a new list to keep track of them while they could be in an RCU
> critical section.
> 
> These are some large changes to the RCU subsystem, and they affect nearly every
> other subsystem of the kernel, as RCU is used everywhere.
> 
> If a regression were to occur, it would involve RCU grace periods getting stuck,
> leading to deadlocks and hung task timeouts with no real workarounds.
> 
> We need to ensure we test this change with rcutorture for the whole duration the
> kernel is in -proposed.
> 
> [Other info]
> 
> Upstream mailing list report:
> https://lore.kernel.org/lkml/Y3sOgrOmMQqPMItu@xpf.sh.intel.com/T/#u
> 
> Paul E. McKenney's architecture document:
> https://docs.google.com/document/d/1hJxgiZ5TMZ4YJkdJPLAkRvq7sYQ-A7svgA8no6i-v8k/edit?usp=sharing
> 
> syzkaller scripts, C reproducer, dmesg logs:
> https://github.com/xupengfe/syzkaller_logs/tree/main/221115_105658_synchronize_rcu
> 
> Upstream mailing list submission:
> https://lore.kernel.org/lkml/20240217012745.3446231-1-boqun.feng@gmail.com/T/#u
> 
> Paul E. McKenney (4):
>   rcu-tasks: Initialize callback lists at rcu_init() time
>   rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
>   rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
>   rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
> 

Acked-by: Cengiz Can <cengiz.can at canonical.com>

>  kernel/rcu/rcu.h   |   6 +++
>  kernel/rcu/tasks.h | 131 ++++++++++++++++++++++++++++++++++-----------
>  kernel/rcu/tiny.c  |   1 +
>  kernel/rcu/tree.c  |   2 +
>  4 files changed, 108 insertions(+), 32 deletions(-)
> 
> -- 
> 2.50.0
> 
> 
> -- 
> kernel-team mailing list
> kernel-team at lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team