[SRU][N][PATCH 0/4] rcu: Eliminate deadlocks involving do_exit() and RCU tasks

Mon Jul 21 03:44:09 UTC 2025

BugLink: https://bugs.launchpad.net/bugs/2117123

[Impact]

Tracing tools, such as eBPF fentry programs, can remain attached to a task until
very late in do_exit(). Because of this, synchronize_rcu_tasks() needs to wait
for the dying task to finish and the tracer to be removed, even though the task
is no longer on the task list. This is explained in:

3f95aa81d265 ("rcu: Make TASKS_RCU handle tasks that are almost done exiting")

> Once a task has passed exit_notify() in the do_exit() code path, it is no 
> longer on the task lists, and is therefore no longer visible to 
> rcu_tasks_kthread().

SRCU (tasks_rcu_exit_srcu) was brought in to handle this issue, to wait for
tasks that could still be in a critical section but are no longer on the task
list. Unfortunately, there has been a class of deadlocks in do_exit() for years
that has been largely ignored, but it was recently reproduced by a syzkaller
script:

https://github.com/xupengfe/syzkaller_logs/blob/main/221115_105658_synchronize_rcu/repro.c

Frederic Weisbecker provides the following analysis:

1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
   that every subsequent child of TASK A will belong to. But TASK A doesn't
   itself belong to that new PID namespace.

2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
   thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
   and TASK B is the first task belonging to the new PID namespace created by
   unshare()  (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
   child reaper.

4) TASK A forks() again and creates TASK C which gets attached to PID_NS2.
   Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
   TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and since it is the child reaper for PID_NS2, it has to
   kill all other tasks attached to PID_NS2, and wait for all of them to die
   before reaping itself (zap_pid_ns_processes()). Note it seems to make a
   misleading assumption here, trusting that all tasks in PID_NS2 either
   get reaped by a parent belonging to the same namespace or by TASK B.
   And it is confident that since it deactivated the SIGCHLD handler, all
   the remaining tasks ultimately autoreap. And it waits for that to happen.
   However TASK C escapes that rule because it will get reaped by its parent
   TASK A belonging to PID_NS1.

6) TASK A calls synchronize_rcu_tasks() which leads to
   synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
   But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
   (exit_notify() is between exit_tasks_rcu_start() and
   exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and since TASK A is its parent, it waits for it to reap TASK C,
   but it can't because TASK A waits for TASK B that waits for TASK C.

So there is a circular dependency:

_ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
  section
_ TASK B waits for TASK C to get reaped
_ TASK C waits for TASK A to reap it.
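
The cycle can also be checked mechanically by treating "waits for" as an edge
in a directed graph. What follows is a minimal userspace sketch, not kernel
code: the task indices and the names wait_graph, add_wait and has_deadlock are
illustrative assumptions. It relies on each task waiting for at most one other
task, as is the case in this scenario.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 8
#define NOBODY    (-1)

/* Illustrative task indices for the scenario described above. */
enum { TASK_A, TASK_B, TASK_C };

/* waits_for[t] is the task that t is blocked on, or NOBODY. */
struct wait_graph {
	int waits_for[MAX_TASKS];
};

static void wait_graph_init(struct wait_graph *g)
{
	for (int i = 0; i < MAX_TASKS; i++)
		g->waits_for[i] = NOBODY;
}

/* Record that 'waiter' cannot make progress until 'target' does. */
static void add_wait(struct wait_graph *g, int waiter, int target)
{
	assert(waiter >= 0 && waiter < MAX_TASKS);
	assert(target >= 0 && target < MAX_TASKS);
	g->waits_for[waiter] = target;
}

/*
 * Follow the wait chain from 'start'. Returns true only if the chain
 * leads back to 'start', i.e. 'start' itself is part of a cycle.
 */
static bool has_deadlock(const struct wait_graph *g, int start)
{
	int cur = start;

	/* A cycle through 'start' has at most MAX_TASKS edges. */
	for (int hops = 0; hops < MAX_TASKS; hops++) {
		cur = g->waits_for[cur];
		if (cur == NOBODY)
			return false;
		if (cur == start)
			return true;
	}
	return false;
}
```

With the three edges A -> B, B -> C and C -> A recorded through add_wait(),
has_deadlock() returns true for each of the three tasks. Removing any one edge
breaks the cycle and it returns false.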

An example stack trace is:

kernel: INFO: task rcu_tasks_trace:15 blocked for more than 121 seconds.
kernel:       Not tainted 6.8.0-63-generic #66-Ubuntu
kernel: task:rcu_tasks_trace state:D stack:0     pid:15    tgid:15    ppid:2      flags:0x00004000
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x27c/0x6b0
kernel:  schedule+0x33/0x110
kernel:  schedule_timeout+0x157/0x170
kernel:  wait_for_completion+0x88/0x150
kernel:  __wait_rcu_gp+0x17e/0x190
kernel:  synchronize_rcu+0x12d/0x140
kernel:  ? __pfx_call_rcu_hurry+0x10/0x10
kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
kernel:  rcu_tasks_trace_postscan+0xe/0x20
kernel:  rcu_tasks_wait_gp+0x119/0x310
kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
kernel:  ? rcu_tasks_need_gpcb+0x1f7/0x350
kernel:  ? __pfx_rcu_tasks_kthread+0x10/0x10
kernel:  rcu_tasks_one_gp+0x122/0x150
kernel:  rcu_tasks_kthread+0xa4/0xd0
kernel:  kthread+0xef/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x44/0x70
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: task:system-probe    state:D stack:0     pid:1989  tgid:1931  ppid:1926   flags:0x00000002
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x27c/0x6b0
kernel:  schedule+0x33/0x110
kernel:  schedule_timeout+0x157/0x170
kernel:  wait_for_completion+0x88/0x150
kernel:  __wait_rcu_gp+0x17e/0x190
kernel:  synchronize_rcu_tasks_generic+0x64/0xe0
kernel:  ? __pfx_call_rcu_tasks_trace+0x10/0x10
kernel:  ? __pfx_wakeme_after_rcu+0x10/0x10
kernel:  synchronize_rcu_tasks_trace+0x15/0x20
kernel:  perf_event_detach_bpf_prog+0x7d/0xe0
kernel:  _free_event+0x20e/0x2a0
kernel:  perf_event_release_kernel+0x281/0x2e0
kernel:  perf_release+0x15/0x30
kernel:  __fput+0xa0/0x2e0
kernel:  __fput_sync+0x1c/0x30
kernel:  __x64_sys_close+0x3e/0x90
kernel:  x64_sys_call+0x1fec/0x25a0
kernel:  do_syscall_64+0x7f/0x180
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? filp_flush+0x57/0x90
kernel:  ? syscall_exit_to_user_mode+0x86/0x260
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? restore_fpregs_from_fpstate+0x3d/0xd0
kernel:  ? switch_fpu_return+0x55/0xf0
kernel:  ? filp_flush+0x57/0x90
kernel:  ? syscall_exit_to_user_mode+0x86/0x260
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? filp_flush+0x57/0x90
kernel:  ? syscall_exit_to_user_mode+0x86/0x260
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? do_syscall_64+0x8c/0x180
kernel:  ? irqentry_exit_to_user_mode+0x7b/0x260
kernel:  ? irqentry_exit+0x43/0x50
kernel:  entry_SYSCALL_64_after_hwframe+0x78/0x80

[Fix]

The entire patchset is listed below. Three of the seven commits have already
been applied to ubuntu-noble as dependencies of another commit, so we only need
the four missing commits.

This was mainlined in 6.9-rc1 by the following commits:

commit 2eb52fa8900e642b3b5054c4bf9776089d2a935f
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Mon Dec 4 09:33:29 2023 -0800
Subject: rcu-tasks: Repair RCU Tasks Trace quiescence check
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2eb52fa8900e642b3b5054c4bf9776089d2a935f
Applied: Yes. ubuntu-noble 7e16c7d2a1ee

commit bfe93930ea1ea3c6c115a7d44af6e4fea609067e
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Mon Feb 5 13:08:22 2024 -0800
Subject: rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfe93930ea1ea3c6c115a7d44af6e4fea609067e
Applied: Yes. ubuntu-noble b9014deb33e6

commit 30ef09635b9ed3ebca4f677495332a2e444a5cda
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Thu Feb 22 12:29:54 2024 -0800
Subject: rcu-tasks: Initialize callback lists at rcu_init() time
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=30ef09635b9ed3ebca4f677495332a2e444a5cda
Applied: No. Needed.

commit 46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Mon Feb 5 13:10:19 2024 -0800
Subject: rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=46faf9d8e1d52e4a91c382c6c72da6bd8e68297b
Applied: Yes. ubuntu-noble c8da4b0160db

commit 6b70399f9ef3809f6e308fd99dd78b072c1bd05c
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Fri Feb 2 11:28:45 2024 -0800
Subject: rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6b70399f9ef3809f6e308fd99dd78b072c1bd05c
Applied: No. Needed.

commit 1612160b91272f5b1596f499584d6064bf5be794
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Fri Feb 2 11:49:06 2024 -0800
Subject: rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1612160b91272f5b1596f499584d6064bf5be794
Applied: No. Needed.

commit 0bb11a372fc8d7006b4d0f42a2882939747bdbff
Author: Paul E. McKenney <paulmck at kernel.org>
Date:   Thu Feb 1 06:10:26 2024 -0800
Subject: rcu-tasks: Maintain real-time response in rcu_tasks_postscan()
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0bb11a372fc8d7006b4d0f42a2882939747bdbff
Applied: No. Needed.

The 4 needed commits are all clean cherry picks.

[Testcase]

To reproduce the do_exit() deadlock using the syzkaller repro:

$ sudo apt install build-essential
$ wget https://raw.githubusercontent.com/xupengfe/syzkaller_logs/refs/heads/main/221115_105658_synchronize_rcu/repro.c
$ gcc -o repro repro.c
$ sudo ./repro
$ journalctl -f -t kernel

Due to the high regression risk of this patchset, we should run rcutorture, the
RCU test suite, on a patched kernel to ensure there are no deadlocks.

To run rcutorture on the kernel build:

Documentation:
https://docs.kernel.org/RCU/torture.html

1) Clone the kernel source code
2) Save the following patch, which enables CONFIG_RCU_TORTURE_TEST, as
0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch:
https://launchpadlibrarian.net/805611005/0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
3) $ git am 0001-UBUNTU-Config-Enable-CONFIG_RCU_TORTURE_TEST.patch
4) Build a new kernel with the patch applied, boot into it.
5) $ sudo modprobe rcutorture
6) Follow the kernel log:
$ journalctl -f -t kernel
kernel: rcu-torture: rcu_torture_read_exit: Start of episode
kernel: rcu-torture: rcu_torture_read_exit: End of episode
kernel: rcu_torture_fwd_prog_nr: 0 Duration 50060 cver 1081 gps 1490
kernel: rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x80() 0
kernel: rcu-torture: rtc: 00000000c099ebf1 ver: 62341 tfle: 0 rta: 62342 rtaf: 0 rtf: 62331 rtmbe: 0 rtmbkf: 0/48597 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 1396993 onoff: 0/0:0/0 -1,0:-1,0 0:0 (HZ=1000) barrier: 0/0:0 read-exits: 1792 nocb-toggles: 0:0
kernel: rcu-torture: Reader Pipe:  2350715188 99444 0 0 0 0 0 0 0 0 0
kernel: rcu-torture: Reader Batch:  2350551525 263107 0 0 0 0 0 0 0 0 0
kernel: rcu-torture: Free-Block Circulation:  62341 62340 62339 62338 62336 62335 62334 62333 62332 62331 0

Read the documentation and ensure you see "Success" and no "FAILURE" messages.
Ensure all the values that should be 0 are indeed 0.
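
One check that lends itself to automation is the "Reader Pipe" histogram. The
sketch below assumes, based on a reading of the rcutorture documentation linked
above, that every entry after the first two should be zero. Please verify that
rule against the documentation before relying on it. The function name
reader_pipe_ok is hypothetical and not part of rcutorture.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return true if 'line' contains a "Reader Pipe:" histogram with at least
 * two entries and every entry after the first two is zero. Return false
 * if a later entry is nonzero, if a value is out of range, or if the line
 * does not contain such a histogram.
 */
static bool reader_pipe_ok(const char *line)
{
	static const char tag[] = "Reader Pipe:";
	const char *p = strstr(line, tag);
	int idx = 0;

	if (!p)
		return false;
	p += sizeof(tag) - 1;	/* skip past the tag itself */

	for (;;) {
		char *end;
		unsigned long long v;

		errno = 0;
		v = strtoull(p, &end, 10);	/* skips leading whitespace */
		if (end == p)
			break;			/* no more numbers on the line */
		if (errno == ERANGE)
			return false;
		if (idx >= 2 && v != 0)
			return false;		/* assumed failure condition */
		idx++;
		p = end;
	}
	return idx >= 2;
}
```

Feeding it the "Reader Pipe:" line from the sample output above returns true,
since only the first two entries are nonzero. The other statistics lines still
need to be reviewed by hand.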

Leave rcutorture running for several hours or days.

There is a test kernel available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf411904-config

If you install it, it should not deadlock on the reproducer anymore, and you can
also load the rcutorture kernel module for regression testing.

[Where problems could occur]

We are changing what happens to tasks that are late in do_exit(): they are now
added to a new list that keeps track of them while they could be in an RCU
critical section.
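
To make the idea concrete, here is a loose userspace model of such a list. It
is not the kernel code: the upstream commits describe per-CPU lists with
locking in kernel/rcu/tasks.h, whereas this sketch uses a single unlocked list.
The names model_task, exit_list, exit_start, exit_finish and count_holdouts are
hypothetical. The sketch also assumes, as a simplification, that a task which
has blocked voluntarily does not hold up the grace period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task state this sketch needs. */
struct model_task {
	const char *name;
	bool runnable;			/* false: blocked (assumed quiescent) */
	struct model_task *prev, *next;	/* links on the exit list */
};

/* Hypothetical circular list of tasks that are late in exit. */
struct exit_list {
	struct model_task head;		/* sentinel, never a real task */
};

static void exit_list_init(struct exit_list *l)
{
	l->head.prev = l->head.next = &l->head;
}

/* Model of a task entering the late part of do_exit(): track it. */
static void exit_start(struct exit_list *l, struct model_task *t)
{
	t->next = &l->head;
	t->prev = l->head.prev;
	l->head.prev->next = t;
	l->head.prev = t;
}

/* Model of a task finishing its exit: stop tracking it. */
static void exit_finish(struct model_task *t)
{
	assert(t->prev && t->next);	/* must currently be on a list */
	t->prev->next = t->next;
	t->next->prev = t->prev;
	t->prev = t->next = NULL;
}

/* Grace-period scan: count exiting tasks that still must be waited for. */
static size_t count_holdouts(const struct exit_list *l)
{
	size_t n = 0;

	for (const struct model_task *t = l->head.next; t != &l->head; t = t->next)
		if (t->runnable)
			n++;
	return n;
}
```

In this model, an exiting task that sleeps while waiting for its children (like
TASK B in the analysis above) sits on the list with runnable set to false, so
the scan does not wait for it, and the waiter is no longer tied to that task
leaving its exit path.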

These are large changes to the RCU subsystem, and they affect nearly every other
subsystem of the kernel, as RCU is used everywhere.

If a regression were to occur, it would involve RCU grace periods getting stuck,
leading to deadlocks and hung task timeouts with no real workarounds.

We need to ensure we test this change with rcutorture for the whole duration the
kernel is in -proposed.

[Other info]

Upstream mailing list report:
https://lore.kernel.org/lkml/Y3sOgrOmMQqPMItu@xpf.sh.intel.com/T/#u

Paul E. McKenney's architecture document:
https://docs.google.com/document/d/1hJxgiZ5TMZ4YJkdJPLAkRvq7sYQ-A7svgA8no6i-v8k/edit?usp=sharing

syzkaller scripts, C reproducer, dmesg logs:
https://github.com/xupengfe/syzkaller_logs/tree/main/221115_105658_synchronize_rcu

Upstream mailing list submission:
https://lore.kernel.org/lkml/20240217012745.3446231-1-boqun.feng@gmail.com/T/#u

Paul E. McKenney (4):
  rcu-tasks: Initialize callback lists at rcu_init() time
  rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks
  rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks
  rcu-tasks: Maintain real-time response in rcu_tasks_postscan()

 kernel/rcu/rcu.h   |   6 +++
 kernel/rcu/tasks.h | 131 ++++++++++++++++++++++++++++++++++-----------
 kernel/rcu/tiny.c  |   1 +
 kernel/rcu/tree.c  |   2 +
 4 files changed, 108 insertions(+), 32 deletions(-)

-- 
2.50.0