[SRU][J][PATCH v2 0/6] Fix hung task when heavily accessing kernfs files

Fri Sep 19 14:52:53 UTC 2025

BugLink: https://bugs.launchpad.net/bugs/2125142

[impact]

When running heavy IO operations on sysfs, the kernel will lock up and many hung tasks will be printed in dmesg.
This issue is due to a bug in the kernfs driver and was fixed upstream through 4 commits:

    06fb4736139f kernfs: change kernfs_rename_lock into a read-write lock.
    c9f2dfb7b59e kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.
    9caf69614225 kernfs: Introduce separate rwsem to protect inode attributes.
    393c3714081a kernfs: switch global kernfs_rwsem lock to per-fs lock

I have backported these commits and also backported two other required commits that these commit depend on:

44a41882575b kernfs: dont take i_lock on inode attr read.
f44a3ca1e533 kernfs: move struct kernfs_root out of the public view.

One of the main signs of hitting this bug, is seeing a huge number of tasks in the Uninterruptible state when analyzing
the core dump:

crash> ps -S
  RU: 99
  UN: 7725
  IN: 6113
  ID: 444

Stack traces can be very different based on the application locking up but what they all have in common is that they are
waiting for a mutex to unlock:

[308188.415026] INFO: task READACTED:15703 blocked for more than 360 seconds.
[308188.415362] Not tainted 5.15.0-153-generic #163-Ubuntu
[308188.415749] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[308188.416054] task:REDACTED state:D stack: 0 pid:15703 ppid: 15160 flags:0x00000000
[308188.416347] Call Trace:
[308188.416634] <TASK>
[308188.416917] __schedule+0x24e/0x590
[308188.417202] schedule+0x69/0x110
[308188.417506] schedule_preempt_disabled+0xe/0x20
[308188.417789] __mutex_lock.constprop.0+0x267/0x490
[308188.418052] ? rtnl_getlink+0x420/0x420
[308188.418311] __mutex_lock_slowpath+0x13/0x20
[308188.418575] mutex_lock+0x38/0x50
[308188.418818] __netlink_dump_start+0xbf/0x2f0
[308188.419061] ? rtnl_getlink+0x420/0x420
[308188.419321] rtnetlink_rcv_msg+0x2af/0x400
[308188.419564] ? rtnl_getlink+0x420/0x420
[308188.419807] ? rtnl_calcit.isra.0+0x130/0x130
[308188.420051] netlink_rcv_skb+0x53/0x100
[308188.420297] rtnetlink_rcv+0x15/0x20
[308188.420539] netlink_unicast+0x220/0x340
[308188.420781] netlink_sendmsg+0x24b/0x4c0
[308188.421023] __sock_sendmsg+0x66/0x70
[308188.421262] __sys_sendto+0x113/0x190
[308188.421511] ? __audit_syscall_exit+0x269/0x2d0
[308188.421754] ? __audit_syscall_entry+0xde/0x120
[308188.422024] __x64_sys_sendto+0x24/0x30
[308188.422258] x64_sys_call+0x1bcb/0x1fa0
[308188.422485] do_syscall_64+0x56/0xb0
[308188.422711] entry_SYSCALL_64_after_hwframe+0x6c/0xd6
[308188.422946] RIP: 0033:0x48fbae
[308188.423196] RSP: 002b:000000c0038b1480 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
[308188.423454] RAX: ffffffffffffffda RBX: 000000000000008a RCX: 000000000048fbae
[308188.423694] RDX: 0000000000000011 RSI: 000000c0074b84e0 RDI: 000000000000008a
[308188.423935] RBP: 000000c0038b14c0 R08: 000000c0074b84d4 R09: 000000000000000c
[308188.424180] R10: 0000000000000000 R11: 0000000000000216 R12: 000000c0074b84e0
[308188.424422] R13: 0000000000000000 R14: 000000c0039a2540 R15: 00000000000007c6
[308188.424667] </TASK>

The issue currently affects Jammy 5.15 kernel and is not present in the 6.8 kernel or above.

[Test Plan]

Reproducing the issue requires a machine that is able to run a high number of applications concurrently that are hammering
kernfs with I/O requests.

commit "9caf69614225 kernfs: Introduce separate rwsem to protect inode attributes" contains a C program snippet reproducer.

Although in this loop the sysfs being accessed is for an infiniband device, having a any similar program that reads files
from sysfs with multiple instances of the program running, will trigger the issue.
An example could be reading data from "/sys/class/net/<interface>/statistics/*" instead.

In other words running multiple instances of a program that does IO operations on sysfs would confirm if the issue remains
or is resolved. To reliably reproduce or confirm the resolution of the issue, dozens if not hundreds of instances should
be executed.

[ Where problems could occur ]

* Since the fix changes the semaphore being used to access kernfs files from one that protects an entire kernfs filesystem
to one that protects individual inodes, there is a chance that this might cause race conditions on read/write.
* This might also cause deadlocks similar to the ones this bug is reporting.

Greg Kroah-Hartman (1):
  kernfs: move struct kernfs_root out of the public view.

Ian Kent (1):
  kernfs: dont take i_lock on inode attr read

Imran Khan (3):
  kernfs: Introduce separate rwsem to protect inode attributes.
  kernfs: Use a per-fs rwsem to protect per-fs list of
    kernfs_super_info.
  kernfs: change kernfs_rename_lock into a read-write lock.

Minchan Kim (1):
  kernfs: switch global kernfs_rwsem lock to per-fs lock

 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 fs/kernfs/dir.c                        | 145 ++++++++++++++++---------
 fs/kernfs/file.c                       |   6 +-
 fs/kernfs/inode.c                      |  26 +++--
 fs/kernfs/kernfs-internal.h            |  20 ++++
 fs/kernfs/mount.c                      |  15 ++-
 fs/kernfs/symlink.c                    |   5 +-
 fs/sysfs/mount.c                       |   2 +-
 include/linux/kernfs.h                 |   6 +
 kernel/cgroup/cgroup.c                 |   4 +-
 10 files changed, 152 insertions(+), 81 deletions(-)

-- 
2.34.1