[Bug 2084023] [NEW] "rcu: INFO: rcu_sched self-detected stall on" CPU caused by nfs
Mehmet Basaran
2084023 at bugs.launchpad.net
Wed Oct 9 09:55:39 UTC 2024
Public bug reported:
This bug report has been opened because of
https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/2062568/comments/25. Contents:
We installed the unofficial kernel 6.8.0-46-generic-nfs on several NFS client servers on Saturday and have been testing it with high IO loads since then.
Unfortunately the server crashed again after about 40 hours with "rcu: INFO: rcu_sched self-detected stall on CPU".
The kernel 6.8.0-46-generic-nfs prevents the error message "RPC: Could not send backchannel reply error: -110",
but not the crashes we have been struggling with since August 19th, when we switched the kernel from 6.5.0-44-generic to 6.8.0-40-generic.
Our experiences with NFS server crashes are:
- We were able to reproduce the crashes in our production and test environments. They occurred rarely after minutes and sometimes after hours or days; sometimes none occurred at all,
though we often stopped the experiments after 12 to 24 hours.
- We have not yet been able to reproduce a crash between a bare metal NFS server and a bare metal NFS client, but we could reproduce one between a bare metal NFS server and a virtualized client.
- We could not reproduce a crash with NFS vers=4.0.
- The crashes happen with or without GSSPROXY.
Our setup:
- virtualized NFS 4.2 server with Ubuntu 22.04.5 LTS and kernel 5.15.0-122-generic
- virtualized NFS client with Ubuntu 22.04.5 LTS and kernel 6.8.0-40-generic or kernel 6.8.0-45-generic
- /etc/exports : /mnt/home nfsclient(sec=krb5,rw,root_squash,sync,no_subtree_check)
- /etc/fstab : nfsserver:/mnt/home /home nfs vers=4.2,rw,soft,sec=krb5,proto=tcp 0 0
- apt info nfs-common : Version: 1:2.6.1-1ubuntu1.2
syslog of NFS server after crash:
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu: 54-....: (14998 ticks this GP) idle=2db/1/0x4000000000000000 softirq=32173387/32173387 fqs=7449
Sep 30 01:15:51 nfs-server.domain.de kernel: (t=15000 jiffies g=144775177 q=49782)
Sep 30 01:15:51 nfs-server.domain.de kernel: NMI backtrace for cpu 54
Sep 30 01:15:51 nfs-server.domain.de kernel: CPU: 54 PID: 153154 Comm: kworker/u480:36 Not tainted 5.15.0-122-generic #132-Ubuntu
Sep 30 01:15:51 nfs-server.domain.de kernel: Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
Sep 30 01:15:51 nfs-server.domain.de kernel: Workqueue: rpciod rpc_async_schedule [sunrpc]
Sep 30 01:15:51 nfs-server.domain.de kernel: Call Trace:
Sep 30 01:15:51 nfs-server.domain.de kernel: <IRQ>
Sep 30 01:15:51 nfs-server.domain.de kernel: show_stack+0x52/0x5c
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack_lvl+0x4a/0x63
Sep 30 01:15:51 nfs-server.domain.de kernel: dump_stack+0x10/0x16
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_cpu_backtrace.cold+0x4d/0x93
Sep 30 01:15:51 nfs-server.domain.de kernel: ? lapic_can_unplug_cpu+0x90/0x90
Sep 30 01:15:51 nfs-server.domain.de kernel: nmi_trigger_cpumask_backtrace+0xec/0x100
Sep 30 01:15:51 nfs-server.domain.de kernel: arch_trigger_cpumask_backtrace+0x19/0x20
Sep 30 01:15:51 nfs-server.domain.de kernel: trigger_single_cpu_backtrace+0x44/0x4f
Sep 30 01:15:51 nfs-server.domain.de kernel: rcu_dump_cpu_stacks+0x102/0x149
Sep 30 01:15:51 nfs-server.domain.de kernel: print_cpu_stall.cold+0x2f/0xe2
Sep 30...
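For anyone triaging similar logs, the stall summary line above can be decoded with a short script. This is a minimal sketch, not part of the original report: the function name `parse_stall_summary` is hypothetical, and the tick rate `hz=250` is an assumption about the kernel's CONFIG_HZ that should be checked on the affected host. The field meanings (t = jiffies since the grace period started, g = grace-period sequence number, q = queued RCU callbacks) follow the kernel's RCU stall-warning documentation.

```python
import re

# Matches the summary line printed after the per-CPU stall line,
# e.g. "(t=15000 jiffies g=144775177 q=49782)".
STALL_RE = re.compile(r"\(t=(\d+) jiffies g=(\d+) q=(\d+)\)")


def parse_stall_summary(line, hz=250):
    """Return stall details from one syslog line, or None if it does not match.

    hz is an ASSUMPTION (the kernel's CONFIG_HZ); verify it on the real host.
    """
    m = STALL_RE.search(line)
    if m is None:
        return None
    jiffies, grace_period, queued = (int(x) for x in m.groups())
    return {
        "jiffies": jiffies,                # ticks since the grace period began
        "grace_period": grace_period,      # grace-period sequence number
        "queued_callbacks": queued,        # RCU callbacks waiting to run
        "stall_seconds": jiffies / hz,     # only as accurate as the hz guess
    }


line = ("Sep 30 01:15:51 nfs-server.domain.de kernel: "
        "(t=15000 jiffies g=144775177 q=49782)")
print(parse_stall_summary(line))
```

Under the assumed tick rate, the 15000 jiffies reported here correspond to a stall of about 60 seconds on CPU 54.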
** Affects: linux (Ubuntu)
Importance: Medium
Assignee: Mehmet Basaran (mehmetbasaran)
Status: In Progress
** Package changed: nfs-utils (Ubuntu) => linux
** Changed in: linux
Importance: Undecided => Medium
** Changed in: linux
Status: New => Incomplete
** Changed in: linux
Assignee: (unassigned) => Mehmet Basaran (mehmetbasaran)
** Also affects: linux (Ubuntu)
Importance: Undecided
Status: New
** No longer affects: linux
** Changed in: linux (Ubuntu)
Assignee: (unassigned) => Mehmet Basaran (mehmetbasaran)
** Changed in: linux (Ubuntu)
Status: New => In Progress
** Changed in: linux (Ubuntu)
Importance: Undecided => Medium
** Changed in: linux (Ubuntu)
Assignee: Mehmet Basaran (mehmetbasaran) => (unassigned)
** Changed in: linux (Ubuntu)
Assignee: (unassigned) => Mehmet Basaran (mehmetbasaran)
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to nfs-utils in Ubuntu.
https://bugs.launchpad.net/bugs/2084023
Title:
"rcu: INFO: rcu_sched self-detected stall on" CPU caused by nfs
Status in linux package in Ubuntu:
In Progress
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2084023/+subscriptions