[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
yuhldr
2062568 at bugs.launchpad.net
Sun Jun 23 13:07:07 UTC 2024
I encountered the same problem. After several days of testing, I can
reproduce it 100% of the time. The setup is Ubuntu 24.04 with a 10 Gb/s
optical fiber connection between the login node and the computing node.
The computing node mounts /home from the login node over NFS. The entire
system is managed with Slurm.
The login node submits jobs that do a large amount of file reading and
writing on /home. Each job uses Python with NumPy to read a local txt
file of about 10 GB, split it into multiple smaller pieces of about
100 MB each, and save them as npy files.
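The workload described above can be approximated with a sketch like the following. This is not my actual program, only an illustration of the I/O pattern; the function name `split_text_to_npy`, the `part_NNNNN.npy` naming, and the assumption that the txt file holds whitespace-separated numeric rows are all hypothetical.

```python
import itertools
from pathlib import Path

import numpy as np


def split_text_to_npy(src, out_dir, rows_per_chunk):
    """Split a numeric text file into consecutive .npy files.

    Assumes each line of `src` is a row of whitespace-separated numbers.
    Returns the list of files written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src) as fh:
        for index in itertools.count():
            # Read at most rows_per_chunk lines so the whole file never
            # has to fit in memory at once.
            lines = list(itertools.islice(fh, rows_per_chunk))
            if not lines:
                break
            chunk = np.loadtxt(lines, ndmin=2)
            target = out_dir / f"part_{index:05d}.npy"
            np.save(target, chunk)
            written.append(target)
    return written
```

Choosing `rows_per_chunk` so that each output file lands near 100 MB, and running many such jobs against an NFS-mounted `out_dir`, would produce a write pattern similar to the one that triggers the hang here.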
I submit 252 similar jobs at once and run them concurrently. Within one
hour, the NFS service on the login node gets stuck. At that point, the
nfs-server service on the login node cannot be restarted, and the login
node cannot ssh to the computing node. Restarting the computing node
does not fix the problem. Restarting the login node does: ssh is
restored, and the computing node automatically reconnects to NFS.
```bash
root 1548 0.0 0.0 5632 1792 ? Ss 18:19 0:00 /usr/sbin/nfsdcld
root 2347 4.6 0.0 0 0 ? D 18:19 8:04 [nfsd]
root 53326 0.0 0.0 0 0 ? D 20:00 0:00 [kworker/u112:2+nfsd4_callbacks]
root 68918 0.0 0.0 2704 1792 ? Is 20:47 0:00 /usr/sbin/rpc.nfsd 0
root 74448 0.0 0.0 9436 2240 pts/6 S+ 21:11 0:00 grep --color=auto --ex
```
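To spot the stuck threads in a listing like the one above without reading it by eye, a small filter along these lines may help. It is a minimal sketch that assumes the rows follow the usual `ps aux` column order (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); the function name `find_uninterruptible` is made up for illustration.

```python
def find_uninterruptible(ps_output):
    """Return (pid, command) for each `ps aux`-style row in state D.

    State D means uninterruptible sleep, which is where the hung
    nfsd thread and the nfsd4_callbacks kworker are sitting.
    """
    blocked = []
    for line in ps_output.splitlines():
        # Split into at most 11 fields so the command keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed row
        stat, command = fields[7], fields[10]
        if stat.startswith("D"):
            blocked.append((int(fields[1]), command))
    return blocked
```

Fed the listing above, it should report PIDs 2347 (`[nfsd]`) and 53326 (`[kworker/u112:2+nfsd4_callbacks]`).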
```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]: root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd state:D stack:0 pid:2347 tgid:2347 ppid:2 flags:0x00004000
Jun 23 20:50:27 icpcs kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd+0x8b/0xe0 [nfsd]
```
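The kernel's hung-task lines have a fixed shape ("INFO: task NAME:PID blocked for more than N seconds."), so they can be pulled out of a journal or dmesg dump mechanically. The following is a rough sketch for doing that; the name `parse_hung_tasks` and the regular expression are my own, written against the message format shown in the logs here.

```python
import re

# Matches the hung-task warning as it appears in the logs in this report.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def parse_hung_tasks(log_text):
    """Return (task name, pid, seconds blocked) for each hung-task warning."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK.finditer(log_text)
    ]
```

Running it over successive log captures gives a quick view of whether the same nfsd PID keeps reappearing with a growing blocked time.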
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to nfs-utils in Ubuntu.
https://bugs.launchpad.net/bugs/2062568
Title:
nfsd gets unresponsive after some hours of operation
Status in nfs-utils package in Ubuntu:
Confirmed
Bug description:
I installed the 24.04 Beta on two test machines that were running
22.04 without issues before. One of them exports two volumes that are
mounted by the other machine, which primarily uses them as a secondary
storage for ccache.
After being up for a couple of hours (happened twice since yesterday
evening) it seems that nfsd on the machine exporting the volumes hangs
on something.
From dmesg on the server (repeated a few times):
[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
[11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
[11183.290587] Call Trace:
[11183.290602] <TASK>
[11183.290606] __schedule+0x27c/0x6b0
[11183.290612] schedule+0x33/0x110
[11183.290615] schedule_timeout+0x157/0x170
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
[11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
[11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[11183.290976] svc_process+0x132/0x1b0 [sunrpc]
[11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
[11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
[11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
[11183.291220] nfsd+0x8b/0xe0 [nfsd]
[11183.291270] kthread+0xef/0x120
[11183.291274] ? __pfx_kthread+0x10/0x10
[11183.291276] ret_from_fork+0x44/0x70
[11183.291279] ? __pfx_kthread+0x10/0x10
[11183.291281] ret_from_fork_asm+0x1b/0x30
[11183.291286] </TASK>
From dmesg on the client (repeated a number of times):
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110
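  The client-side error is a negated errno value. On Linux, 110 is
  ETIMEDOUT, so the client timed out while sending its backchannel
  reply. A minimal sketch for decoding such codes follows; the helper
  name describe_kernel_error is hypothetical, and the numbering it
  relies on is that of the platform Python runs on, so it is only
  meaningful on Linux.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a kernel-style negative errno (e.g. -110) to (name, message).

    Uses the errno table of the platform running this code, which
    matches the kernel's numbering only on Linux.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)
```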
ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: nfs-kernel-server 1:2.6.4-3ubuntu5
ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
Uname: Linux 6.8.0-22-generic x86_64
.etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
ApportVersion: 2.28.1-0ubuntu1
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Apr 19 14:10:25 2024
InstallationDate: Installed on 2024-04-16 (3 days ago)
InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
NFSMounts:
NFSv4Mounts:
ProcEnviron:
LANG=en_US.UTF-8
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm-256color
XDG_RUNTIME_DIR=<set>
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)