[Bug 2062568] Re: nfsd gets unresponsive after some hours of operation
Jeff
2062568 at bugs.launchpad.net
Tue Jun 18 03:47:32 UTC 2024
The tricky part for me is that the client was changing regularly, so I
can't confidently say when the errors started appearing. It's just very
suspicious that the issue needs high load (a new host and higher network
bandwidth made it more frequent) and needs uploading to the server: pure
downloading doesn't seem to be a problem, even when cached data is sent
at full bandwidth for minutes.
Moved the server to 24.04, but I've also moved some I/O heavy tasks to
it so there would be less need of uploading. Client was on 23.10 and I'm
still holding back on upgrading for some more weeks.
I can't say a whole lot about the current situation, as I'm not uploading much anymore in order to avoid the issue. I did run into a hang a few days ago, though. I didn't have time to debug it, and the server wouldn't restart gracefully, so I ended up hard rebooting.
I believe it was the first time since moving the I/O heavy tasks: I wanted to upload a few hundred GiB of data back to the server, data which had been downloaded from there a while ago without problems. Light I/O otherwise doesn't seem to trigger the problem. The occasional backup to the server is fine, for example, but that rarely saturates the network and likely fits completely into the page cache almost every time.
A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, download alone seems to be unaffected; uploading is what should be stressed. I suspect that either there's no need to download at the same time, or that casual filesystem browsing is a good enough load.
- A fast client with high bandwidth is key. Ran into this issue a couple times with an older host on 1 Gb/s, but a new fast host with 2.5 Gb/s made the issue appear significantly more frequently.
- Likely doesn't matter how the link gets saturated, but I either processed files cached on the server (mixed R/W), or uploaded cached files (fast SSD should be fine too), meaning that the bottleneck was always the network at least while the caches were large enough.
- Files were large, so there were no pauses for metadata handling as there would be with small files, and the page cache was often exhausted. The target was a single HDD the majority of the time, which often meant that writes started blocking (a 100-ish MiB/s HDD catching up with close to 250 MiB/s of incoming data), occasionally making the hosts freeze. The kernel's background I/O handling is still bad; we just pretend the issue is gone because SSDs are fast enough not to run into it. The freezes while the page cache drains may be good at exposing race conditions.
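As a back-of-the-envelope illustration of the last point, here is a minimal sketch of how quickly a sustained upload can outrun the disk. The two rates are the ones quoted above; the 8 GiB dirty-page budget is purely an assumption for illustration, since the server's RAM size and dirty limits were not reported.

```python
# Rough estimate of how long a sustained upload can run before writes block.
# Rates come from the description above; the cache budget is an ASSUMED
# figure, not a measurement from the affected server.

MIB = 1024 * 1024


def seconds_until_writes_block(cache_bytes, ingress_bps, drain_bps):
    """Time for dirty data to fill cache_bytes when ingress exceeds drain."""
    surplus = ingress_bps - drain_bps
    if surplus <= 0:
        return float("inf")  # the disk keeps up, so the cache never fills
    return cache_bytes / surplus


ingress = 250 * MIB      # network upload rate ("close to 250 MiB/s")
drain = 100 * MIB        # HDD write rate ("100-ish MiB/s")
cache = 8 * 1024 * MIB   # ASSUMPTION: 8 GiB available for dirty pages

print(round(seconds_until_writes_block(cache, ingress, drain), 1))
```

Under these assumed numbers the cache would fill in under a minute, so an upload of a few hundred GiB would spend almost all of its time with writes blocked on the HDD.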
It may be more efficient to start by looking for what's causing the "RPC: Could not send backchannel reply error: -110" log spam, which might be related. The lockup may take significant time to catch, while that kernel message shows up quite frequently.
Even now I have plenty of those lines without experiencing issues and not even uploading much, mostly just downloading large files.
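For anyone trying to track that message, a minimal sketch for pulling its occurrences out of saved dmesg output might look like the following. The sample lines are the client messages from the original report; the function names are made up for this example. On Linux, -110 is -ETIMEDOUT.

```python
import re

# Matches lines such as:
# [ 6596.911785] RPC: Could not send backchannel reply error: -110
PATTERN = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"RPC: Could not send backchannel reply error: (?P<err>-?\d+)"
)


def backchannel_errors(dmesg_text):
    """Return (timestamp_seconds, error_code) for each failed reply."""
    hits = []
    for line in dmesg_text.splitlines():
        m = PATTERN.match(line.strip())
        if m:
            hits.append((float(m.group("ts")), int(m.group("err"))))
    return hits


def gaps(hits):
    """Seconds between consecutive failures, useful for spotting bursts."""
    times = [ts for ts, _ in hits]
    return [round(b - a, 3) for a, b in zip(times, times[1:])]


sample = """\
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110
"""

hits = backchannel_errors(sample)
print(len(hits), gaps(hits))
```

Comparing the gaps against the times of upload activity could show whether the message really correlates with upload load, which is only a suspicion at this point.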
Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4-core Broadwell CPU, possibly helping to expose race condition problems
- All file systems are Btrfs with noatime,discard=async,compress-force=zstd, the latter part surely adding more load
- LUKS is used everywhere, also adding some extra load
- There's a Btrfs (on LUKS) image mounted over NFS (with not a whole lot of usage though)
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to nfs-utils in Ubuntu.
https://bugs.launchpad.net/bugs/2062568
Title:
nfsd gets unresponsive after some hours of operation
Status in nfs-utils package in Ubuntu:
Confirmed
Bug description:
I installed the 24.04 Beta on two test machines that were running
22.04 without issues before. One of them exports two volumes that are
mounted by the other machine, which primarily uses them as a secondary
storage for ccache.
After being up for a couple of hours (happened twice since yesterday
evening) it seems that nfsd on the machine exporting the volumes hangs
on something.
From dmesg on the server (repeated a few times):
[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
[11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
[11183.290587] Call Trace:
[11183.290602] <TASK>
[11183.290606] __schedule+0x27c/0x6b0
[11183.290612] schedule+0x33/0x110
[11183.290615] schedule_timeout+0x157/0x170
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
[11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
[11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[11183.290976] svc_process+0x132/0x1b0 [sunrpc]
[11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
[11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
[11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
[11183.291220] nfsd+0x8b/0xe0 [nfsd]
[11183.291270] kthread+0xef/0x120
[11183.291274] ? __pfx_kthread+0x10/0x10
[11183.291276] ret_from_fork+0x44/0x70
[11183.291279] ? __pfx_kthread+0x10/0x10
[11183.291281] ret_from_fork_asm+0x1b/0x30
[11183.291286] </TASK>
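When several such reports need comparing, a small parser can pull out the blocked task and the nfsd frames it is stuck in. This is only a sketch: the regular expressions assume the dmesg layout shown above, the function names are made up for this example, and frames prefixed with "?" (which the kernel marks as unreliable) are skipped.

```python
import re

HUNG = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)
FRAME = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def hung_tasks(dmesg_text):
    """Return (task name, pid, seconds blocked) for each hung-task report."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG.finditer(dmesg_text)
    ]


def module_frames(dmesg_text, module):
    """Return reliable stack frame symbols that belong to the given module."""
    frames = []
    for line in dmesg_text.splitlines():
        m = FRAME.match(line.strip())
        if m and not m.group("unreliable") and m.group("mod") == module:
            frames.append(m.group("sym"))
    return frames


sample = """\
[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
"""

print(hung_tasks(sample))
print(module_frames(sample, "nfsd"))
```

For the trace above, the innermost nfsd frames are nfsd4_probe_callback_sync called from nfsd4_destroy_session, with the thread waiting in __flush_workqueue.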
From dmesg on the client (repeated a number of times):
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110
ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: nfs-kernel-server 1:2.6.4-3ubuntu5
ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
Uname: Linux 6.8.0-22-generic x86_64
.etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
ApportVersion: 2.28.1-0ubuntu1
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Apr 19 14:10:25 2024
InstallationDate: Installed on 2024-04-16 (3 days ago)
InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
NFSMounts:
NFSv4Mounts:
ProcEnviron:
LANG=en_US.UTF-8
PATH=(custom, no user)
SHELL=/bin/bash
TERM=xterm-256color
XDG_RUNTIME_DIR=<set>
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)