ACK: [SRU][J/N][PATCH 0/1] nfsd hangs and never recovers after NFS4ERR_DELAY and a connection loss

Wed Mar 19 11:58:11 UTC 2025

On 25/03/19 03:21pm, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2103564
> 
> [Impact]
> 
> nfsd loops forever in nfsd4_cb_sequence_done() after it receives a NFS4ERR_DELAY
> and the connection is subsequently lost.
> 
> What happens is that the first NFS4ERR_DELAY reply sets cb->cb_seq_status to
> -10008, but it is never set back to 1, so nfsd4_cb_sequence_done() keeps seeing
> NFS4ERR_DELAY and keeps retrying the callback.
> 
> The stack trace looks like:
> 
> watchdog: BUG: soft lockup - CPU#33 stuck for 22s! [kworker/u120:29:1520679]
> Kernel panic - not syncing: softlockup: hung tasks
> CPU: 33 PID: 1520679 Comm: kworker/u120:29 Tainted: G L 5.15.0-1069-gke #75-Ubuntu
> Workqueue: rpciod rpc_async_schedule [sunrpc]
> Call Trace:
>   RIP: 0010:__rpc_sleep_on_priority_timeout+0x7b/0x110 [sunrpc]
>   Code: 0f b6 f9 66 90 44 89 fa 48 89 de 4d 8d 7e 50 4c 89 f7 e8 c8 fb ff ff 4c 89 6b 28 49 8b 46 50 49 39 c7 74 5a 4d 3b 6e 60 78 54 <49> 8b 56 50 48 8d 43 60 48 89 42 08 48 89 53 60 4c 89 7b 68 49 89
> ...
>   rpc_sleep_on_timeout+0x56/0xa0 [sunrpc]
>   rpc_delay+0x29/0x30 [sunrpc]
>   nfsd4_cb_sequence_done+0x1b9/0x250 [nfsd]
>   nfsd4_cb_done+0x1d/0xf0 [nfsd]
>   rpc_exit_task+0x5c/0x110 [sunrpc]
>   ? __rpc_sleep_on_priority+0x80/0x80 [sunrpc]
>   __rpc_execute+0x68/0x270 [sunrpc]
>   rpc_async_schedule+0x30/0x50 [sunrpc]
>   process_one_work+0x22b/0x3d0
>   worker_thread+0x53/0x420
>   ? process_one_work+0x3d0/0x3d0
>   kthread+0x12a/0x150
>   ? set_kthread_struct+0x50/0x50
>   ret_from_fork+0x22/0x30
>   </TASK>
> 
> There is no workaround.
> 
> [Fix]
> 
> This was fixed in 6.9-rc1 by:
> 
> commit 961b4b5e86bf56a2e4b567f81682defa5cba957e
> From: Chuck Lever <chuck.lever at oracle.com>
> Date: Fri, 26 Jan 2024 12:45:17 -0500
> Subject: NFSD: Reset cb_seq_status after NFS4ERR_DELAY
> Link: https://github.com/torvalds/linux/commit/961b4b5e86bf56a2e4b567f81682defa5cba957e
> 
> This is present in 5.15.179 and 6.6.76 upstream stable.
> 
> [Testcase]
> 
> There is no known synthetic reproducer available.
> 
> Currently we see it in production workloads on Google Kubernetes Engine. We
> have deployed and run a test kernel in production, with no further incidents
> occurring. Previously, it would lock up once a day.
> 
> The test kernel is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf407307-test
> 
> If you install the kernel from the ppa, the issue no longer occurs.
> 
> [Where problems can occur]
> 
> We are resetting the value of cb->cb_seq_status back to 1 to let it get out of
> its state machine, and to actually make some progress, instead of being
> trapped at NFS4ERR_DELAY.
> 
> If a regression were to occur, it would affect NFS v4.x systems. It is unlikely
> to cause any real issues; the most likely symptom would be some flapping between
> NFS4ERR_DELAY and sending callbacks.
> 
> Chuck Lever (1):
>   NFSD: Reset cb_seq_status after NFS4ERR_DELAY
> 
>  fs/nfsd/nfs4callback.c | 1 +
>  1 file changed, 1 insertion(+)
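
For anyone less familiar with this path, the failure mode described under
[Impact] can be illustrated with a small user-space model. This is only a
sketch and not the kernel code: struct model_cb, model_sequence_done(),
model_run() and MAX_ITERATIONS are hypothetical names. It assumes that once
the connection is lost no new reply is decoded, so cb_seq_status keeps
whatever value it last had.

```c
#include <stdbool.h>

#define NFS4ERR_DELAY   10008  /* NFSv4 status code; stored negated, as -10008 */
#define MAX_ITERATIONS  5      /* hypothetical cap so the model terminates */

/* Hypothetical stand-in for the callback state; not struct nfsd4_callback. */
struct model_cb {
	int cb_seq_status;     /* 1 means "no reply has been decoded yet" */
};

/*
 * Simplified model of the status check. Returns true when the callback
 * would be restarted after a delay. apply_fix selects whether the status
 * is reset to 1 before the restart, as the patch subject describes.
 */
static bool model_sequence_done(struct model_cb *cb, bool apply_fix)
{
	switch (cb->cb_seq_status) {
	case -NFS4ERR_DELAY:
		if (apply_fix)
			cb->cb_seq_status = 1;
		return true;   /* retry the callback */
	case 1:
	default:
		return false;  /* treated as a fault in this model: no retry */
	}
}

/*
 * One NFS4ERR_DELAY reply is received, then the connection is lost, so
 * nothing ever overwrites cb_seq_status again. Returns the number of
 * retries, capped at MAX_ITERATIONS.
 */
static int model_run(bool apply_fix)
{
	struct model_cb cb = { .cb_seq_status = -NFS4ERR_DELAY };
	int retries = 0;

	while (retries < MAX_ITERATIONS && model_sequence_done(&cb, apply_fix))
		retries++;

	return retries;
}
```

Under these assumptions, model_run(false) runs into the MAX_ITERATIONS cap,
which stands in for looping forever, while model_run(true) retries once and
then leaves the retry path.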

Acked-by: Andrei Gherzan <andrei.gherzan at canonical.com>

-- 
Andrei Gherzan