[jammy] SRU: NFS: fix deadlock with pNFS flexfiles IO retry error path

Fri Nov 22 19:16:01 UTC 2024

SRU Justification:

Impact: In pre-production at a mutual "hyperscaler" customer that is
using the Ubuntu jammy kernel's NFS client with Hammerspace's pNFS
flexfiles: NFS client deadlock occurred due to upstream commit
7be7b3ca16a59 ("NFS: Ensure we immediately start writeback on
rescheduled writes"). Which was later fixed with upstream commit
b1a28f2eb9ea7 ("NFS: nfs_async_write_reschedule_io must not recurse
into the writeback code") in August 2022. But it unfortunately wasn't
marked for stable@ at that time. That has since been rectified and
Greg Kroah-Hartman has now picked it up for the next
stable/linux-5.15.y kernel (but it hasn't yet appeared in the stable
repo yet), please see:
https://lore.kernel.org/stable/2024112146-tiptoeing-available-c5fe@gregkh/T/

Fix: Apply upstream commit b1a28f2eb9ea7 ("NFS:
nfs_async_write_reschedule_io must not recurse into the writeback
code"), that commit was developed by and came from Trond Myklebust the
upstream Linux NFS client maintainer.

Testcase: Cause buffered IO issued by NFS client using pNFS flexfiles
to hit error paths (due to heavy enterprise use, with container limits
being imposed, which makes OOM within container particularly prone to
hit error memory allocation errors _and_ additional reason for NFS IO
to be retransmitted, e.g. due to volume down/up bounces). This can
lead to deadlock in NFS due to recursion with page locks already held,
e.g.:
[<0>] wait_on_page_bit_common+0x10c/0x3d0
[<0>] wait_on_page_bit+0x3f/0x50
[<0>] wait_on_page_writeback+0x26/0x80
[<0>] write_cache_pages+0x138/0x460
[<0>] nfs_writepages+0x10d/0x200 [nfs]
[<0>] do_writepages+0xd4/0x200
[<0>] filemap_fdatawrite_wbc+0x89/0xe0
[<0>] filemap_fdatawrite_range+0x54/0x70
[<0>] nfs_async_write_reschedule_io+0x69/0x80 [nfs]
[<0>] ff_layout_reset_write+0x73/0xe0 [nfs_layout_flexfiles]
[<0>] ff_layout_write_release+0x7a/0x90 [nfs_layout_flexfiles]
[<0>] rpc_free_task+0x3d/0x70 [sunrpc]
[<0>] rpc_async_release+0x30/0x50 [sunrpc]
[<0>] process_one_work+0x228/0x3d0
[<0>] worker_thread+0x53/0x420
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30~

[ Upstream commit b1a28f2eb9ea7a5a1763fe53fe699aa0feae4231 ]

commit b1a28f2eb9ea7a5a1763fe53fe699aa0feae4231
Author: Trond Myklebust <trond.myklebust at hammerspace.com>
Date:   Mon Aug 1 14:16:51 2022 -0400

    NFS: nfs_async_write_reschedule_io must not recurse into the writeback code

    It is not safe to call filemap_fdatawrite_range() from
    nfs_async_write_reschedule_io(), since we're often calling from a page
    reclaim context. Just let fsync() redrive the writeback for us.

    Signed-off-by: Trond Myklebust <trond.myklebust at hammerspace.com>

Fixes: 7be7b3ca16a59 ("NFS: Ensure we immediately start writeback on rescheduled writes")
Link: https://lore.kernel.org/stable/2024112146-tiptoeing-available-c5fe@gregkh/T/
Link: https://bugs.launchpad.net/ubuntu/+source/kernel-package/+bug/2089410
Signed-off-by: Mike Snitzer <snitzer at hammerspace.com>
Signed-off-by: Mike Snitzer <snitzer at kernel.org>
---
 fs/nfs/write.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 2e1c45344d1fd..00e11c6602fe4 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1437,8 +1437,6 @@ static void nfs_async_write_error(struct list_head *head, int error)
 static void nfs_async_write_reschedule_io(struct nfs_pgio_header *hdr)
 {
 	nfs_async_write_error(&hdr->pages, 0);
-	filemap_fdatawrite_range(hdr->inode->i_mapping, hdr->args.offset,
-			hdr->args.offset + hdr->args.count - 1);
 }
 
 static const struct nfs_pgio_completion_ops nfs_async_write_completion_ops = {
-- 
2.44.0