[SRU][J:linux-bluefield][PATCH v1 1/1] tcp: fix forever orphan socket caused by tcp_abort

Wed Jul 2 16:30:46 UTC 2025

From 1f314ba2e1ce988e4e91e6944e75bb9de3535cf9 Mon Sep 17 00:00:00 2001
Message-Id: <1f314ba2e1ce988e4e91e6944e75bb9de3535cf9.1750858824.git.saviram at nvidia.com>
In-Reply-To: <cover.1750858824.git.saviram at nvidia.com>
References: <cover.1750858824.git.saviram at nvidia.com>
From: Xueming Feng <kuro at kuroa.me>
Date: Mon, 26 Aug 2024 18:23:27 +0800
To: kernel-team at lists.ubuntu.com
Subject: [SRU][J:linux-bluefield][PATCH v1 1/1] tcp: fix forever orphan socket caused by tcp_abort

BugLink: https://bugs.launchpad.net/bugs/2114965

We had problems closing zero-window FIN-WAIT-1 TCP sockets in our
environment. This patch comes from the investigation.

Previously, tcp_abort only sent a reset and called tcp_done when the
socket was not SOCK_DEAD (that is, not an orphan). For an orphan
socket, it only purged the write queue; it did not close the socket and
left that to the timers.

While purging the write queue, tp->packets_out and sk->sk_write_queue
are cleared along the way. However, tcp_retransmit_timer has an early
return based on !tp->packets_out, and tcp_probe_timer has an early
return based on !sk->sk_write_queue.

As a result, ICSK_TIME_RETRANS and ICSK_TIME_PROBE0 are not rescheduled
and the socket is never killed by the timers, turning a zero-window
orphan into a forever orphan.
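
The failure mode above can be sketched with a small userspace model.
This is not kernel code: every toy_* name is hypothetical, the struct
fields only stand in for the kernel state named in this message, and
the timer handlers are reduced to the early returns described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
enum toy_state { TOY_FIN_WAIT1, TOY_CLOSE };

struct toy_sock {
    enum toy_state state;
    bool dead;                /* stands in for SOCK_DEAD (orphan) */
    unsigned int packets_out; /* stands in for tp->packets_out */
    unsigned int write_queue; /* stands in for the sk->sk_write_queue length */
    bool retrans_armed;       /* stands in for a pending ICSK_TIME_RETRANS */
    bool probe_armed;         /* stands in for a pending ICSK_TIME_PROBE0 */
};

/* An orphan in FIN-WAIT-1 facing a zero window: data queued, probe pending. */
static struct toy_sock toy_zero_window_orphan(void)
{
    struct toy_sock sk = {
        .state = TOY_FIN_WAIT1,
        .dead = true,
        .packets_out = 0,
        .write_queue = 3,
        .retrans_armed = false,
        .probe_armed = true,
    };
    return sk;
}

/* Purging clears both counters that the timers later test. */
static void toy_purge(struct toy_sock *sk)
{
    assert(sk != NULL);
    sk->packets_out = 0;
    sk->write_queue = 0;
}

/* Closing the socket: state goes to CLOSE, timers stop, queue is purged. */
static void toy_done(struct toy_sock *sk)
{
    toy_purge(sk);
    sk->retrans_armed = false;
    sk->probe_armed = false;
    sk->state = TOY_CLOSE;
}

/* Pre-patch behaviour as described: orphans are purged but not closed. */
static void toy_abort_old(struct toy_sock *sk)
{
    if (!sk->dead)
        toy_done(sk);
    toy_purge(sk);
}

/* Post-patch behaviour as described: close regardless of the orphan flag. */
static void toy_abort_new(struct toy_sock *sk)
{
    toy_done(sk);
}

/* A firing timer is no longer pending; the early return skips re-arming. */
static void toy_retrans_timer(struct toy_sock *sk)
{
    sk->retrans_armed = false;
    if (!sk->packets_out)
        return;
    sk->retrans_armed = true;
}

static void toy_probe_timer(struct toy_sock *sk)
{
    sk->probe_armed = false;
    if (!sk->write_queue)
        return;
    sk->probe_armed = true;
}

/* Orphan, not closed, and no timer left that could ever close it. */
static bool toy_is_forever_orphan(const struct toy_sock *sk)
{
    return sk->dead && sk->state != TOY_CLOSE &&
           !sk->retrans_armed && !sk->probe_armed;
}
```

In this model, calling toy_abort_old() on toy_zero_window_orphan() and
then firing both timers leaves toy_is_forever_orphan() true, while
toy_abort_new() moves the socket straight to TOY_CLOSE.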

This patch removes the SOCK_DEAD check in tcp_abort, so that it sends a
reset to the peer and closes the socket accordingly, preventing the
timer-less orphan from happening.

According to Lorenzo's email in the v1 thread, the check was there to
prevent force-closing the same socket twice. That situation is now
handled by testing for TCP_CLOSE under the socket lock, and returning
-ENOENT if the socket is already closed.
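
The double-close guard can be illustrated with a minimal userspace
sketch. The mini_* names are hypothetical and locking is left out; only
the -ENOENT return for an already closed socket mirrors the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All mini_* names are hypothetical. */
enum mini_state { MINI_ESTABLISHED, MINI_CLOSE };

struct mini_sock {
    enum mini_state state;
    int resets_sent; /* how many resets this model has "sent" */
    int err;         /* error recorded at close time */
};

/* Abort once; a second abort on the same socket reports -ENOENT. */
static int mini_abort(struct mini_sock *sk, int err)
{
    assert(sk != NULL);

    /* Avoid closing the same socket twice. */
    if (sk->state == MINI_CLOSE)
        return -ENOENT;

    sk->resets_sent++;
    sk->err = err;
    sk->state = MINI_CLOSE;
    return 0;
}
```

In this model a first mini_abort() returns 0 and a repeated call on the
same socket returns -ENOENT without sending a second reset.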

The -ENOENT code comes from the associated patch Lorenzo made for
iproute2-ss (link attached below), which also conforms to RFC 9293.

At the end of the patch, tcp_write_queue_purge(sk) is removed because it
is already called in tcp_done_with_error().

P.S. This is the same patch as v2, resent because it was mislabeled
"changes requested" on patchwork.kernel.org.

Link: https://patchwork.ozlabs.org/project/netdev/patch/1450773094-7978-3-git-send-email-lorenzo@google.com/
Fixes: c1e64e298b8c ("net: diag: Support destroying TCP sockets.")
Signed-off-by: Xueming Feng <kuro at kuroa.me>
Tested-by: Lorenzo Colitti <lorenzo at google.com>
Reviewed-by: Jason Xing <kerneljasonxing at gmail.com>
Reviewed-by: Eric Dumazet <edumazet at google.com>
Link: https://patch.msgid.link/20240826102327.1461482-1-kuro@kuroa.me
Signed-off-by: Jakub Kicinski <kuba at kernel.org>
(backported from commit bac76cf89816bff06c4ec2f3df97dc34e150a1c4)
Signed-off-by: Stav Aviram <saviram at nvidia.com>
[The conflict arose due to differences in error handling and logging
around tcp_send_active_reset(). The if (!sock_flag(sk, SOCK_DEAD)) check
was removed as in the upstream, while preserving the surrounding logic
from HEAD. In addition, !has_current_bpf_ctx() was replaced with
!current->bpf_ctx for compatibility, as the helper is unavailable in
this kernel version.]
---
 net/ipv4/tcp.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c1e624ca6a25..200dedcb24ac 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4504,6 +4504,13 @@ int tcp_abort(struct sock *sk, int err)
      /* Don't race with userspace socket closes such as tcp_close. */
      lock_sock(sk);

+     /* Avoid closing the same socket twice. */
+     if (sk->sk_state == TCP_CLOSE) {
+           if (!current->bpf_ctx)
+                 release_sock(sk);
+           return -ENOENT;
+     }
+
      if (sk->sk_state == TCP_LISTEN) {
            tcp_set_state(sk, TCP_CLOSE);
            inet_csk_listen_stop(sk);
@@ -4513,15 +4520,12 @@ int tcp_abort(struct sock *sk, int err)
      local_bh_disable();
      bh_lock_sock(sk);

-     if (!sock_flag(sk, SOCK_DEAD)) {
-           if (tcp_need_reset(sk->sk_state))
-                 tcp_send_active_reset(sk, GFP_ATOMIC);
-           tcp_done_with_error(sk, err);
-     }
+     if (tcp_need_reset(sk->sk_state))
+           tcp_send_active_reset(sk, GFP_ATOMIC);
+     tcp_done_with_error(sk, err);

      bh_unlock_sock(sk);
      local_bh_enable();
-     tcp_write_queue_purge(sk);
      release_sock(sk);
      return 0;
 }
--
2.34.1
________________________________
From: Stav Aviram <saviram at nvidia.com>
Sent: Wednesday, July 2, 2025 7:24 PM
To: kernel-team at lists.ubuntu.com <kernel-team at lists.ubuntu.com>
Cc: Aya Levin <ayal at nvidia.com>; Valentine Fatiev <valentinef at nvidia.com>
Subject: [SRU][J:linux-bluefield][PATCH v1 0/1] tcp: fix forever orphan socket caused by tcp_abort

From 1f314ba2e1ce988e4e91e6944e75bb9de3535cf9 Mon Sep 17 00:00:00 2001
Message-Id: <cover.1750858824.git.saviram at nvidia.com>
From: Stav Aviram <saviram at nvidia.com>
Date: Wed, 25 Jun 2025 16:40:24 +0300
To: kernel-team at lists.ubuntu.com
Subject: [SRU][J:linux-bluefield][PATCH v1 0/1] tcp: fix forever orphan socket caused by tcp_abort

BugLink: https://bugs.launchpad.net/bugs/2114965

SRU Justification:

[Impact]
In BFB version DOCA_2.6.0_BSP_4.6.0_Ubuntu_22.04-2.20240114, container
deletion via removal of its kubelet YAML from /etc/kubelet.d sometimes
fails to complete. The process waits for the container to disappear from
crictl ps, but the container remains in Running state indefinitely. This
behavior is seen with container version 2.dev.50 and FW 32.40.0324.
The issue appears to stem from a kernel bug affecting orphaned TCP
sockets stuck in a zero-window state. These sockets are not closed and
timers are not rescheduled, leading to "forever orphan" behavior that
prevents resource cleanup.

[Fix]
Backporting the upstream commit:
bac76cf89816bff06c4ec2f3df97dc34e150a1c4 ("tcp: fix forever orphan socket caused by tcp_abort")
This commit removes a conditional check on SOCK_DEAD in tcp_abort,
allowing proper closure of orphaned sockets and preventing indefinite
stalling. A backport is needed instead of a clean cherry-pick because
the error handling and logging around tcp_send_active_reset() differ
from upstream.

[Test Case]
Compile tested on linux-bluefield-5.15 on the master-next branch.
Further testing includes reproducing the issue by removing the pod's
YAML from /etc/kubelet.d and monitoring container termination using
crictl ps. With the patch applied, the container should no longer
remain stuck in Running state.

[Regression Potential]
The patch targets a specific edge case in TCP socket handling, and the
backport stays as close as possible to the original upstream commit.
However, since the change removes a check that previously avoided
closing SOCK_DEAD sockets, there is a small risk if other kernel paths
still rely on the earlier behavior. This could theoretically lead to
unexpected side effects in force-close logic if assumptions about socket
state are violated. Also, the backport is not an exact match for the
original commit, so unexpected behavior is possible in edge cases
related to socket teardown.

Xueming Feng (1):
  tcp: fix forever orphan socket caused by tcp_abort

 net/ipv4/tcp.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--
2.34.1
