[SRU][J:linux-bluefield][PATCH v1 0/3] UBUNTU: SAUCE: Revert "can: gw: fix RCU/BH usage in cgw_create_job()"

Stav Aviram saviram at nvidia.com
Thu Jul 17 14:27:01 UTC 2025


BugLink: https://bugs.launchpad.net/bugs/2117163

SRU Justification:

[Impact]
In Ubuntu-bluefield-5.15.0-1071.73, which includes commits from upstream
stable version 5.15.183, the system crashes after building the kernel,
building the OFED driver, and restarting the driver:

Oops: 0000 [#1] SMP NOPTI
Workqueue: events kfree_rcu_work
RIP: 0010:kmem_cache_free_bulk+0x137/0x1d0
Call Trace:
 kfree_rcu_work+0x1e7/0x250
 process_one_work+0x1b0/0x350
 worker_thread+0x50/0x3a0
 kthread+0x124/0x150
 ret_from_fork+0x1f/0x30

The crash is caused by the k[v]free_rcu_mightsleep() functions, which
were introduced by the faulty commit 5dc583481a0a ("rcu/kvfree: Add
kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()").  This commit
introduces the new mightsleep functions but lacks the infrastructure
changes they require.  Our analysis indicates the root cause is an
incomplete API migration, which causes the mightsleep macros to pass
void pointers where rcu_callback_t function pointers are expected:

BF5.15 (broken):
  void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
Required:
  void kvfree_call_rcu(struct rcu_head *head, void *ptr)

This results in invalid pointer arithmetic that generates tiny memory
addresses (like 0x17), which crash the kernel when freed.
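
A user-space sketch of the signature mismatch may help reviewers.  It is
a simplified model, not kernel code: struct rcu_head_model, struct item,
and all helper names are hypothetical, and it assumes that an
offset-style callee recovers the object as "head minus second argument".
Under that assumption, a caller that passes the object pointer makes the
callee compute head - object, which is only the offset of the head
inside the object, i.e. a tiny address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for struct rcu_head; not the kernel definition. */
struct rcu_head_model {
	struct rcu_head_model *next;
	void (*func)(struct rcu_head_model *head);
};

/* Hypothetical object with an embedded head, like a kfree_rcu() user. */
struct item {
	long payload[4];
	struct rcu_head_model rcu;
};

/*
 * Callee modelled on the offset convention (assumption): the second
 * argument is the offset of the head inside the object, so the object
 * address is recovered as head - offset.
 */
static uintptr_t object_addr_offset_convention(uintptr_t head, uintptr_t arg)
{
	return head - arg;
}

/*
 * Callee modelled on the pointer convention (assumption): the second
 * argument already is the object address.
 */
static uintptr_t object_addr_pointer_convention(uintptr_t head, uintptr_t arg)
{
	(void)head;
	return arg;
}

/* Caller and callee agree on the pointer convention: correct address. */
static uintptr_t matched_free_target(const struct item *it)
{
	return object_addr_pointer_convention((uintptr_t)&it->rcu,
					      (uintptr_t)it);
}

/*
 * Caller uses the pointer convention, callee expects an offset.  The
 * subtraction yields offsetof(struct item, rcu), not the object address.
 */
static uintptr_t mismatched_free_target(const struct item *it)
{
	uintptr_t head = (uintptr_t)&it->rcu;
	uintptr_t obj = (uintptr_t)it;

	assert(head >= obj);	/* the head lives inside the object */
	return object_addr_offset_convention(head, obj);
}
```

In this model mismatched_free_target() returns offsetof(struct item, rcu),
a value of a few dozen bytes, instead of the address of the item.  Freeing
such a value would fault in the same way as the small addresses seen in
the Oops above, although the exact arithmetic in the affected kernel may
differ from this sketch.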

[Fix]
Phase 1 (Immediate):
Revert the problematic commit to restore stability, along with the two
other commits from the same series:
* 57818f6fec6c ("can: gw: fix RCU/BH usage in cgw_create_job()")
* 5dc583481a0a ("rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()") (main problematic commit)
* 82683fabcb28 ("can: gw: use call_rcu() instead of costly synchronize_rcu()")

Phase 2 (Proper Implementation):
The results of our research should be verified and applied to Jammy to
enable proper *_mightsleep() support for the OFED driver.  The most
critical commit to verify and apply is the upstream commit that changes
the kvfree_call_rcu() signature:
* 04a522b7da3d ("rcu: Refactor kvfree_call_rcu() and high-level helpers")
Additionally, the following commits should be examined to determine
whether they are essential for avoiding future issues:
* 7e3f926bf453 ("rcu/kvfree: Eliminate k[v]free_rcu() single argument macro")
* 5da7cb193db3 ("rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period")
* 23532061ad30 ("net/mlx5: Rename kfree_rcu() to kfree_rcu_mightsleep()")
A deeper investigation should also be conducted to ensure no additional
crucial commits are required for proper integration of this feature into
Jammy.  Once all necessary commits are backported, the *_mightsleep()
functions can be safely re-introduced into Jammy.

[Test Case]
Phase 1:
After reverting the three commits listed above, the kernel compiled
successfully on the master-next branch.  After rebooting into the
rebuilt kernel, building OFED, and restarting the driver, no crash
occurred.

Phase 2:
After applying all required infrastructure commits and re-adding the
mightsleep functions, the system should remain stable when building OFED
and restarting the driver.

[Regression Potential]
Phase 1 (Revert):
Very low risk.  The revert simply removes the problematic new
functionality and returns to the stable state that existed before the
faulty commit.

Phase 2 (Proper implementation):
Medium risk as it requires backporting multiple upstream RCU
infrastructure changes to an older kernel base.

Stav Aviram (3):
  UBUNTU: SAUCE: Revert "can: gw: fix RCU/BH usage in cgw_create_job()"
  UBUNTU: SAUCE: Revert "rcu/kvfree: Add kvfree_rcu_mightsleep() and
    kfree_rcu_mightsleep()"
  UBUNTU: SAUCE: Revert "can: gw: use call_rcu() instead of costly
    synchronize_rcu()"

 include/linux/rcupdate.h |   3 -
 net/can/gw.c             | 165 +++++++++++++++------------------------
 2 files changed, 65 insertions(+), 103 deletions(-)

-- 
2.38.1



