[PATCH 0/2][focal/linux-azure] Azure: Mellanox VF NIC crashes when removed
Tim Gardner
tim.gardner at canonical.com
Tue May 17 14:22:32 UTC 2022
BugLink: https://bugs.launchpad.net/bugs/1973758
SRU Justification
[Impact]
The 5.4.0-1075-azure and newer kernels can easily panic when the Mellanox VF NIC
is removed and re-added, either by an Azure host servicing event or by the manual
"unbind/bind" test below (the GUID may differ from VM to VM):
for i in `seq 1 1000`;
do
    cd /sys/bus/vmbus/drivers/hv_pci;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > unbind;
    echo abdc2107-402e-4704-8c88-c2b850696c3c > bind;
done
A sample panic call-trace is:
[ 107.359954] kernel BUG at /build/linux-azure-5.4-4I3kFs/linux-azure-5.4-5.4.0/mm/slub.c:4020!
[ 107.363858] invalid opcode: 0000 [#1] SMP NOPTI
[ 107.365870] CPU: 0 PID: 334 Comm: kworker/0:2 Not tainted 5.4.0-1077-azure #80~18.04.1-Ubuntu
[ 107.369589] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
[ 107.373811] Workqueue: events vmbus_onmessage_work
[ 107.375909] RIP: 0010:kfree+0x1d2/0x240
…
[ 107.413789] Call Trace:
[ 107.414867] kobject_uevent_env+0x1b5/0x7e0
[ 107.416747] kobject_uevent+0xb/0x10
[ 107.418327] device_release_driver_internal+0x191/0x1c0
[ 107.420653] device_release_driver+0x12/0x20
[ 107.422523] bus_remove_device+0xe1/0x150
[ 107.424279] device_del+0x167/0x380
[ 107.425824] device_unregister+0x1a/0x60
[ 107.427536] vmbus_device_unregister+0x27/0x50
[ 107.429528] vmbus_onoffer_rescind+0x1d0/0x1f0
[ 107.431474] vmbus_onmessage+0x2c/0x70
[ 107.433104] vmbus_onmessage_work+0x22/0x30
[ 107.434919] process_one_work+0x209/0x400
[ 107.436661] worker_thread+0x34/0x40
The cause is an incomplete backport:
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/commit/?id=16a3c750a78d8,
omits the second hunk of the upstream patch
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=877b911a5ba0,
so hv_pci_remove() still releases hbus with free_page() instead of kfree().
Please apply the patch below to fix the issue:
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -3653,7 +3653,7 @@ static int hv_pci_remove(struct hv_device *hdev)
 	hv_put_dom_num(hbus->bridge->domain_nr);
-	free_page((unsigned long)hbus);
+	kfree(hbus);
 	return ret;
 }
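As a rough illustration of why the release call has to match the allocator,
here is a small userspace C model (not kernel code). model_alloc(),
model_free(), and the origin tags are hypothetical names invented for this
sketch. In the kernel a mismatch is not reported this cleanly; in the trace
above it surfaced as a BUG in mm/slub.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace model only: every name here is hypothetical, not a kernel API. */
enum origin { FROM_PAGE, FROM_SLAB };

/* Header stored in front of each buffer; the union keeps the payload aligned. */
union hdr {
	enum origin origin;
	max_align_t align;
};

/* Allocate a zeroed buffer and remember which "allocator" produced it. */
static void *model_alloc(size_t size, enum origin origin)
{
	union hdr *h = calloc(1, sizeof(*h) + size);

	if (!h)
		return NULL;
	h->origin = origin;
	return h + 1;
}

/*
 * Release a buffer. Returns 0 when the caller's expectation matches the
 * allocator that produced the buffer, and -1 on a mismatch. The memory is
 * returned to the C library in both cases.
 */
static int model_free(void *p, enum origin expected)
{
	union hdr *h = (union hdr *)p - 1;
	int ret = (h->origin == expected) ? 0 : -1;

	free(h);
	return ret;
}
```

In this model, a buffer obtained with FROM_SLAB and released with FROM_PAGE
makes model_free() return -1, which mirrors the mismatched pair left behind by
the incomplete backport; releasing it with FROM_SLAB returns 0.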
Please also apply the following patch. It is not strictly required, since it
only affects an error-handling path that is rarely taken:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=42c3d41832ef4fcf60aaa6f748de01ad99572adf
[Test Case]
Tested by Microsoft.
[Other Info]
SF: #00336939