[Bug 1873952] Re: Call trace during manual controller reset on NVMe/RoCE array

Jennifer Duong 1873952 at bugs.launchpad.net
Mon Mar 29 22:36:50 UTC 2021


This call trace is also seen while manually resetting an NVIDIA Mellanox
InfiniBand switch that is connected to an NVMe/IB EF600 storage array.
The server has an MCX354A-FCBT installed, running FW 2.42.5000. The
system is connected to a QM8700 and an SB7800; both switches are running
MLNX-OS 3.9.2110. The message logs have been attached.

** Attachment added: "ICTM1605S01H4-switch-port-fail"
   https://bugs.launchpad.net/ubuntu/+source/nvme-cli/+bug/1873952/+attachment/5482212/+files/ICTM1605S01H4-switch-port-fail

** Summary changed:

- Call trace during manual controller reset on NVMe/RoCE array
+ Call trace during manual controller reset on NVMe/RoCE array and switch reset on NVMe/IB array

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to nvme-cli in Ubuntu.
https://bugs.launchpad.net/bugs/1873952

Title:
  Call trace during manual controller reset on NVMe/RoCE array and
  switch reset on NVMe/IB array

Status in nvme-cli package in Ubuntu:
  Confirmed

Bug description:
  After manually resetting one of my E-Series NVMe/RoCE controllers, I
  hit the following call trace:

  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958231] workqueue: WQ_MEM_RECLAIM nvme-wq:nvme_rdma_reconnect_ctrl_work [nvme_rdma] is flushing !WQ_MEM_RECLAIM ib_addr:process_one_req [ib_core]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958244] WARNING: CPU: 11 PID: 6260 at kernel/workqueue.c:2610 check_flush_dependency+0x11c/0x140
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958245] Modules linked in: xfs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache rpcrdma rdma_ucm ib_iser ib_umad libiscsi ib_ipoib scsi_transport_iscsi intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm intel_cstate intel_rapl_perf joydev input_leds dcdbas mei_me mei ipmi_si ipmi_devintf ipmi_msghandler mac_hid acpi_power_meter sch_fq_codel nvme_rdma rdma_cm iw_cm ib_cm nvme_fabrics nvme_core sunrpc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs uas usb_storage ib_core hid_generic usbhid hid mgag200 crct10dif_pclmul drm_vram_helper crc32_pclmul i2c_algo_bit ttm ghash_clmulni_intel drm_kms_helper ixgbe aesni_intel syscopyarea sysfillrect mxm_wmi xfrm_algo sysimgblt crypto_simd mlx5_core fb_sys_fops dca cryptd drm glue_helper mdio pci_hyperv_intf ahci lpc_ich tg3
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958305]  tls libahci mlxfw wmi scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958315] CPU: 11 PID: 6260 Comm: kworker/u34:3 Not tainted 5.4.0-24-generic #28-Ubuntu
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958316] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.8.0 05/17/2018
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958321] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958326] RIP: 0010:check_flush_dependency+0x11c/0x140
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958329] Code: 8d 8b b0 00 00 00 48 8b 50 18 4d 89 e0 48 8d b1 b0 00 00 00 48 c7 c7 40 f8 75 9d 4c 89 c9 c6 05 f1 d9 74 01 01 e8 1f 14 fe ff <0f> 0b e9 07 ff ff ff 80 3d df d9 74 01 00 75 92 e9 3c ff ff ff 66
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958331] RSP: 0018:ffffb34bc4e87bf0 EFLAGS: 00010086
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958333] RAX: 0000000000000000 RBX: ffff946423812400 RCX: 0000000000000000
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958334] RDX: 0000000000000089 RSI: ffffffff9df926a9 RDI: 0000000000000046
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958336] RBP: ffffb34bc4e87c10 R08: ffffffff9df92620 R09: 0000000000000089
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958337] R10: ffffffff9df92a00 R11: 000000009df9268f R12: ffffffffc09be560
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958338] R13: ffff9468238b2f00 R14: 0000000000000001 R15: ffff94682dbbb700
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958341] FS:  0000000000000000(0000) GS:ffff94682fd40000(0000) knlGS:0000000000000000
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958344] CR2: 00007ff61cbf4ff8 CR3: 000000040a40a001 CR4: 00000000003606e0
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958345] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958347] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958348] Call Trace:
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958355]  __flush_work+0x97/0x1d0
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958360]  __cancel_work_timer+0x10e/0x190
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958368]  ? dev_printk_emit+0x4e/0x65
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958371]  cancel_delayed_work_sync+0x13/0x20
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958387]  rdma_addr_cancel+0x8a/0xb0 [ib_core]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958393]  cma_cancel_operation+0x72/0x1e0 [rdma_cm]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958398]  rdma_destroy_id+0x56/0x2f0 [rdma_cm]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958402]  nvme_rdma_alloc_queue.cold+0x28/0x5b [nvme_rdma]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958406]  nvme_rdma_setup_ctrl+0x37/0x720 [nvme_rdma]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958412]  ? try_to_wake_up+0x224/0x6a0
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958416]  nvme_rdma_reconnect_ctrl_work+0x27/0x40 [nvme_rdma]
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958419]  process_one_work+0x1eb/0x3b0
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958422]  worker_thread+0x4d/0x400
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958427]  kthread+0x104/0x140
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958430]  ? process_one_work+0x3b0/0x3b0
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958432]  ? kthread_park+0x90/0x90
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958439]  ret_from_fork+0x35/0x40
  Apr 20 14:08:24 ICTM1611S01H4 kernel: [  949.958442] ---[ end trace 859f78e32cc2aa61 ]---
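
  For context, the WARNING at kernel/workqueue.c:2610 is issued by
  check_flush_dependency(): a work item running on a WQ_MEM_RECLAIM
  workqueue (nvme-wq) is flushing a work item on a workqueue without
  that flag (the ib_addr queue in ib_core), which can deadlock under
  memory pressure. A minimal sketch of the pattern the kernel is
  warning about (a hypothetical demo module, not the actual nvme_rdma
  or ib_core source):

    #include <linux/module.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *reclaim_wq; /* stands in for nvme-wq */
    static struct workqueue_struct *plain_wq;   /* stands in for ib_addr */

    static void plain_fn(struct work_struct *w) { }
    static DECLARE_WORK(plain_work, plain_fn);

    static void reclaim_fn(struct work_struct *w)
    {
            queue_work(plain_wq, &plain_work);
            /* Flushing !WQ_MEM_RECLAIM work from a WQ_MEM_RECLAIM
             * worker trips the WARN in check_flush_dependency():
             * under memory pressure plain_wq may never make forward
             * progress, so the reclaim-safe queue would block on it. */
            flush_work(&plain_work);
    }
    static DECLARE_WORK(reclaim_work, reclaim_fn);

    static int __init demo_init(void)
    {
            reclaim_wq = alloc_workqueue("demo_reclaim", WQ_MEM_RECLAIM, 0);
            plain_wq = alloc_workqueue("demo_plain", 0, 0);
            if (!reclaim_wq || !plain_wq) {
                    if (reclaim_wq)
                            destroy_workqueue(reclaim_wq);
                    if (plain_wq)
                            destroy_workqueue(plain_wq);
                    return -ENOMEM;
            }
            queue_work(reclaim_wq, &reclaim_work); /* triggers the WARN */
            return 0;
    }

    static void __exit demo_exit(void)
    {
            destroy_workqueue(reclaim_wq);
            destroy_workqueue(plain_wq);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");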

  This seems to occur consistently on my direct-connect host and not on
  my fabric-attached hosts. I'm running Ubuntu 20.04 with kernel
  5.4.0-24. I have the following Mellanox cards installed:

  MCX416A-CCAT FW 12.27.1016
  MCX4121A-ACAT FW 14.27.1016
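
  The reset in this report was performed on the array controller, but a
  host-side controller reset drives the same nvme_rdma teardown and
  re-setup paths and may be a simpler local trigger (an assumption, not
  what was done above; /dev/nvme0 is a placeholder for the fabric
  controller's character device). This sketch is equivalent to running
  "nvme reset /dev/nvme0" with nvme-cli:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
            int fd = open("/dev/nvme0", O_RDONLY);

            if (fd < 0) {
                    perror("open /dev/nvme0");
                    return 1;
            }
            /* NVME_IOCTL_RESET asks the kernel to reset the controller;
             * for nvme_rdma this tears down and re-establishes the RDMA
             * queues, the same path the reconnect work exercises. */
            if (ioctl(fd, NVME_IOCTL_RESET) < 0) {
                    perror("NVME_IOCTL_RESET");
                    close(fd);
                    return 1;
            }
            close(fd);
            return 0;
    }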

  ProblemType: Bug
  DistroRelease: Ubuntu 20.04
  Package: nvme-cli 1.9-1
  ProcVersionSignature: Ubuntu 5.4.0-24.28-generic 5.4.30
  Uname: Linux 5.4.0-24-generic x86_64
  ApportVersion: 2.20.11-0ubuntu27
  Architecture: amd64
  CasperMD5CheckResult: skip
  Date: Fri Apr 17 14:28:50 2020
  InstallationDate: Installed on 2020-04-15 (2 days ago)
  InstallationMedia: Ubuntu-Server 20.04 LTS "Focal Fossa" - Alpha amd64 (20200124)
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   XDG_RUNTIME_DIR=<set>
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  SourcePackage: nvme-cli
  UpgradeStatus: No upgrade log present (probably fresh install)
  modified.conffile..etc.nvme.hostnqn: ictm1611s01h4-hostnqn
  mtime.conffile..etc.nvme.hostnqn: 2020-04-15T13:43:48.076829

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/nvme-cli/+bug/1873952/+subscriptions


