[Bug 2015533] Re: Loss of network connectivity after upgrading dpdk packages from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic

Przemyslaw Lal 2015533 at bugs.launchpad.net
Mon May 15 07:57:56 UTC 2023


We did run into the issue of OVS picking up the wrong (non-DPDK) binary,
but we applied the update-alternatives workaround as soon as we noticed
it and restarted OVS. The no-traffic issue was still there despite OVS
being able to initialize the DPDK ports.
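
For reference, the workaround was along these lines (a sketch; the
alternative name and the DPDK binary path are as I remember them on
bionic, so please verify with `update-alternatives --list` first):

  # switch the ovs-vswitchd alternative to the DPDK-enabled binary, then restart OVS
  update-alternatives --list ovs-vswitchd
  update-alternatives --set ovs-vswitchd /usr/lib/openvswitch-switch-dpdk/ovs-vswitchd
  systemctl restart openvswitch-switch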

We also tried rebooting two hosts. After the reboots OVS was still
reporting DPDK support, but the no-traffic issue and the segfaults
remained, since the DPDK 17.11.10-0ubuntu0.2 packages were still
installed.

We were also able to reproduce the issue the next day by running the following (a rough sketch of the commands is below the list):
0. Starting point: the rollback of the DPDK upgrade (downgrade from 17.11.10-0ubuntu0.2 to 17.11.10-0ubuntu0.1) done one day earlier to recover from the outage.
1. OVS-DPDK 2.9.8-0ubuntu0.18.04.1 + DPDK 17.11.10-0ubuntu0.1 running with DPDK support enabled in OVS. Network connectivity confirmed to work.
2. Upgraded only the DPDK packages to 17.11.10-0ubuntu0.2 on one host. Network connectivity was lost, despite DPDK support still being present in OVS (`ovs-vsctl list open_vswitch` showed DPDK support and `ovs-vsctl show` looked good).
3. Downgraded only the DPDK packages back to 17.11.10-0ubuntu0.1. Connectivity was restored.
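
Roughly what steps 2 and 3 looked like (a sketch from memory; the exact
apt invocations may have differed slightly):

  # pin every installed dpdk/librte_* package to one version (step 2),
  # then the same with ver=17.11.10-0ubuntu0.1 to downgrade again (step 3)
  ver=17.11.10-0ubuntu0.2
  apt-get install --allow-downgrades \
      $(dpkg-query -W -f '${Package}\n' 'librte*' dpdk | sed "s/$/=$ver/")
  # re-check DPDK support in OVS after each change
  ovs-vsctl list Open_vSwitch | grep -i dpdk
  ovs-vsctl show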

The kernel version in one of the affected VMs was 4.15.0-121-generic
(Ubuntu), but we also saw the issue with non-Ubuntu VMs.

This environment uses jumbo frames on physical and vhostuser ports. Is it possible that patch [0] changed something in how such packets are handled? Especially this part of the commit message:
> This patch also has the advantage of requesting the exact packets sizes for the mbufs.

[0]
http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
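
For completeness, the jumbo MTU on the DPDK and vhostuser ports here is
set through the usual mtu_request knob, along these lines (the port
names and the value 9000 are only illustrative):

  # OVS sizes the DPDK mbuf pools based on the requested MTU, which is
  # why the mbuf-sizing change in [0] caught our eye
  ovs-vsctl set Interface dpdk-bond-member0 mtu_request=9000
  ovs-vsctl set Interface vhu01234567-89 mtu_request=9000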

-- 
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to openvswitch in Ubuntu.
https://bugs.launchpad.net/bugs/2015533

Title:
  Loss of network connectivity after upgrading dpdk packages from
  17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2 on bionic

Status in dpdk package in Ubuntu:
  Incomplete
Status in openvswitch package in Ubuntu:
  Invalid

Bug description:
  We upgraded the following packages on a number of hosts running on bionic-queens:
  * dpdk packages (dpdk and librte_*) from 17.11.10-0ubuntu0.1 to 17.11.10-0ubuntu0.2
  * openvswitch-switch and openvswitch-switch-dpdk from 2.9.5-0ubuntu0.18.04.1 to 2.9.8-0ubuntu0.18.04.4

  It was just a plain `apt dist-upgrade` which upgraded a number of
  other packages - I can provide a full list of upgraded packages if
  needed.

  This resulted in a complete dataplane outage on a production cloud.

  Symptoms:

  1. Loss of network connectivity on virtual machines using
  dpdkvhostuser ports.

  VMs were unable to send any packets. Using `virsh console` we observed
  the following line printed a few times per second:

  net eth0: unexpected txq (0) queue failure: -5

  At the same time we also observed the following messages in OVS logs:

  Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00727|dpdk|ERR|VHOST_CONFIG: recvmsg failed
  Apr 06 13:45:27 brtlvmrs0613co ovs-vswitchd[45321]: ovs|00732|dpdk|ERR|VHOST_CONFIG: recvmsg failed

  rx_* counters on the vhu port in OVS (rx on the OVS side corresponds
  to tx from the VM's point of view) were not increasing.
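
  The counters were sampled with something along these lines (the port
  and bridge names are just examples):

  ovs-vsctl get Interface vhu0123abcd-01 statistics:rx_packets
  ovs-ofctl dump-ports br-int vhu0123abcd-01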

  2. Segmentation faults in ovs/dpdk libraries.

  This was another symptom. After restarting OVS it would run fine for a while, but would crash after approx. 5-60 minutes on the upgraded hosts.
  There were no logs from OVS itself that would show the crash; the only output was always a single line in dmesg. Examples:

  [22985566.641329] ovs-vswitchd[55077]: segfault at 0 ip 00007f3b570ad7a5 sp 00007f3b41b59660 error 6 in librte_eal.so.17.11[7f3b57094000+26000]
  [22996115.925645] ovs-vswitchd[10442]: segfault at 0 ip 00007fd4065617a5 sp 00007fd3f0eb7660 error 6 in librte_eal.so.17.11[7fd406548000+26000]

  Or on another host:
  [22994791.103748] ovs-vswitchd[41066]: segfault at 0 ip 00007ff937ba27a5 sp 00007ff922ffc660 error 6 in librte_eal.so.17.11[7ff937b89000+26000]
  [22995667.342714] ovs-vswitchd[56761]: segfault at 0 ip 00007feb1fe10740 sp 00007feb0ab5b530 error 6 in librte_eal.so.17.11[7feb1fdf7000+26000]
  [22996548.675879] ovs-vswitchd[30376]: segfault at 0 ip 00007f077a11d7a5 sp 00007f0768eb4660 error 6 in librte_eal.so.17.11[7f077a104000+26000]
  [23002220.725328] pmd6[33609]: segfault at 2 ip 00007f0cfa00700e sp 00007f0ce7b5be80 error 4 in librte_vhost.so.17.11[7f0cf9ff9000+14000]
  [23004983.523060] pmd7[79951]: segfault at e5c ip 00007fdd500807de sp 00007fdd41101c80 error 4 in librte_vhost.so.17.11[7fdd50075000+14000]
  [23005350.737746] pmd6[17073]: segfault at 2 ip 00007fe9718df00e sp 00007fe9635ffe80 error 4 in librte_vhost.so.17.11[7fe9718d1000+14000]
  [  639.857893] ovs-vswitchd[4106]: segfault at 0 ip 00007f8e3227d7a5 sp 00007f8e14eb7660 error 6 in librte_eal.so.17.11[7f8e32264000+26000]
  [ 2208.666437] pmd6[11788]: segfault at 2 ip 00007ff2e941100e sp 00007ff2db131e80 error 4 in librte_vhost.so.17.11[7ff2e9403000+14000]
  [ 2966.124634] pmd6[48678]: segfault at 2 ip 00007feed53bd00e sp 00007feec70dde80 error 4 in librte_vhost.so.17.11[7feed53af000+14000]
  [ 4411.636755] pmd6[28285]: segfault at 2 ip 00007fcae075d00e sp 00007fcad247de80 error 4 in librte_vhost.so.17.11[7fcae074f000+14000]
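
  In case it helps with triage, the faulting offsets can be resolved
  roughly like this: subtract the mapping base from the ip in the dmesg
  line and feed the offset to addr2line (the multiarch library path and
  the need for the matching dbgsym/ddeb packages are assumptions to
  verify):

  # first ovs-vswitchd crash above: 0x7f3b570ad7a5 - 0x7f3b57094000 = 0x197a5
  addr2line -f -e /usr/lib/x86_64-linux-gnu/librte_eal.so.17.11 0x197a5
  # first pmd crash above: 0x7f0cfa00700e - 0x7f0cf9ff9000 = 0xe00e
  addr2line -f -e /usr/lib/x86_64-linux-gnu/librte_vhost.so.17.11 0xe00e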

  This was sort of "stabilized" by a full restart of OVS and the neutron
  agents while not touching any VMs, but on one machine we still saw
  librte_vhost.so segfaults. Even without the segfaults we still faced
  the "net eth0: unexpected txq (0) queue failure: -5" issue and did not
  have working connectivity.

  The issue was also easy to trigger by attempting a live migration of a
  VM that was using a vhu port, although OVS was also crashing randomly
  on its own.

  Failed attempts to restore the dataplane included:
  1. Restart of ovs and neutron agents.
  2. Restart of ovs and neutron agents, restart of libvirtd, nova-compute and hard reboot of VMs.
  3. Reboot of the hosts.
  4. Rollback of ovs packages to 2.9.5 without rolling back the dpdk/librte_* packages.

  Solution:

  After analyzing the diff between the dpdk 17.11.10-0ubuntu0.1 and
  17.11.10-0ubuntu0.2 packages [0], we decided to perform a rollback by
  manually reinstalling the 17.11.10-0ubuntu0.1 versions of the
  dpdk/librte_* debs (63 packages in total). The full list of
  rolled-back packages is at [1].

  Please note that we also re-installed the latest available OVS version
  (both openvswitch-switch and openvswitch-switch-dpdk,
  2.9.8-0ubuntu0.18.04.4) before rolling back dpdk.
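
  In practice the rollback boiled down to something like the following
  (a sketch; the staging directory name is made up and the exact package
  set is the one in [1]):

  # install all 63 staged 0ubuntu0.1 debs in a single dpkg run
  dpkg -i ./dpdk-17.11.10-0ubuntu0.1-debs/*.deb
  # optionally hold them so a later dist-upgrade does not pull 0ubuntu0.2 back in
  dpkg-query -W -f '${Package}\n' 'librte*' dpdk | xargs apt-mark hold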

  Actions taken after the downgrade (rough sketch below):
  1. Stopped all VMs.
  2. Restarted OVS.
  3. Restarted neutron agents.
  4. Started all VMs.
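
  Roughly, per compute host (service names are the usual ones on our
  deployment and may differ elsewhere):

  for dom in $(virsh list --name); do virsh shutdown "$dom"; done   # 1. stop all VMs
  systemctl restart openvswitch-switch                              # 2. restart OVS
  systemctl restart neutron-openvswitch-agent                       # 3. restart neutron agent
  # 4. start the VMs again (virsh start / nova start) once OVS and the agent are up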

  The rollback of the 63 dpdk/librte_* packages plus the service
  restarts were the only actions we needed to restore connectivity on
  all machines. The error messages disappeared from the VMs' console
  logs (no more "net eth0: unexpected txq (0) queue failure: -5"), OVS
  started to report rising rx_* counters on the vhu ports, and the
  segmentation faults from ovs-vswitchd and the pmd threads stopped as
  well.

  [0] http://launchpadlibrarian.net/623207263/dpdk_17.11.10-0ubuntu0.1_17.11.10-0ubuntu0.2.diff.gz
  [1] https://pastebin.ubuntu.com/p/Fx9dpQZwqM/

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2015533/+subscriptions



