[Bug 1460164] Re: restart of openvswitch-switch causes instance network down when l2population enabled
OpenStack Infra
1460164@bugs.launchpad.net
Tue Apr 12 14:15:29 UTC 2016
Reviewed: https://review.openstack.org/272643
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b79ed67ef703be3a034ed0cf95d401b0711dae46
Submitter: Jenkins
Branch: stable/kilo
commit b79ed67ef703be3a034ed0cf95d401b0711dae46
Author: James Page <james.page@ubuntu.com>
Date: Fri Dec 18 15:02:11 2015 +0000
Ensure that tunnels are fully reset on ovs restart
When the l2population mechanism driver is enabled and OVS is restarted,
tunnel ports are not re-configured in full, due to stale ofport handles
held in the OVS agent.
Reset all handles when OVS is restarted to ensure that tunnels are
fully recreated in this situation.
Change-Id: If0e034a034a7f000a1c58aa8a43d2c857dee6582
Closes-bug: #1460164
(cherry picked from commit 17c14977ce0e2291e911739f8c85838f1c1f3473)
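For illustration, here is a minimal Python sketch of the pattern the fix
applies. The names (tun_br_ofports, fdb_add_tunnel_port, _add_tunnel_port,
handle_ovs_restart) are simplified stand-ins, not the exact neutron
internals; the commit linked above is authoritative.

    # Sketch of the stale-ofport problem and the reset applied on restart.
    class OVSTunnelAgent(object):
        def __init__(self):
            # Cache of remote IP -> ofport, per tunnel type. l2population
            # fdb updates consult this cache to decide whether a tunnel
            # port already exists on br-tun.
            self.tun_br_ofports = {'gre': {}, 'vxlan': {}}
            self._next_ofport = 1

        def _add_tunnel_port(self, tunnel_type, remote_ip):
            # Stand-in for the real call that adds the port to br-tun,
            # installs its flows, and returns the new ofport.
            ofport = self._next_ofport
            self._next_ofport += 1
            return ofport

        def fdb_add_tunnel_port(self, tunnel_type, remote_ip):
            # Before the fix: after an OVS restart this cache still held
            # the old, now-invalid ofports, so the early return below
            # skipped recreating the tunnel port and its flood flows.
            cache = self.tun_br_ofports[tunnel_type]
            if remote_ip not in cache:
                cache[remote_ip] = self._add_tunnel_port(tunnel_type,
                                                         remote_ip)
            return cache[remote_ip]

        def handle_ovs_restart(self):
            # The fix: drop every cached handle when a restart is
            # detected, so subsequent l2population updates rebuild the
            # tunnels from scratch.
            self.tun_br_ofports = {t: {} for t in self.tun_br_ofports}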
** Tags added: in-stable-kilo
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to neutron in Ubuntu.
https://bugs.launchpad.net/bugs/1460164
Title:
restart of openvswitch-switch causes instance network down when
l2population enabled
Status in Ubuntu Cloud Archive:
Invalid
Status in Ubuntu Cloud Archive icehouse series:
Fix Released
Status in Ubuntu Cloud Archive juno series:
New
Status in Ubuntu Cloud Archive kilo series:
Fix Released
Status in neutron:
Fix Released
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Trusty:
Fix Released
Status in neutron source package in Wily:
Fix Released
Status in neutron source package in Xenial:
Fix Released
Bug description:
[Impact]
Restarts of openvswitch (typically on upgrade) result in loss of tunnel connectivity when the l2population driver is in use. This results in loss of access to all instances on the affected compute hosts.
[Test Case]
Deploy a cloud with ML2/OVS and the l2population mechanism driver enabled.
Boot instances.
Restart openvswitch-switch; instance connectivity will be lost until neutron-openvswitch-agent is restarted on the compute hosts (see the detection sketch below).
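For context, the agent already detects an OVS restart by polling a
"canary" flow that a restart of ovs-vswitchd wipes out. The sketch below
shows the idea; CANARY_TABLE and dump_flows are illustrative names
loosely based on the neutron OVS agent, not its exact API.

    # Illustrative restart detection via a canary flow. dump_flows is
    # an assumed helper returning the flows in a table, or None if the
    # bridge cannot be queried at all.
    CANARY_TABLE = 23

    def check_ovs_status(int_br):
        flows = int_br.dump_flows(table=CANARY_TABLE)
        if flows is None:
            return 'dead'        # ovsdb/vswitchd unreachable
        if not flows:
            return 'restarted'   # flow tables wiped: ovs restarted
        return 'normal'

Detection was never the problem here: on restart the agent rebuilt its
bridges, but kept the stale tunnel ofport cache, so the tunnel ports
themselves were never recreated until the agent itself was restarted.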
[Regression Potential]
Minimal; the fix has already landed in multiple stable branches upstream.
[Original Bug Report]
On 2015-05-28, our Landscape auto-upgraded packages on two of our
OpenStack clouds. On both clouds, but only on some compute nodes, the
upgrade of openvswitch-switch and corresponding downtime of
ovs-vswitchd appears to have triggered some sort of race condition
within neutron-plugin-openvswitch-agent leaving it in a broken state;
any new instances come up with non-functional network but pre-existing
instances appear unaffected. Restarting n-p-ovs-agent on the affected
compute nodes is sufficient to work around the problem.
The packages Landscape upgraded (from /var/log/apt/history.log):
Start-Date: 2015-05-28 14:23:07
Upgrade: nova-compute-libvirt:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-login0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), nova-compute-kvm:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), systemd-services:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-common:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), nova-common:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), python-nova:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-daemon0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), libpam-systemd:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), udev:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub2-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), openvswitch-switch:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2), libudev1:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-client:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), python-eventlet:amd64 (0.13.0-1ubuntu2, 0.13.0-1ubuntu2.1), python-novaclient:amd64 (2.17.0-0ubuntu1.1, 2.17.0-0ubuntu1.2), grub-pc-bin:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), grub-pc:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), nova-compute:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), openvswitch-common:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2)
End-Date: 2015-05-28 14:24:47
From /var/log/neutron/openvswitch-agent.log:
2015-05-28 14:24:18.336 47866 ERROR neutron.agent.linux.ovsdb_monitor
[-] Error received from ovsdb monitor: ovsdb-client:
unix:/var/run/openvswitch/db.sock: receive failed (End of file)
Looking at a stuck instance, all the expected tunnels, bridges, and
ports appear to be in place:
root@vector:~# ip l l | grep c-3b
460002: qbr7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
460003: qvo7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP mode DEFAULT group default qlen 1000
460004: qvb7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UP mode DEFAULT group default qlen 1000
460005: tap7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UNKNOWN mode DEFAULT group default qlen 500
root@vector:~# ovs-vsctl list-ports br-int | grep c-3b
qvo7ed8b59c-3b
root@vector:~#
But I can't ping the unit from within the qrouter-${id} namespace on
the neutron gateway. If I tcpdump the {q,t}*c-3b interfaces, I don't
see any traffic.
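A hypothetical diagnostic for this state, assuming ovs-ofctl output in
the 2.0-era "output:N" form: compare the ofports referenced by br-tun's
flows with the ports that actually exist. Stale references, or tunnel
ports simply missing from the flood flows, would explain traffic
silently going nowhere.

    # Hypothetical check: do br-tun's flows output to ofports that no
    # longer exist? Requires ovs-ofctl and root on the compute host.
    import re
    import subprocess

    def existing_ofports(bridge='br-tun'):
        out = subprocess.check_output(['ovs-ofctl', 'show', bridge]).decode()
        # Port lines look like: " 2(gre-0a000001): addr:..."
        return {int(m) for m in re.findall(r'^\s*(\d+)\(', out, re.M)}

    def referenced_ofports(bridge='br-tun'):
        out = subprocess.check_output(['ovs-ofctl', 'dump-flows', bridge]).decode()
        return {int(m) for m in re.findall(r'output:(\d+)', out)}

    stale = referenced_ofports() - existing_ofports()
    if stale:
        print('flows reference nonexistent ofports:', sorted(stale))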
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1460164/+subscriptions