[Bug 1940043] Re: Upgrade from OVN 20.03 to newer OVN version will cause data plane outage
Frode Nordahl
1940043 at bugs.launchpad.net
Sun Nov 6 12:07:47 UTC 2022
Package versions before we start:
$ juju run --application ovn-central 'dpkg -l |grep ovn'
- Stdout: |
ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
UnitId: ovn-central/0
- Stdout: |
ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
UnitId: ovn-central/1
- Stdout: |
ii ovn-central 20.03.2-0ubuntu0.20.04.4 amd64 OVN central components
ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
UnitId: ovn-central/2
$ juju run --application ovn-chassis 'dpkg -l |grep ovn'
- Stdout: |
ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu4 all Neutron is a virtual network service for Openstack - OVN metadata agent
ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
ii ovn-host 20.03.2-0ubuntu0.20.04.4 amd64 OVN host components
UnitId: ovn-chassis/0
- Stdout: |
ii neutron-ovn-metadata-agent 2:16.4.2-0ubuntu4 all Neutron is a virtual network service for Openstack - OVN metadata agent
ii ovn-common 20.03.2-0ubuntu0.20.04.4 amd64 OVN common components
ii ovn-host 20.03.2-0ubuntu0.20.04.4 amd64 OVN host components
UnitId: ovn-chassis/1
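
To confirm that every unit starts from the same package version, the output above can be reduced to its distinct version strings. The helper below is a minimal sketch and was not part of the original verification; the function name `distinct_ovn_versions` is hypothetical, and it assumes `dpkg -l` style lines on stdin.

```shell
# Hypothetical helper: print each distinct ovn-common version seen in
# `dpkg -l` style lines read from stdin. Leading indentation is ignored
# because awk splits fields on runs of whitespace.
distinct_ovn_versions() {
  awk '$1 == "ii" && $2 == "ovn-common" { print $3 }' | sort -u
}
```

Piping the output of either `juju run ... 'dpkg -l |grep ovn'` command into `distinct_ovn_versions` should print exactly one line when all units agree.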
Ping running instances:
$ ping 10.78.95.55
PING 10.78.95.55 (10.78.95.55) 56(84) bytes of data.
64 bytes from 10.78.95.55: icmp_seq=1 ttl=63 time=1.80 ms
64 bytes from 10.78.95.55: icmp_seq=2 ttl=63 time=1.22 ms
64 bytes from 10.78.95.55: icmp_seq=3 ttl=63 time=1.06 ms
...
$ ping 10.78.95.162
PING 10.78.95.162 (10.78.95.162) 56(84) bytes of data.
64 bytes from 10.78.95.162: icmp_seq=1 ttl=63 time=1.08 ms
64 bytes from 10.78.95.162: icmp_seq=2 ttl=63 time=0.545 ms
64 bytes from 10.78.95.162: icmp_seq=3 ttl=63 time=0.516 ms
...
Ensure OVN DNS interception/resolution is enabled and working:
ubuntu@zaza-neutrontests-ins-1:~$ dig zaza-neutrontests-ins-2 @10.78.95.1
...
;; ADDITIONAL SECTION:
zaza-neutrontests-ins-2. 3600 IN A 192.168.0.180
Ensure the ovn-controllers pick up the version mismatch prior to upgrade:
Note that the backported version mismatch handling code does not have the
additional version mismatch check in the incremental processing engine that
later versions have. This means that we need to ensure the main loop version
mismatch check has run before northd is allowed to fill database tables after
an upgrade. We will have to deal with this in the charms and/or as part
of the upgrade documentation.
Force the version mismatch to happen before we actually perform the upgrade to
ensure the ovn-controller does not make any mistakes.
$ juju run --application ovn-central 'systemctl stop ovn-northd; systemctl mask ovn-northd'
$ juju run --unit ovn-central/0 'ovn-sbctl set sb-global . options:northd_internal_version="20.03.2-2.7.0-42.1"'
$ juju run --application ovn-chassis 'tail -1 /var/log/ovn/ovn-controller.log'
- Stdout: |
2022-11-06T11:04:26.305Z|00017|main|WARN|controller version - 20.03.2-2.7.0-42.0 mismatch with northd version - 20.03.2-2.7.0-42.1
UnitId: ovn-chassis/0
- Stdout: |
2022-11-06T11:04:26.272Z|00021|main|WARN|controller version - 20.03.2-2.7.0-42.0 mismatch with northd version - 20.03.2-2.7.0-42.1
UnitId: ovn-chassis/1
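
Rather than reading the last log line by eye, the two versions can be extracted from the warning. This is a minimal sketch that assumes the log message format shown above stays as it is; the function name `mismatch_versions` is hypothetical.

```shell
# Hypothetical helper: read ovn-controller log lines on stdin and, for each
# version mismatch warning, print "<controller version> <northd version>".
# Lines that do not match the warning format produce no output.
mismatch_versions() {
  sed -n 's/.*controller version - \([^ ]*\) mismatch with northd version - \([^ ]*\).*/\1 \2/p'
}
```

Empty output from a chassis unit would suggest the main loop check has not yet logged the mismatch, in which case it is safer to wait before proceeding with the upgrade.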
Confirm OVN DNS interception/resolution is still working:
ubuntu@zaza-neutrontests-ins-1:~$ dig zaza-neutrontests-ins-2 @10.78.95.1
...
;; ADDITIONAL SECTION:
zaza-neutrontests-ins-2. 3600 IN A 192.168.0.180
Upgrade packages on central units:
$ juju config ovn-central ovn-source=cloud:focal-ovn-22.03
$ juju run --application ovn-central 'systemctl unmask ovn-northd; systemctl restart ovn-northd'
Confirm instances are still responding:
64 bytes from 10.78.95.55: icmp_seq=1167 ttl=63 time=1.31 ms
64 bytes from 10.78.95.55: icmp_seq=1168 ttl=63 time=1.06 ms
64 bytes from 10.78.95.55: icmp_seq=1169 ttl=63 time=1.07 ms
...
64 bytes from 10.78.95.162: icmp_seq=1145 ttl=63 time=0.569 ms
64 bytes from 10.78.95.162: icmp_seq=1146 ttl=63 time=0.564 ms
64 bytes from 10.78.95.162: icmp_seq=1147 ttl=63 time=0.937 ms
...
Confirm OVN DNS interception/resolution is still working:
ubuntu@zaza-neutrontests-ins-1:~$ dig zaza-neutrontests-ins-2 @10.78.95.1
...
;; ADDITIONAL SECTION:
zaza-neutrontests-ins-2. 3600 IN A 192.168.0.180
Upgrade packages on chassis units:
$ juju config ovn-chassis ovn-source=cloud:focal-ovn-22.03
Collect instance ping statistics after completing the upgrade:
64 bytes from 10.78.95.55: icmp_seq=1299 ttl=63 time=1.02 ms
64 bytes from 10.78.95.55: icmp_seq=1300 ttl=63 time=1.04 ms
64 bytes from 10.78.95.55: icmp_seq=1301 ttl=63 time=1.01 ms
^C
--- 10.78.95.55 ping statistics ---
1301 packets transmitted, 1293 received, 0.614912% packet loss, time 1301821ms
rtt min/avg/max/mdev = 0.871/1.296/28.365/1.310 ms
64 bytes from 10.78.95.162: icmp_seq=1264 ttl=63 time=0.642 ms
64 bytes from 10.78.95.162: icmp_seq=1265 ttl=63 time=0.711 ms
64 bytes from 10.78.95.162: icmp_seq=1266 ttl=63 time=0.564 ms
^C
--- 10.78.95.162 ping statistics ---
1266 packets transmitted, 1261 received, 0.394945% packet loss, time 1292577ms
rtt min/avg/max/mdev = 0.434/0.677/28.644/0.921 ms
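
The loss percentages follow directly from the transmitted and received counts: 1301 - 1293 = 8 lost packets for 10.78.95.55 and 1266 - 1261 = 5 for 10.78.95.162. Assuming ping's default one-second interval, which is consistent with the reported run times, that is on the order of 8 and 5 seconds of lost replies over the whole upgrade. The sketch below recomputes the figure from a ping summary line; the function name `ping_loss` is hypothetical.

```shell
# Hypothetical helper: read ping output on stdin and, for the summary line,
# print the number of lost packets and the loss percentage.
# Field 1 is the transmitted count and field 4 the received count.
ping_loss() {
  awk '/packets transmitted/ {
    tx = $1; rx = $4
    printf "%d lost of %d (%.6f%%)\n", tx - rx, tx, (tx - rx) * 100 / tx
  }'
}
```

Fed the first summary line above, this prints `8 lost of 1301 (0.614912%)`, matching the percentage ping reported.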
** Tags removed: verification-needed verification-needed-focal
** Tags added: verification-done verification-done-focal
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to ovn in Ubuntu.
https://bugs.launchpad.net/bugs/1940043
Title:
Upgrade from OVN 20.03 to newer OVN version will cause data plane
outage
Status in charm-layer-ovn:
Fix Released
Status in charm-ovn-chassis:
Fix Released
Status in charm-ovn-dedicated-chassis:
Fix Released
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive wallaby series:
Triaged
Status in ovn package in Ubuntu:
Fix Released
Status in ovn source package in Focal:
Fix Committed
Status in ovn source package in Hirsute:
Won't Fix
Status in ovn source package in Impish:
Fix Released
Bug description:
[Impact]
When upgrading from OVN 20.03, as made available in Ubuntu Focal, to a newer version of OVN, it is currently not possible to upgrade without causing a data plane outage.
If the user attempts to upgrade the central components first, the ovn-
controller will tear down connectivity to running instances as it may
not fully understand the data structure of a newer database.
If the user attempts to upgrade the ovn-controller first, recent
releases are not guaranteed to understand the older database and
connectivity may remain down until all hypervisors and central
components have been upgraded.
If the user attempts to manually stop the ovn-controller during the
upgrade to avoid it inadvertently tearing down connectivity on central
component upgrade, cloud instances will be deprived of vital services
such as DNS lookup and DHCP.
To fix this situation two changes are needed:
1) Backport of an upstream feature [0] that allows the ovn-controller to detect a version mismatch and subsequently refrain from making further changes to the local Open vSwitch instance until the version mismatch is corrected.
2) Make ovn-controller not clear out runtime flow state in Open
vSwitch on exit by updating the ovn-controller systemd service to pass
the `--restart` argument when stopping the controller. This flag
tells the ovn-controller process that it should not clear out Open
vSwitch flows and OVN SB database records on exit, which allows
already installed state to continue operation until the new instance
of the ovn-controller process starts. [1][2][3]
It does not mean that the service will be restarted as opposed to
being stopped, as one might think based on the name of the argument.
This change serves two purposes:
2a) Allow upgrading the ovn-controller to a newer version than the
central components, while retaining connectivity to running instances
until the central components are upgraded.
2b) Minimize the downtime on package upgrade.
[Test Plan]
1. Deploy OpenStack Ussuri from the Focal archive.
2. Launch an instance and confirm connectivity.
3. Add UCA or another PPA with a newer version of OVN and upgrade the OVN components on the relevant units in the deployment.
4. Confirm that the new version of the central components makes the ovn-controller log a version mismatch, and that connectivity to the test instance continues.
5. Upgrade the data plane units and confirm that the version mismatch is resolved while instances retain connectivity with minimal downtime during the upgrade.
[Regression Potential]
The backported feature is optional and enabled by specifically
entering a key-value pair into the local Open vSwitch database to
enable it. It has also been available upstream for several releases.
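
The description does not name the key-value pair. Upstream OVN documents the option in ovn-controller(8) as `external_ids:ovn-match-northd-version` in the local Open_vSwitch table; that the Focal backport uses exactly this key is an assumption here. A minimal sketch of enabling it on a chassis, with a hypothetical function name and a hypothetical `OVS_VSCTL` override for dry runs:

```shell
# Assumption: the backport honours the upstream key
# external_ids:ovn-match-northd-version (documented in ovn-controller(8)).
# OVS_VSCTL is a hypothetical override so the command can be dry-run,
# e.g. OVS_VSCTL=echo prints the arguments instead of changing the database.
enable_northd_version_match() {
  "${OVS_VSCTL:-ovs-vsctl}" set open_vswitch . \
    external_ids:ovn-match-northd-version=true
}
```

In a charmed deployment the charm would normally manage this setting, so running it by hand is mainly useful for testing.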
The change to the ovn-controller systemd service has been in Ubuntu
since Impish [3] and we have had no reports of side effects of this
change.
[Original Bug Description]
The upstream recommendation for upgrades of OVN is to first upgrade the data plane components (chassis aka. ovn-controller), and then upgrade the central components (the database schema and ovn-northd). The rationale for this is that the new version of the ovn-controller is required to cope with any changes to database schema or how northd programs flows.
However, during the course of rapid OVN development, changes have also
been introduced that make the new ovn-controller unable to cope with
an old database schema, breaking the recommended upgrade procedure.
To cope with this, upstream has introduced a new optional configuration
for the ovn-controller that allows it to detect version
inconsistencies and, when they are present, stops it from making changes
to the data plane until the version inconsistency is resolved [0].
For the above-mentioned configuration to be effective, we also need the
package to call ``ovn-ctl stop_controller`` with the --restart option
so that the ovn-controller does not flush the installed flows on exit.
We should make required changes to packages and charms to allow
upgrades to progress with less data plane outage.
0: https://github.com/ovn-org/ovn/commit/1dd27ea7aea40122c1edbff845e14abaa70c0413
1: https://github.com/ovn-org/ovn/commit/f508fcc14abfaaa13e9f1bf3b5b6bac59bd27a5f
2: https://github.com/ovn-org/ovn/commit/45c7a85dc7f2af56191a47f1357d16b8af618e20
3: https://git.launchpad.net/~ubuntu-server-dev/ubuntu/+source/ovn/commit/debian/ovn-host.ovn-controller.service?id=3c601ecc13724d3f13ec0cc989f6ffd838f787f8
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-layer-ovn/+bug/1940043/+subscriptions