[Bug 2017748] Re: [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues
Matthew Ruffell
2017748 at bugs.launchpad.net
Wed Jun 4 02:54:07 UTC 2025
** Description changed:
[Impact]
- ovnmeta- namespaces are missing intermittently, and the affected VMs cannot be reached.
+ During scalability tests where extreme load is generated by creating thousands
+ of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be
+ pinged or sshed to after deployment.
- The ovn metadata namespace may be missing intermittently under certain
- conditions, such as high load. This prevents VMs from retrieving
- metadata (e.g., ssh keys), making them unreachable. The issue is not
- easily reproducible.
+ The ovnmeta namespaces for networks that the VMs were created in are missing.
+ The following lines are present in neutron-ovn-metadata-agent.log:
+
+ 2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
+ 2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
+
+ What is happening is that under extreme load, the metadata port information
+ has sometimes not yet been propagated by OVN to the Southbound database, where
+ it would normally arrive as an update notification. When the
+ PortBindingChassisEvent event is triggered in ovn-metadata-agent, it only looks
+ for update notifications, finds none, and so has no metadata port or IP
+ information; it fails, logs the message above, and tears down the ovnmeta
+ namespace for that VM's network.
+
+ Eventually ovsdb-server catches up and merges the insert and update
+ notifications, sending them out as a single insert notification, which
+ PortBindingChassisEvent currently ignores, so the metadata is never applied
+ to the VM.
+
+ This is a race condition: it doesn't happen under normal conditions, as the
+ metadata would simply be delivered in an update notification.
+
+ The fix is to also listen for insert notifications, and act on them.
[Test Case]
- This issue is theoretically reproducible under certain conditions, such
- as high load. However, in practice, it has proven extremely difficult to
- reproduce.
+ This can't be reproduced in the lab, even after many attempts.
- I first talked with the fix author, Brian, who confirmed that he does
- not have a reproducer. I then made almost 10 attempts to reproduce the
- issue, but was unsuccessful; please refer to this pastebin for more
- details - https://paste.ubuntu.com/p/H6vh8jycvC/
+ A user sees this issue daily in production, where they run a scalability test
+ every night, in which they create a new tenant, create all necessary resources
+ (networks, subnets, routers, load balancers, etc.) and start several thousand
+ VMs. They then audit the deployment and verify that everything deployed
+ correctly.
- I can't find a reproducer, so I have also done further research into how
- the issue was reproduced on the customer's side. The customer runs their
- nightly scripts a few times a day, which basically create a tenant and all
- resources (networks, subnets, routers, vms, lbs, etc.) and verify that
- they were created correctly. Intermittently the customer notices that a VM
- is unreachable after creation, and in the last two cases, they saw that
- the ovn namespace was missing on the compute host; because of this, even
- though the VM was created, the metadata URL is not reachable. We do see
- similar logs in the customer's env, such as:
+ Most days there are a small number of VMs that are unreachable, and those VMs
+ have the following messages in neutron-ovn-metadata-agent.log:
- VM: 08af4c45-2755-41d6-960c-ce67ecb183cc on host sdtpdc41s100020.xxx.net
- created: 2024-02-29T03:34:17Z
- network: 3be5f44d-39de-4c38-a77f-06c0d9ee42b0
- from neutron-ovn-metadata-agent.log.4.gz:
- 2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
- 2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
- 2024-02-29 03:34:30.275 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 4869dcc4-e1dd-4d5c-94dc-f491b8b4211c in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
- 2024-02-29 03:34:30.284 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
+ 2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent
+ [-] There is no metadata port for network
+ 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses
+ configured, tearing the namespace down if needed _get_provision_params
+ /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
- BTW, the hotfix containing the fix has been deployed in the
- customer's env for a long time, and the customer reported: "We applied the
- packages on one of the compute hosts and performed a test with creating
- 100 VMs in sequence but we did not see any failures in VM creation."
+ There are test packages available in:
- Given the lack of a reproducer, I continued to run the charmed-
- openstack-tester according to SRU standards to ensure no regressions
- were introduced.
+ https://launchpad.net/~mruffell/+archive/ubuntu/sf375454-updates
- As of today (20250509), this fix has also been deployed in a
- customer env via hotfix, and no regression issues have been observed so
- far. Of course, it remains unclear whether the fix actually resolves the
- original problem, as the issue itself is rare in the customer env as
- well. But I can say for sure (99.99%) that there are no regressions.
+ Some previous test packages have been running in the user's test environment for
+ several months, with zero metadata namespace issues since rollout. We issued
+ the user a hotfix and it has been running in production for the past month
+ and they have also had zero metadata namespace issues since rollout.
- I was not able to reproduce this easily, so I ran charmed-openstack-tester;
- the results are below:
-
- ======
- Totals
- ======
- Ran: 469 tests in 4273.6309 sec.
- - Passed: 398
- - Skipped: 69
- - Expected Fail: 0
- - Unexpected Success: 0
- - Failed: 2
- Sum of execute time for each test: 4387.2727 sec.
-
- 2 failed tests
- (tempest.api.object_storage.test_account_quotas.AccountQuotasTest and
- octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest)
- are not related to the ovn metadata code or this fix; with or without this
- fix you will see these 2 failures, so they can be ignored.
+ When this enters -proposed, it will be verified in the user's production
+ environment and subject to their nightly runs of their scalability tests, with
+ the results collected after a week or so of runs. After that we should be
+ confident the -proposed packages fix the issue.
[Where problems could occur]
- These patches are related to the ovn metadata agent on compute nodes.
- VM connectivity could possibly be affected by this patch when ovn is used.
- Binding ports to datapaths could be affected.
- [Others]
+ We are changing ovn-metadata-agent in neutron, and any issues would be limited
+ to ovn-metadata-agent only. ovn-metadata-agent will now listen for both
+ insert and update notifications from ovsdb-server, instead of just update
+ notifications as before. It shouldn't impact any existing functionality.
+
+ If a regression were to occur, it would affect attaching metadata namespaces to
+ newly created VMs, which would prevent them from getting their initial metadata
+ URL / DHCP lease / IP address information and cause connectivity issues for
+ those VMs. It shouldn't impact any existing VMs.
+
+ There are no workarounds if a regression were to occur, other than to downgrade
+ the package.
+
+ [Other info]
+
+ This was fixed upstream by:
+
+ commit a641e8aec09c1e33a15a34b19d92675ed2c85682
+ From: Terry Wilson <twilson at redhat.com>
+ Date: Fri, 15 Dec 2023 21:00:43 +0000
+ Subject: Handle creation of Port_Binding with chassis set
+ Link: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
+
+ This patch landed in Caracal. Backporting it to Zed, Antelope and Bobcat
+ would require the following commit as a prerequisite:
+
+ commit 6801589510242affc78497660d34377603774074
+ From: Jakub Libosvar <libosvar at redhat.com>
+ Date: Thu, 21 Sep 2023 19:40:36 +0000
+ Subject: ovn-metadata: Refactor events
+ Link: https://opendev.org/openstack/neutron/commit/6801589510242affc78497660d34377603774074
+
+ After some discussion, we (mruffell, brian-haley, hopem) decided that it would
+ be too much of a regression risk to backport "ovn-metadata: Refactor events"
+ to Zed, Antelope and Bobcat, so we marked those series "Won't Fix".
+
+ Now, the user is on yoga, so Brian Haley wrote a new backport that does not
+ depend on "ovn-metadata: Refactor events". It is the following commit in
+ neutron yoga:
+
+ commit 952e960414e7c15d4d4351bf2300ce53a69e4051
+ From: Terry Wilson <twilson at redhat.com>
+ Date: Tue, 20 Aug 2024 10:20:52 -0500
+ Subject: Handle creation of Port_Binding with chassis set
+ Link: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051
+
+ This is what we are suggesting for SRU to jammy / yoga.
+
+ There is a low chance of an upgrade regression for users going from yoga -> zed
+ -> antelope -> bobcat -> caracal (fixed), since users are unlikely to run
+ heavy stress tests partway through a series upgrade and would more likely run
+ them once they land on caracal.
+
+ If we have to, we will consider zed, antelope and bobcat in the future, but for
+ now we are targeting yoga only.
== ORIGINAL DESCRIPTION ==
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650
During a scalability test it was noted that a few VMs were having
issues being pinged (2 out of ~5000 VMs in the test conducted). After
some investigation it was found that the VMs in question did not receive
a DHCP lease:
udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed
And the ovnmeta- namespaces for the networks that the VMs were booting
from were missing. Looking into the ovn-metadata-agent.log:
2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
[-] There is no metadata port for network
9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
configured, tearing the namespace down if needed _get_provision_params
/usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495
Apparently, when the system is under stress (scalability tests) there
are some edge cases where the metadata port information has not yet
been propagated by OVN to the Southbound database, and when the
PortBindingChassisEvent event is handled and tries to find either
the metadata port or the IP information on it (which is updated by
ML2/OVN during subnet creation), it cannot be found and the handler fails
silently with the error shown above.
Note that running the same tests with less concurrency did not
trigger this issue, so it only happens when the system is overloaded.
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2017748
Title:
[SRU] OVN: ovnmeta namespaces missing during scalability test causing
DHCP issues
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive antelope series:
Won't Fix
Status in Ubuntu Cloud Archive bobcat series:
Won't Fix
Status in Ubuntu Cloud Archive caracal series:
Fix Released
Status in Ubuntu Cloud Archive dalmatian series:
Fix Released
Status in Ubuntu Cloud Archive epoxy series:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
In Progress
Status in Ubuntu Cloud Archive zed series:
Won't Fix
Status in neutron:
New
Status in neutron ussuri series:
Fix Released
Status in neutron victoria series:
New
Status in neutron wallaby series:
New
Status in neutron xena series:
New
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Focal:
In Progress
Status in neutron source package in Jammy:
New
Status in neutron source package in Noble:
Fix Released
Status in neutron source package in Oracular:
Fix Released
Status in neutron source package in Plucky:
Fix Released
Bug description:
[Impact]
During scalability tests where extreme load is generated by creating thousands
of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be
pinged or sshed to after deployment.
The ovnmeta namespaces for networks that the VMs were created in are missing.
The following lines are present in neutron-ovn-metadata-agent.log:
2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
What is happening is that under extreme load, the metadata port information
has sometimes not yet been propagated by OVN to the Southbound database, where
it would normally arrive as an update notification. When the
PortBindingChassisEvent event is triggered in ovn-metadata-agent, it only looks
for update notifications, finds none, and so has no metadata port or IP
information; it fails, logs the message above, and tears down the ovnmeta
namespace for that VM's network.
Eventually ovsdb-server catches up and merges the insert and update
notifications, sending them out as a single insert notification, which
PortBindingChassisEvent currently ignores, so the metadata is never applied
to the VM.
This is a race condition: it doesn't happen under normal conditions, as the
metadata would simply be delivered in an update notification.
The fix is to also listen for insert notifications, and act on them.
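
As a rough illustration of that idea (a hedged sketch only, not the actual
neutron patch; chassis_name and provision_datapath are placeholder names for
whatever the metadata agent exposes), an ovsdbapp RowEvent can be registered
for both create and update notifications on Port_Binding:

    from ovsdbapp.backend.ovs_idl import event as row_event

    class PortBindingCreatedOrUpdated(row_event.RowEvent):
        def __init__(self, agent):
            # Previously only ROW_UPDATE was watched, so a merged
            # insert+update delivered as a single "create" was dropped.
            events = (self.ROW_CREATE, self.ROW_UPDATE)
            super().__init__(events, 'Port_Binding', None)
            self.agent = agent

        def matches(self, event, row, old=None):
            if not super().matches(event, row, old):
                return False
            # Only ports bound to this agent's chassis are interesting.
            # On a create event there is no meaningful "old" row, so it
            # must not be dereferenced.
            chassis = getattr(row, 'chassis', None)
            return bool(chassis) and chassis[0].name == self.agent.chassis_name

        def run(self, event, row, old):
            # (Re)provision the ovnmeta namespace for the datapath this
            # port belongs to, whether the row arrived as an insert or an
            # update notification.
            self.agent.provision_datapath(row.datapath)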
[Test Case]
This can't be reproduced in the lab, even after many attempts.
A user sees this issue daily in production, where they run a scalability test
every night, in which they create a new tenant, create all necessary resources
(networks, subnets, routers, load balancers, etc.) and start several thousand
VMs. They then audit the deployment and verify that everything deployed
correctly.
Most days there are a small number of VMs that are unreachable, and those VMs
have the following messages in neutron-ovn-metadata-agent.log:
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent
[-] There is no metadata port for network
3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses
configured, tearing the namespace down if needed _get_provision_params
/usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
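
For reference, auditing a compute host for this failure amounts to checking
whether the ovnmeta-<network-uuid> namespace exists for the affected network.
A small illustrative helper (not shipped in any package; it assumes the usual
ovnmeta-<network UUID> namespace naming seen in the logs above):

    #!/usr/bin/env python3
    # Illustrative helper (not part of the SRU packages): report whether the
    # ovnmeta-<network-uuid> namespace exists on this compute host.
    import subprocess
    import sys

    def ovnmeta_namespace_present(network_id):
        # "ip netns list" prints one namespace per line, e.g.
        # "ovnmeta-3be5f44d-39de-4c38-a77f-06c0d9ee42b0 (id: 7)"
        out = subprocess.run(['ip', 'netns', 'list'],
                             capture_output=True, text=True, check=True).stdout
        names = [line.split()[0] for line in out.splitlines() if line.strip()]
        return 'ovnmeta-%s' % network_id in names

    if __name__ == '__main__':
        net = sys.argv[1]  # neutron network UUID of the affected VM's network
        print('present' if ovnmeta_namespace_present(net) else 'MISSING')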
There are test packages available in:
https://launchpad.net/~mruffell/+archive/ubuntu/sf375454-updates
Some previous test packages have been running in the user's test environment for
several months, with zero metadata namespace issues since rollout. We issued
the user a hotfix and it has been running in production for the past month
and they have also had zero metadata namespace issues since rollout.
When this enters -proposed, it will be verified in the user's production
environment and subject to their nightly runs of their scalability tests, with
the results collected after a week or so of runs. After that we should be
confident the -proposed packages fix the issue.
[Where problems could occur]
We are changing ovn-metadata-agent in neutron, and any issues would be limited
to ovn-metadata-agent only. ovn-metadata-agent will now listen for both
insert and update notifications from ovsdb-server, instead of just update
notifications as before. It shouldn't impact any existing functionality.
If a regression were to occur, it would affect attaching metadata namespaces to
newly created VMs, which would prevent them from getting their initial metadata
URL / DHCP lease / IP address information and cause connectivity issues for
those VMs. It shouldn't impact any existing VMs.
There are no workarounds if a regression were to occur, other than to downgrade
the package.
[Other info]
This was fixed upstream by:
commit a641e8aec09c1e33a15a34b19d92675ed2c85682
From: Terry Wilson <twilson at redhat.com>
Date: Fri, 15 Dec 2023 21:00:43 +0000
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682
This patch landed in Caracal. Backporting it to Zed, Antelope and Bobcat
would require the following commit as a prerequisite:
commit 6801589510242affc78497660d34377603774074
From: Jakub Libosvar <libosvar at redhat.com>
Date: Thu, 21 Sep 2023 19:40:36 +0000
Subject: ovn-metadata: Refactor events
Link: https://opendev.org/openstack/neutron/commit/6801589510242affc78497660d34377603774074
After some discussion, we (mruffell, brian-haley, hopem) decided that it would
be too much of a regression risk to backport "ovn-metadata: Refactor events"
to Zed, Antelope and Bobcat, so we marked those series "Won't Fix".
Now, the user is on yoga, so Brian Haley wrote a new backport that does not
depend on "ovn-metadata: Refactor events". It is the following commit in
neutron yoga:
commit 952e960414e7c15d4d4351bf2300ce53a69e4051
From: Terry Wilson <twilson at redhat.com>
Date: Tue, 20 Aug 2024 10:20:52 -0500
Subject: Handle creation of Port_Binding with chassis set
Link: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051
This is what we are suggesting for SRU to jammy / yoga.
There is a low chance of an upgrade regression for users going from yoga -> zed
-> antelope -> bobcat -> caracal (fixed), since users are unlikely to run
heavy stress tests partway through a series upgrade and would more likely run
them once they land on caracal.
If we have to, we will consider zed, antelope and bobcat in the future, but for
now we are targeting yoga only.
== ORIGINAL DESCRIPTION ==
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650
During a scalability test it was noted that a few VMs were having
issues being pinged (2 out of ~5000 VMs in the test conducted). After
some investigation it was found that the VMs in question did not
receive a DHCP lease:
udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed
And the ovnmeta- namespaces for the networks that the VMs were booting
from were missing. Looking into the ovn-metadata-agent.log:
2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
[-] There is no metadata port for network
9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
configured, tearing the namespace down if needed _get_provision_params
/usr/lib/python3.9/site-
packages/neutron/agent/ovn/metadata/agent.py:495
Apparently, when the system is under stress (scalability tests) there
are some edge cases where the metadata port information has not yet
been propagated by OVN to the Southbound database, and when the
PortBindingChassisEvent event is handled and tries to find either
the metadata port or the IP information on it (which is updated by
ML2/OVN during subnet creation), it cannot be found and the handler fails
silently with the error shown above.
Note that running the same tests with less concurrency did not
trigger this issue, so it only happens when the system is overloaded.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions