[Bug 2017748] Re: [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues
Hua Zhang
2017748 at bugs.launchpad.net
Wed May 28 09:14:10 UTC 2025
Hi Andreas (@ahasenack),
I just added some details about how the customer is actually impacted
to the [Test Case] section as well, thanks.
I can't find a reproducer, so I have also done further research into how
the issue was reproduced on the customer's side. The customer runs their
nightly scripts a few times a day; these scripts basically create a tenant
and all resources (networks, subnets, routers, VMs, LBs, etc.) and verify
that they were created correctly. Intermittently the customer notices that
a VM is unreachable after creation, and in the last two occurrences they
saw that the ovnmeta namespace was missing on the compute host; because of
this, even though the VM was created, the metadata URL is not reachable.
We do see similar logs in the customer's env, such as:
VM: 08af4c45-2755-41d6-960c-ce67ecb183cc on host sdtpdc41s100020.xxx.net
created: 2024-02-29T03:34:17Z
network: 3be5f44d-39de-4c38-a77f-06c0d9ee42b0
from neutron-ovn-metadata-agent.log.4.gz:
2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
2024-02-29 03:34:30.275 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 4869dcc4-e1dd-4d5c-94dc-f491b8b4211c in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:34:30.284 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
BTW, the hotfix containing this fix has been deployed in the
customer's env for a long time, and the customer reported: "We applied
the packages on one of the compute hosts and performed a test with
creating 100 VMs in sequence but we did not see any failures in VM
creation."
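
For reference, the symptom described above is straightforward to check on
a compute host once a VM has been created. The short Python sketch below
is purely illustrative (it is not the customer's script and assumes only
that the "ip netns" CLI is available); it looks for the ovnmeta-<network>
namespace that the metadata agent should have provisioned, using the
network UUID from the logs above as an example:

#!/usr/bin/env python3
# Illustrative check only: verify that the ovnmeta-<network> namespace
# exists on this compute host after a VM on that network has been created.
import subprocess
import sys

# Example UUID taken from the logs above; replace with the VM's network.
NETWORK_ID = "3be5f44d-39de-4c38-a77f-06c0d9ee42b0"

def ovnmeta_namespace_exists(network_id):
    """Return True if 'ip netns list' shows ovnmeta-<network_id>."""
    out = subprocess.run(["ip", "netns", "list"],
                         capture_output=True, text=True, check=True).stdout
    return any(line.startswith("ovnmeta-" + network_id)
               for line in out.splitlines())

if __name__ == "__main__":
    if ovnmeta_namespace_exists(NETWORK_ID):
        print("ovnmeta-%s is present, metadata should be reachable" % NETWORK_ID)
        sys.exit(0)
    print("ovnmeta-%s is missing, 169.254.169.254 will not be reachable" % NETWORK_ID)
    sys.exit(1)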
** Description changed:
[Impact]
ovnmeta- namespaces are intermittently missing, so VMs can't be reached.
The ovn metadata namespace may be missing intermittently under certain
conditions, such as high load. This prevents VMs from retrieving
metadata (e.g., ssh keys), making them unreachable. The issue is not
easily reproducible.
[Test Case]
This issue is theoretically reproducible under certain conditions, such
as high load. However, in practice, it has proven extremely difficult to
reproduce.
I first talked with the fix author, Brian, who confirmed that he does
not have a reproducer. I then made almost 10 test attempts to reproduce
the issue, but was unsuccessful; please refer to this pastebin for more
details - https://paste.ubuntu.com/p/H6vh8jycvC/
+
+ I can't find a reproducer, so I have also done further research into how
+ the issue was reproduced on the customer's side. The customer runs their
+ nightly scripts a few times a day; these scripts basically create a tenant
+ and all resources (networks, subnets, routers, VMs, LBs, etc.) and verify
+ that they were created correctly. Intermittently the customer notices that
+ a VM is unreachable after creation, and in the last two occurrences they
+ saw that the ovnmeta namespace was missing on the compute host; because of
+ this, even though the VM was created, the metadata URL is not reachable.
+ We do see similar logs in the customer's env, such as:
+
+ VM: 08af4c45-2755-41d6-960c-ce67ecb183cc on host sdtpdc41s100020.xxx.net
+ created: 2024-02-29T03:34:17Z
+ network: 3be5f44d-39de-4c38-a77f-06c0d9ee42b0
+ from neutron-ovn-metadata-agent.log.4.gz:
+ 2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
+ 2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
+ 2024-02-29 03:34:30.275 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 4869dcc4-e1dd-4d5c-94dc-f491b8b4211c in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
+ 2024-02-29 03:34:30.284 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
+
+ BTW, the hotfix containing this fix has been deployed in the
+ customer's env for a long time, and the customer reported: "We applied
+ the packages on one of the compute hosts and performed a test with
+ creating 100 VMs in sequence but we did not see any failures in VM
+ creation."
Given the lack of a reproducer, I continued to run the charmed-
openstack-tester according to SRU standards to ensure no regressions
were introduced.
And as of today (20250509), this fix has also been deployed in a
customer env via hotfix, and no regression issues have been observed so
far. Of course, it remains unclear whether the fix actually resolves the
original problem, as the issue itself is rare in the customer env as
well. But I can say for sure (99.99%) that there are no regressions.
Since I was not able to reproduce this easily, I ran the
charmed-openstack-tester; the result is below:
======
Totals
======
Ran: 469 tests in 4273.6309 sec.
- Passed: 398
- Skipped: 69
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 2
Sum of execute time for each test: 4387.2727 sec.
The 2 failed tests
(tempest.api.object_storage.test_account_quotas.AccountQuotasTest and
octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest)
are not related to OVN metadata or to this fix; these 2 tests fail
whether or not the fix is applied, so they can be ignored.
[Where problems could occur]
These patches are related to the OVN metadata agent on compute nodes.
VM connectivity could possibly be affected by this patch when OVN is used.
Binding a port to a datapath could be affected.
[Others]
== ORIGINAL DESCRIPTION ==
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650
During a scalability test it was noted that a few VMs were having
issues being pinged (2 out of ~5000 VMs in the test conducted). After
some investigation it was found that the VMs in question did not receive
a DHCP lease:
udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed
And the ovnmeta- namespaces for the networks that the VMs were booting
from were missing. Looking into the ovn-metadata-agent.log:
2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
[-] There is no metadata port for network
9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
configured, tearing the namespace down if needed _get_provision_params
/usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495
Apparently, when the system is under stress (scalability tests) there
are some edge cases where the metadata port information has not yet
been propagated by OVN to the Southbound database; when the
PortBindingChassisEvent event is handled and the agent tries to find
either the metadata port or the IP information on it (which is updated
by ML2/OVN during subnet creation), it cannot be found and the agent
fails silently with the error shown above.
Note that running the same tests with less concurrency did not
trigger this issue, so it only happens when the system is overloaded.
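
To make the failure mode above a bit more concrete, here is a minimal
Python sketch of one way to tolerate the race: instead of tearing the
namespace down as soon as the metadata port lookup fails, the
provisioning path could re-query the Southbound database a few times to
give OVN a chance to propagate the port. This is only a sketch, not the
actual upstream patch, and names such as sb_idl and
get_metadata_port_network() are hypothetical stand-ins for the agent's
real helpers.

import time

RETRIES = 5
RETRY_INTERVAL = 2  # seconds

def get_metadata_port_with_retry(sb_idl, datapath_uuid):
    """Return the metadata port for a datapath, retrying while OVN catches up."""
    for _ in range(RETRIES):
        # Hypothetical lookup helper; stands in for whatever the agent uses
        # to fetch the metadata port from the Southbound DB.
        port = sb_idl.get_metadata_port_network(datapath_uuid)
        if port is not None and port.mac:
            # The real provisioning check also validates that the port has
            # IP addresses configured; elided here for brevity.
            return port
        time.sleep(RETRY_INTERVAL)
    # Still missing after all retries; the caller decides whether to tear
    # the namespace down (today's behaviour) or retry provisioning later.
    return None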
--
You received this bug notification because you are a member of Ubuntu
OpenStack, which is subscribed to Ubuntu Cloud Archive.
https://bugs.launchpad.net/bugs/2017748
Title:
[SRU] OVN: ovnmeta namespaces missing during scalability test causing
DHCP issues
Status in Ubuntu Cloud Archive:
Fix Released
Status in Ubuntu Cloud Archive antelope series:
Won't Fix
Status in Ubuntu Cloud Archive bobcat series:
Won't Fix
Status in Ubuntu Cloud Archive caracal series:
Fix Released
Status in Ubuntu Cloud Archive dalmatian series:
Fix Released
Status in Ubuntu Cloud Archive epoxy series:
Fix Released
Status in Ubuntu Cloud Archive yoga series:
In Progress
Status in Ubuntu Cloud Archive zed series:
Won't Fix
Status in neutron:
New
Status in neutron ussuri series:
Fix Released
Status in neutron victoria series:
New
Status in neutron wallaby series:
New
Status in neutron xena series:
New
Status in neutron package in Ubuntu:
Fix Released
Status in neutron source package in Focal:
In Progress
Status in neutron source package in Jammy:
New
Status in neutron source package in Noble:
Fix Released
Status in neutron source package in Oracular:
Fix Released
Status in neutron source package in Plucky:
Fix Released
Bug description:
[Impact]
ovnmeta- namespaces are intermittently missing, so VMs can't be
reached.
The ovn metadata namespace may be missing intermittently under certain
conditions, such as high load. This prevents VMs from retrieving
metadata (e.g., ssh keys), making them unreachable. The issue is not
easily reproducible.
[Test Case]
This issue is theoretically reproducible under certain conditions,
such as high load. However, in practice, it has proven extremely
difficult to reproduce.
I first talked with the fix author, Brian, who confirmed that he does
not have a reproducer. I then made almost 10 test attempts to
reproduce the issue, but was unsuccessful; please refer to this
pastebin for more details - https://paste.ubuntu.com/p/H6vh8jycvC/
I can't find a reproducer, so I have also done further research into
how the issue was reproduced on the customer's side. The customer runs
their nightly scripts a few times a day; these scripts basically
create a tenant and all resources (networks, subnets, routers, VMs,
LBs, etc.) and verify that they were created correctly. Intermittently
the customer notices that a VM is unreachable after creation, and in
the last two occurrences they saw that the ovnmeta namespace was
missing on the compute host; because of this, even though the VM was
created, the metadata URL is not reachable. We do see similar logs in
the customer's env, such as:
VM: 08af4c45-2755-41d6-960c-ce67ecb183cc on host sdtpdc41s100020.xxx.net
created: 2024-02-29T03:34:17Z
network: 3be5f44d-39de-4c38-a77f-06c0d9ee42b0
from neutron-ovn-metadata-agent.log.4.gz:
2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
2024-02-29 03:34:30.275 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 4869dcc4-e1dd-4d5c-94dc-f491b8b4211c in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
2024-02-29 03:34:30.284 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
BTW, the hotfix containing this fix has been deployed in the
customer's env for a long time, and the customer reported: "We applied
the packages on one of the compute hosts and performed a test with
creating 100 VMs in sequence but we did not see any failures in VM
creation."
Given the lack of a reproducer, I continued to run the charmed-
openstack-tester according to SRU standards to ensure no regressions
were introduced.
And as of today (20250509), this fix has also been deployed in a
customer env via hotfix, and no regression issues have been observed
so far. Of course, it remains unclear whether the fix actually
resolves the original problem, as the issue itself is rare in the
customer env as well. But I can say for sure (99.99%) that there are
no regressions.
Since I was not able to reproduce this easily, I ran the
charmed-openstack-tester; the result is below:
======
Totals
======
Ran: 469 tests in 4273.6309 sec.
- Passed: 398
- Skipped: 69
- Expected Fail: 0
- Unexpected Success: 0
- Failed: 2
Sum of execute time for each test: 4387.2727 sec.
The 2 failed tests
(tempest.api.object_storage.test_account_quotas.AccountQuotasTest and
octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest)
are not related to OVN metadata or to this fix; these 2 tests fail
whether or not the fix is applied, so they can be ignored.
[Where problems could occur]
These patches are related to the OVN metadata agent on compute nodes.
VM connectivity could possibly be affected by this patch when OVN is used.
Binding a port to a datapath could be affected.
[Others]
== ORIGINAL DESCRIPTION ==
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650
During a scalability test it was noted that a few VMs were having
issues being pinged (2 out of ~5000 VMs in the test conducted). After
some investigation it was found that the VMs in question did not
receive a DHCP lease:
udhcpc: no lease, failing
FAIL
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 181.90. request failed
And the ovnmeta- namespaces for the networks that the VMs were booting
from were missing. Looking into the ovn-metadata-agent.log:
2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
[-] There is no metadata port for network
9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
configured, tearing the namespace down if needed _get_provision_params
/usr/lib/python3.9/site-
packages/neutron/agent/ovn/metadata/agent.py:495
Apparently, when the system is under stress (scalability tests) there
are some edge cases where the metadata port information has not yet
been propagated by OVN to the Southbound database; when the
PortBindingChassisEvent event is handled and the agent tries to find
either the metadata port or the IP information on it (which is updated
by ML2/OVN during subnet creation), it cannot be found and the agent
fails silently with the error shown above.
Note that running the same tests with less concurrency did not
trigger this issue, so it only happens when the system is overloaded.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions
More information about the Ubuntu-openstack-bugs
mailing list