[Bug 2056194] Re: Networking broken in early boot on Oracle Native instances due to MTU settings
Launchpad Bug Tracker
2056194 at bugs.launchpad.net
Wed Mar 27 17:59:27 UTC 2024
This bug was fixed in the package initramfs-tools - 0.142ubuntu23
---------------
initramfs-tools (0.142ubuntu23) noble; urgency=medium
[ Daniel van Vugt ]
* hooks/framebuffer: Only add simple/tiny framebuffer drivers. This is to
limit the size of initrd when FRAMEBUFFER=y is soon enabled for desktop
installations (LP: #1970069, #1869655).
[ Benjamin Drung ]
* autopkgtest: Increase QEMU timeouts on arm64/armhf
* hooks/framebuffer:
- Move adding framebuffer drivers into auto_add_modules
- Drop looking in $MODULESDIR/initrd/ for kernel modules
- Support MODULES=dep in framebuffer hook
initramfs-tools (0.142ubuntu22) noble; urgency=medium
* autopkgtest: update systemd-udevd path from /lib to /usr/lib
initramfs-tools (0.142ubuntu21) noble; urgency=medium
[ Benjamin Drung ]
* configure_networking:
- Increase minimum timeout to 30 seconds
- Fix configuring BOOTIF when using iSCSI (LP: #2056187)
- Set interface MTU if provided by the DHCP server (LP: #2056194)
- log sleep durations before retries
* Copy /etc/passwd into the initramfs to allow dhcpcd running as dhcpcd user
* Replace obsolete pkg-config build-dependency by pkgconf
[ Dan Bungert ]
* Restore nvdimm and dax pmem-related modules (LP: #1981385)
-- Benjamin Drung <bdrung at ubuntu.com> Thu, 21 Mar 2024 10:57:54 +0100
** Changed in: initramfs-tools (Ubuntu)
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to initramfs-tools in Ubuntu.
https://bugs.launchpad.net/bugs/2056194
Title:
Networking broken in early boot on Oracle Native instances due to MTU
settings
Status in cloud-images:
New
Status in cloud-init package in Ubuntu:
Fix Released
Status in initramfs-tools package in Ubuntu:
Fix Released
Bug description:
BACKGROUND:
cloud-init-local.service runs before networking has started. On
non-Oracle platforms, before networking has come up, cloud-init will
create an ephemeral connection to the cloud's IMDS using DHCP to
retrieve instance metadata. On Oracle, this normally isn't necessary
as we boot with connectivity to the IMDS out of the box. This can be
seen in the following Jammy instance using an SR-IOV NIC:
2024-03-05 14:09:05,351 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0, 'headers': {'User-Agent': 'Cloud-Init/23.3.3-0ubuntu0~22.04.1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:09:05,362 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 2663b) after 1 attempts
2024-03-05 14:09:05,362 - ephemeral.py[DEBUG]: Skip ephemeral DHCP setup, instance has connectivity to {'url': 'http://169.254.169.254/opc/v2/instance/', 'headers': {'Authorization': 'Bearer Oracle'}, 'timeout': 5}
2024-03-05 14:09:05,362 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/23.3.3-0ubuntu0~22.04.1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:09:05,368 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 2663b) after 1 attempts
Notice the "Skip ephemeral DHCP setup, instance has connectivity".
This means that cloud-init has determined that it already has
connectivity and doesn't need to do any additional setup to retrieve
data from the IMDS.
We can also see the same behavior on a Noble paravirtualized instance:
2024-03-01 20:51:33,482 - url_helper.py[DEBUG]: [0/1] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 5.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-01 20:51:33,488 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 3067b) after 1 attempts
2024-03-01 20:51:33,488 - ephemeral.py[DEBUG]: Skip ephemeral DHCP setup, instance has connectivity to {'url': 'http://169.254.169.254/opc/v2/instance/', 'headers': {'Authorization': 'Bearer Oracle'}, 'timeout': 5}
2024-03-01 20:51:33,489 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-01 20:51:33,500 - url_helper.py[DEBUG]: Read from http://169.254.169.254/opc/v2/instance/ (200, 3067b) after 1 attempts
2024-03-01 20:51:33,501 - util.py[DEBUG]: Writing to /run/cloud-init/cloud-id-oracle - wb: [644] 7 bytes
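The connectivity check behind the "Skip ephemeral DHCP setup" message can be sketched roughly as follows. This is a minimal illustration, not cloud-init's actual implementation: the names `has_connectivity` and `needs_ephemeral_dhcp` and the injectable `opener` parameter are hypothetical, and only the URL, the header, and the 5-second timeout are taken from the logs above.

```python
import urllib.request

# Values copied from the log lines above.
IMDS_URL = "http://169.254.169.254/opc/v2/instance/"
IMDS_HEADERS = {"Authorization": "Bearer Oracle"}


def has_connectivity(url=IMDS_URL, headers=None, timeout=5.0,
                     opener=urllib.request.urlopen):
    """Return True if one GET to `url` succeeds within `timeout` seconds.

    `opener` is a hypothetical hook so the probe can be exercised
    without a real network.
    """
    request = urllib.request.Request(url, headers=dict(headers or IMDS_HEADERS))
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def needs_ephemeral_dhcp(**probe_args):
    """Ephemeral DHCP is only needed when the IMDS is not already reachable."""
    return not has_connectivity(**probe_args)
```

On the Jammy SR-IOV and Noble paravirtualized instances above, a probe like this would succeed, so the DHCP step would be skipped.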
PROBLEM:
On a Noble instance using Hardware-assisted (SR-IOV) networking, this
is not working. cloud-init-local.service no longer has immediate
connectivity to the IMDS. Since it cannot connect, it then attempts to
create an ephemeral connection to the IMDS using DHCP. It is able to
obtain a DHCP lease, but then when it tries to connect to the IMDS,
the call just hangs. The call has no timeout, so this results in an
instance that cannot be logged into even via the serial console
because cloud-init is blocking the rest of boot. A simple cloud-init
workaround is to add something along the lines of `timeout=2` to
https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceOracle.py#L349 .
This allows cloud-init to boot. Looking at the logs, we can see that
cloud-init is unable to connect to the IMDS:
2024-03-05 14:23:54,836 - ephemeral.py[DEBUG]: Received dhcp lease on ens3 for 10.0.0.133/255.255.255.0
2024-03-05 14:23:54,837 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:23:56,841 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:23:57,842 - url_helper.py[DEBUG]: [1/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:23:59,847 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:00,847 - url_helper.py[DEBUG]: [2/3] open 'http://169.254.169.254/opc/v2/instance/' with {'url': 'http://169.254.169.254/opc/v2/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1', 'Authorization': 'Bearer Oracle'}} configuration
2024-03-05 14:24:02,852 - url_helper.py[DEBUG]: [0/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:04,855 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:05,855 - url_helper.py[DEBUG]: [1/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:07,859 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
2024-03-05 14:24:08,859 - url_helper.py[DEBUG]: [2/3] open 'http://169.254.169.254/opc/v1/instance/' with {'url': 'http://169.254.169.254/opc/v1/instance/', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'timeout': 2.0, 'headers': {'User-Agent': 'Cloud-Init/24.1~7g54599148-0ubuntu1'}} configuration
2024-03-05 14:24:10,863 - handlers.py[DEBUG]: finish: init-local/search-Oracle: FAIL: no local data found from DataSourceOracle
2024-03-05 14:24:10,863 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceOracle.DataSourceOracle'> failed
2024-03-05 14:24:10,863 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceOracle.DataSourceOracle'> failed
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 370, in read_opc_metadata
instance_data = _fetch(metadata_version, path="instance")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 346, in _fetch
return readurl(
^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 370, in readurl
raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=2.0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 1028, in find_source
if s.update_metadata_if_supported(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 914, in update_metadata_if_supported
result = self.get_data()
^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 460, in get_data
return_value = self._check_and_get_data()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 392, in _check_and_get_data
return self._get_data()
^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 165, in _get_data
fetched_metadata = read_opc_metadata(
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 373, in read_opc_metadata
instance_data = _fetch(metadata_version, path="instance")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOracle.py", line 346, in _fetch
return readurl(
^^^^^^^^
File "/usr/lib/python3/dist-packages/cloudinit/url_helper.py", line 370, in readurl
raise excps[-1]
cloudinit.url_helper.UrlError: HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=2.0)
2024-03-05 14:24:10,898 - main.py[DEBUG]: No local datasource found
Despite this, cloud-init is still able to read and render the
networking configuration sourced from initramfs:
2024-03-05 14:24:10,899 - util.py[DEBUG]: Read 272 bytes from /run/net-ens3.conf
...
2024-03-05 14:24:10,914 - stages.py[INFO]: Applying network configuration from initramfs bringup=False: {'config': [{'type': 'physical', 'name': 'ens3', 'subnets': [{'type': 'dhcp', 'control': 'manual', 'netmask': '255.255.255.0', 'broadcast': '10.0.0.255', 'gateway': '10.0.0.1', 'dns_nameservers': ['169.254.169.254']}], 'mac_address': '02:00:17:0f:50:8d'}], 'version': 1}
2024-03-05 14:24:10,914 - util.py[DEBUG]: Writing to /run/cloud-init/sem/apply_network_config.once - wb: [644] 23 bytes
2024-03-05 14:24:10,915 - distros[DEBUG]: Selected renderer 'netplan' from priority list: ['netplan', 'eni', 'sysconfig']
2024-03-05 14:24:10,918 - subp.py[DEBUG]: Running command ['netplan', 'info'] with allowed return codes [0] (shell=False, capture=True)
2024-03-05 14:24:11,109 - subp.py[DEBUG]: command ['netplan', 'info'] took 0.1s to run
2024-03-05 14:24:11,109 - util.py[DEBUG]: Attempting to load yaml from string of length 332 with allowed root types (<class 'dict'>,)
2024-03-05 14:24:11,111 - util.py[DEBUG]: Writing to /etc/netplan/50-cloud-init.yaml - wb: [600] 481 bytes
2024-03-05 14:24:11,111 - subp.py[DEBUG]: Running command ['netplan', 'generate'] with allowed return codes [0] (shell=False, capture=True)
2024-03-05 14:24:11,300 - subp.py[DEBUG]: command ['netplan', 'generate'] took 0.1s to run
This allows networking to come up as expected on the primary
interface, but cloud-init has been unable to fetch userdata/metadata
or retrieve information about any secondary interfaces.
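For illustration, the effect of the `timeout=2` workaround mentioned in the PROBLEM section can be sketched as a bounded fetch. This is not the code in DataSourceOracle.py or url_helper.py: `read_url_bounded` and its `opener` and `sleep` parameters are hypothetical, and the defaults (3 attempts, 2-second timeout, 1-second wait) simply mirror the log lines above.

```python
import time
import urllib.request


def read_url_bounded(url, headers=None, timeout=2.0, retries=3, wait=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, giving up after `retries` attempts instead of hanging.

    `opener` and `sleep` are hypothetical hooks for testing.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    request = urllib.request.Request(url, headers=dict(headers or {}))
    last_error = None
    for attempt in range(retries):
        try:
            with opener(request, timeout=timeout) as response:
                return response.read()
        except OSError as error:  # covers URLError and socket timeouts
            last_error = error
            if attempt + 1 < retries:
                sleep(wait)  # "Please wait 1 seconds while we wait to try again"
    raise last_error
```

With these defaults, a fetch against an endpoint that never answers fails after roughly 3 * 2 + 2 * 1 = 8 seconds, which matches the gap between 14:23:54 and 14:24:02 for the v2 endpoint in the log. The timeout applies to each blocking socket operation, so the bound is approximate.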
SUMMARY:
I see two separate issues here:
1. Cloud-init should be able to deal with the lack of network in early boot. This can be fixed on the cloud-init side.
2. Early boot network connectivity works across every other series and instance type except for Noble using Hardware-assisted (SR-IOV) networking.
I am unsure of the cause of #2.
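Since the bug title points at MTU settings, and the initramfs-tools changelog above mentions setting the interface MTU when the DHCP server provides one, one way to narrow down #2 might be to compare the interface's current MTU with the expected value. The sketch below is only a diagnostic aid under stated assumptions: `read_mtu` and `mtu_matches` are hypothetical helper names, the expected MTU must be supplied by the caller, and the only real interface used is the standard Linux sysfs file /sys/class/net/<interface>/mtu.

```python
from pathlib import Path


def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Return the current MTU of `interface` as reported by sysfs."""
    return int((Path(sysfs_root) / interface / "mtu").read_text().strip())


def mtu_matches(interface, expected_mtu, sysfs_root="/sys/class/net"):
    """Return True if the interface MTU equals `expected_mtu`.

    `expected_mtu` is an assumption supplied by the caller, for example
    the value advertised in the DHCP lease.
    """
    return read_mtu(interface, sysfs_root) == expected_mtu
```

For example, `mtu_matches("ens3", expected_mtu)` run in early boot on an affected instance would show whether the MTU differs from what the DHCP server advertises.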
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-images/+bug/2056194/+subscriptions