[Bug 1918948] Re: Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)
Guilherme G. Piccoli
1918948 at bugs.launchpad.net
Fri Mar 12 17:49:04 UTC 2021
I could narrow the problem down to a GRUB issue and, more relevantly,
to a Dell BIOS issue as well. So, 2 bugs!
Regarding GRUB, the disk information is read through int 13h/service
48h from the BIOS [0]. Testing on an HP machine (also with HW RAID) and
comparing GRUB's result with the output of what we call the EDD kernel
module [1] (the early boot kernel code that runs in real mode and
performs the exact same query as GRUB), it seems GRUB is only filling
the lower 32 bits of the total_sectors variable. This is indeed a bug,
and I've reported it to the GRUB community [2]. But notice I've
mentioned an HP machine... because the Dell machine has an even worse
bug.
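To illustrate the GRUB side of the problem, here is a minimal Python
sketch (not GRUB's actual code) of decoding the start of the result
buffer returned by int 13h/service 48h. The field layout follows the
description in [0]; the function names and the sample sector count are
made up for illustration. It contrasts reading the full 64-bit sector
count with keeping only its lower 32 bits:

```python
import struct

# Start of the int 13h/AH=48h result buffer, little-endian, per [0]:
# size (u16), flags (u16), cylinders (u32), heads (u32),
# sectors per track (u32), total sectors (u64), bytes per sector (u16).
EDD_PARAMS = struct.Struct("<HHIIIQH")


def parse_total_sectors(buf):
    """Return the full 64-bit total sector count from the buffer."""
    fields = EDD_PARAMS.unpack_from(buf)
    return fields[5]


def parse_total_sectors_low32(buf):
    """Mimic the suspected bug: keep only the lower 32 bits."""
    return parse_total_sectors(buf) & 0xFFFFFFFF


# Hypothetical disk with 2**34 sectors of 512 bytes (8 TiB).
sample = EDD_PARAMS.pack(EDD_PARAMS.size, 0, 0, 0, 0, 2**34, 512)
print(hex(parse_total_sectors(sample)))
print(hex(parse_total_sectors_low32(sample)))
```

With a sector count that does not fit in 32 bits, the second value no
longer describes the real disk size.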
In the case of Dell, the value returned by the BIOS, read both from the
kernel EDD module and from GRUB, is 0xFFFFFFFF. This is a fully set
32-bit mask, which (with 512-byte sectors) would correspond to a disk
of 2TB. In my case, the HW RAID5 exposes a disk of 8TB! So, regardless
of whether GRUB is only filling the lower 32 bits, the value from the
Dell BIOS is wrong. I've sought Dell support through some Dell contacts
I found in kernel commits [3]; they promptly responded, forwarding my
issue report internally to their firmware team, but they have been
quiet since then. I hope they can clarify the reason for this bogus
return value from their BIOS.
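The arithmetic behind the 2TB figure can be checked with a short Python
sketch. It assumes 512-byte sectors and takes "8TB" as 8 * 10**12
bytes; both are assumptions, since the exact sector size and volume
capacity are not stated here:

```python
SECTOR_SIZE = 512  # assumed bytes per sector

reported_sectors = 0xFFFFFFFF  # value returned by the Dell BIOS
reported_bytes = reported_sectors * SECTOR_SIZE

disk_bytes = 8 * 10**12  # assumed size of the RAID5 volume
needed_sectors = disk_bytes // SECTOR_SIZE

print(f"reported capacity: {reported_bytes / 2**40:.2f} TiB")
print(f"sectors needed for 8TB: {needed_sectors}")
print(f"bits needed to store that count: {needed_sectors.bit_length()}")
```

Under these assumptions the real sector count needs more than 32 bits,
so a value of exactly 0xFFFFFFFF cannot be the true size of the volume.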
The GRUB community has also been quiet regarding my bug report.
[0] https://en.wikipedia.org/wiki/INT_13H#INT_13h_AH=48h:_Extended_Read_Drive_Parameters
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/boot/edd.c
[2] https://lists.gnu.org/archive/html/grub-devel/2021-01/msg00052.html
[3] https://lists.gnu.org/archive/html/grub-devel/2021-01/msg00050.html
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1918948
Title:
Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)
Status in grub2 package in Ubuntu:
Confirmed
Bug description:
We have a user reporting the following issue:
After an upgrade, grub couldn't boot any kernel. The system is not
running in UEFI mode, so "grub-pc" is the package used - also it is a
HW RAID5 setup (Dell machine). The bootloader itself was able to get
loaded, including all its base modules (hence the bootloader could
read/write from disk) - also, grub packages were up-to-date and seemed
properly installed. The following kernels were present/installed
there: 4.4.0-148, 4.4.0-189, 4.4.0-190, 4.4.0-193, 4.4.0-194.
Attempting to boot the most recent version (-194), we got the
following grub error: "error: attempt read-write outside of disk
`hd0`" - even dropping to the grub shell and manually trying to load
the file vmlinuz-4.4.0-194-generic (which was being accessed/seen by
grub "ext4 module"), we got the mentioned error. When we tried to
boot *any* of the other kernels, we managed to load the vmlinuz image
for them, but not the initrd - in this case we still got the message
"attempt read-write outside of disk" but grub allowed the boot to
continue and, as expected, Linux failed due to the lack of the
ramdisk image.
After booting from a virtual ISO (Ubuntu installer), we managed to run
"update-grub", "update-initramfs" and "grub-install", not forgetting to
"sync" after all these commands. We had previously duplicated all
initrds, saving them as initrd.img-<kernel_version>.bk . Even after all
that, the exact same symptom was observed in grub. Then, by doing a
complete manual test with all vmlinuz/initrd pairs from the grub shell,
we noticed that the pair vmlinuz-4.4.0-148/initrd-4.4.0-148.bk was
fully readable from grub, so we could boot it. For some reason, that
boot failed (later we observed that this kernel is not properly
installed, missing a bunch of modules in /lib/modules, like the tg3
network driver). Even with an impaired kernel, we took the following
actions from the initramfs shell, and they brought the system to a
bootable state:
(A) We apt-get removed kernels -189 and -190 (and their initrd backups)
(B) We moved all the remaining vmlinuz/initrd pairs (and their backups) to "/"
(C) We *copied* all of them back to /boot, with the goal of duplicating the files in the filesystem
We double-checked the md5 hashes of all the vmlinuz/initrd pairs and
they matched, so the *same files* are present in "/" and "/boot". We
also checked vmlinuz-4.4.0-{193,194} md5 hashes against the package
version, and they matched, so the images are good/healthy. After
all of that (we re-executed "update-grub" and "sync"), we got
repeated/doubled entries in grub: one entry for the
vmlinuz/initrd pair in "/boot" and one for the pair in "/" (the
original files). The original files *still cannot boot*; grub
complains with the same error message. The duplicate files in /boot
boot normally: we tried kernels -194 and -193 twice, and both booted.
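The hash comparison described above could be scripted along these
lines. This is only a sketch: the helper names are hypothetical, and
the default directories simply mirror the "/" and "/boot" locations
mentioned in the report:

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_copies(names, original_dir="/", copy_dir="/boot"):
    """Return the file names whose two copies have different digests."""
    return [
        name for name in names
        if md5sum(Path(original_dir) / name) != md5sum(Path(copy_dir) / name)
    ]
```

For example, mismatched_copies(["vmlinuz-4.4.0-194-generic"]) would
return an empty list when both copies are identical.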
So the (very odd and interesting) problem is: grub can read some files
but not others, even though we know that *all the duplicate
files are the same* and have proven integrity (i.e., the filesystem
and the storage controller/disks seem to be healthy). Why? Very
similar problems were reported in [0] and [1] with no really
good/definitive answer.
HYPOTHESIS:
I think this has to do with the fact that grub *cannot* read some
sectors of the underlying disks, not due to disk corruption, but
due to logical sector accounting/math. Since it's a hardware RAID, I
understand that from the Linux perspective it is "seen" as a single
device. And even from the grub perspective, it's a single disk (called
'hd0' in grub terminology). But maybe grub is doing some low-level
queries to gather physical device information on the underlying disks,
and when it does the sector math, it concludes that the "section" to be
read is outside of the "available" area of the device, giving us this
error. The "BIOS restrictions" mentioned in [0] or [1] could also be
considered: the BIOS or even grub could be unable to deal with
files outside some "range" in the disk, e.g. for security reasons -
although I doubt that, and I lean toward the first theory.
In both theories, it ends up being a restriction on loading a file
*depending* on its logical position on the disk. If that is true, it's
a very awkward limitation. We suggested that the user collect the
following data to understand the topology of the disk and the
logical position (LBA) of the files:
debugfs -R "stat /boot/vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-b.out
hdparm --fibmap /boot/vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-b.out
debugfs -R "stat /vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-r.out
hdparm --fibmap /vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-r.out
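Once the extents of each file are known, the hypothesis can be tested
by comparing them with the sector count the BIOS reports. The following
Python sketch assumes the extents have already been extracted by hand
as (first LBA, last LBA) pairs, expressed as absolute 512-byte sectors
on the disk; it does not parse the debugfs or hdparm output, and the
function name and sample numbers are hypothetical:

```python
BIOS_TOTAL_SECTORS = 0xFFFFFFFF  # value reported by the Dell BIOS


def extents_outside_disk(extents, total_sectors=BIOS_TOTAL_SECTORS):
    """Return the extents that reach past the reported end of the disk.

    Each extent is a (first_lba, last_lba) pair, both inclusive.
    Valid sector numbers run from 0 to total_sectors - 1.
    """
    return [(first, last) for first, last in extents if last >= total_sectors]


# Made-up extents: one well below the 2TB mark, one above it.
below = (1_000_000, 1_016_383)
above = (6_000_000_000, 6_000_016_383)
print(extents_outside_disk([below, above]))
```

If the first theory holds, the unreadable files should be the ones with
at least one extent in the returned list.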
[0] https://askubuntu.com/q/867047
[1] https://askubuntu.com/q/416418
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1918948/+subscriptions