[Bug 1918948] Re: Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)

Filofel 1918948 at bugs.launchpad.net
Fri Jun 3 13:30:55 UTC 2022


Hi Guilherme,

My machine is a 10-year-old (but i7-based, quite powerful) HP EliteBook 8560w.
It has an old BIOS, no longer maintained, with an experimental / early UEFI
implementation that I'd rather not use.
The SSD is an 8TB ATA Samsung SSD 870 device. No RAID is involved, just a
plain 8TB disk.
"--disk-module=ahci" gets it to work because GRUB's native ahci driver
supports 64-bit LBA addressing, while GRUB by default goes through the BIOS
LBA interface, which is 32-bit.
Since the partition table is GPT, GRUB has access to the disk size and the
partition addresses as 64-bit values; it does not have to rely on legacy
DOS partition C/H/S information.
I assume GRUB uses the BIOS driver by default to get maximum compatibility
with the real BIOS, and this is fine until your boot files land above the
4TiB limit... which is what happened to me a while after I upgraded my 4TB
Crucial disk to an 8TB Samsung one. At that point, GRUB failed with an
"attempt to read outside of hd0" message. The message is actually not
accurate, since the boot files are indeed inside the partition; it's just
that the BIOS driver is not able to reach them.
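
For reference, here is a quick back-of-the-envelope sketch (Python, purely
illustrative) of how far a 32-bit sector index reaches for a few possible
logical sector sizes; the exact ceiling a given BIOS exposes depends on the
sector size it reports:

# Purely illustrative: bytes addressable with a 32-bit sector index, for a
# few possible logical sector sizes (not a statement about any specific BIOS).
for sector_size in (512, 1024, 4096):
    max_bytes = (2 ** 32) * sector_size
    print(f"{sector_size}-byte sectors -> {max_bytes // 2**40} TiB addressable")
# 512 -> 2 TiB, 1024 -> 4 TiB, 4096 -> 16 TiB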

This problem should show up more and more now that the price of 8TB disks
is going down - and people start upgrading.
And this is not really a matter of Ubuntu picking the wrong default; as far
as I understand it, it's purely a GRUB (thus GNU / FSF) problem: the
official GRUB documentation states that GRUB defaults to the BIOS drivers.
It would be nice if GRUB were upgraded to default to the native drivers
when it detects that the boot partition crosses the 4TiB limit...

Best!


On Thu, Jun 2, 2022 at 11:40 PM Guilherme G. Piccoli <
1918948 at bugs.launchpad.net> wrote:

> @Filofel, thanks for your report! Very interesting. Is your machine a
> Dell, running with HW RAID?
>
> My experience so far with GRUB and this LP bug is that there are two
> things here:
>
> (a) It seems commit [0] might be missing in old Ubuntu releases (IIRC
> Xenial, but *maybe* Bionic). This might cause issues with some 2T+ disks
> ...
>
> (b) [And this is the main issue reported in this LP] The Dell HW RAID
> *legacy BIOS* driver has a bug and fails to report some data properly to
> Grub. In other words: imagine 5x1T disks composing a HW RAID of 5T. Grub
> asks the BIOS for data (likely through nativedisk) and the legacy BIOS
> driver from Dell returns data from disk 01 only - 1T of size. So, if GRUB
> itself (some module, for example) or the kernel/initrd are present in
> sectors *after* disk 01's size, it fails to load.
>
> For more reference on that, please check the following thread:
> https://lists.gnu.org/archive/html/grub-devel/2021-03/msg00380.html
> [According to Dell, this is "expected" and UEFI is required!]
>
>
> So, your case seems a little bit different. It might be worth clarifying
> the exact model of your 4T disk, the firmware version (and of course the
> machine model), and maybe collecting some GRUB logs.
> Interesting that you mentioned running with "--disk-module=ahci" makes it
> work - I wonder why this isn't set by default on Ubuntu; likely there is a
> reason I'm not aware of.
>
> Cheers!
>
>
> [0] https://git.savannah.gnu.org/cgit/grub.git/commit/?id=d1130afa5f
>

-- 
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to grub2 in Ubuntu.
https://bugs.launchpad.net/bugs/1918948

Title:
  Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)

Status in grub2 package in Ubuntu:
  Confirmed

Bug description:
  We have a user reporting the following issue:

  After an upgrade, grub couldn't boot any kernel. The system is not
  running in UEFI mode, so "grub-pc" is the package used - it is also a
  HW RAID5 setup (Dell machine). The bootloader itself was able to load,
  including all its base modules (hence the bootloader could read from the
  disk) - also, the grub packages were up-to-date and seemed properly
  installed. The following kernels were present/installed there:
  4.4.0-148, 4.4.0-189, 4.4.0-190, 4.4.0-193, 4.4.0-194 .

  Attempting to boot the most recent version (-194), we got the following
  grub error: "error: attempt read-write outside of disk `hd0`" - even
  dropping to the grub shell and manually trying to load the file
  vmlinuz-4.4.0-194-generic (which was being accessed/seen by grub's "ext4"
  module), we got the mentioned error. Now, if we tried to boot *any* of
  the other kernels, we managed to load the vmlinuz image for them, but not
  the initrd - in this case we still got the message "attempt read-write
  outside of disk", but grub allowed the boot to continue and, as expected,
  Linux failed due to the lack of the ramdisk image.

  After booting from a virtual ISO (Ubuntu installer), we managed to run
  "update-grub", "update-initramfs" and "grub-install", not forgetting to
  "sync" after all these commands. We had previously duplicated all the
  initrds, saving them as initrd.img-<kernel_version>.bk . Even after all
  that, the exact same symptom was observed in grub. By then doing a
  complete manual test of all the vmlinuz/initrd pairs from the grub shell,
  we noticed that the pair vmlinuz-4.4.0-148/initrd-4.4.0-148.bk were both
  readable from grub, so we could boot them. For some reason, the boot
  failed (later we observed that this kernel was not properly installed,
  missing a bunch of modules in /lib/modules, like the tg3 network driver).
  Even with an impaired kernel, from the initramfs shell the following
  actions were taken, which rendered the system bootable:
  (A) We apt-get removed kernels -189 and -190 (and their initrd backups)
  (B) We moved all the remaining vmlinuz/initrd pairs (and their backups) to "/"
  (C) We *copied* all of them back to /boot, with the goal of duplicating the files in the filesystem

  We double-checked the md5 hashes of all the vmlinuz/initrd pairs and
  they matched, so the *same files* are present in "/" and "/boot". We
  also checked the vmlinuz-4.4.0-{193,194} md5 hashes against the package
  versions, and they matched, so the images are good/healthy. After all of
  that (we re-executed "update-grub" and "sync"), we got repeated/doubled
  entries in grub: one entry for the vmlinuz/initrd pair in "/boot" and
  one for the pair in "/" (the original files). The original files *still
  cannot boot*; grub complains with the same error message. The duplicate
  files in /boot can boot normally - we tried kernels -194 and -193 twice,
  and both booted.

  So the (very odd and interesting) problem is: grub can read some files
  but cannot read others, even though we know that *all the duplicate
  files are the same* and their integrity has been proven (i.e., the
  filesystem and the storage controller/disks seem to be healthy). Why?
  Very similar problems were reported in [0] and [1] with no really
  good/definitive answer.

  
  HYPOTHESIS:

  I think this has to do with the fact that grub *cannot* read some
  sectors of the underlying disks - not because of disk corruption, but
  because of logical sector accounting/math. Since it's a hardware RAID, I
  understand that from the Linux perspective it is "seen" as a single
  device. And even from the grub perspective, it's a single disk (called
  'hd0' in grub terminology). But maybe grub is doing some low-level
  queries to gather physical device information on the underlying disks,
  and when it does the sector math, it notices that the "section" to be
  read is outside the "available" area of the device, giving us this
  error. The mentions of "BIOS restrictions" in [0] or [1] could also be
  considered: the BIOS or even grub could be unable to deal with files
  outside some "range" of the disk, for example for security reasons -
  although I doubt that; I lean more toward the first theory.
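
  As a rough illustration of the first theory (the numbers below are
  hypothetical, assuming 512-byte sectors and a ~1TB underlying member
  disk; the real values would have to come from the controller/BIOS):

  # Hypothetical numbers: if the legacy BIOS reports only the size of a
  # single ~1TB member disk instead of the whole array, any read past that
  # boundary would produce exactly this kind of error.
  visible_sectors = 1_000_000_000_000 // 512   # ~1.95 billion sectors "seen" by grub
  file_start_lba  = 4_000_000_000              # example extent start of a vmlinuz/initrd
  if file_start_lba >= visible_sectors:
      print("error: attempt to read or write outside of disk `hd0'")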

  In both theories, it ends up being a restriction on loading a file
  *depending* on its logical position on the disk. If that is true, it's
  a very awkward limitation. The user was asked to collect the following
  data, to understand the topology of the disk and the logical position
  (LBA) of the files:

  debugfs -R "stat /boot/vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-b.out
  hdparm --fibmap /boot/vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-b.out

  debugfs -R "stat /vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-r.out
  hdparm --fibmap /vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-r.out
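
  One possible way to cross-check the outputs above against a guessed
  BIOS-visible size (a sketch only: it assumes hdparm's usual
  byte_offset/begin_LBA/end_LBA/sectors column layout, and the threshold
  below is just a placeholder that would need to be replaced with the
  real value):

  #!/usr/bin/env python3
  # Sketch: flag file extents that end beyond the number of sectors the
  # BIOS (or the RAID legacy driver) is believed to expose to grub.
  import sys

  BIOS_VISIBLE_SECTORS = 2 ** 32               # placeholder, e.g. a 32-bit LBA cap

  with open(sys.argv[1]) as fibmap:            # e.g. hdparm-vmlinuz194-r.out
      for line in fibmap:
          fields = line.split()
          # data rows of --fibmap have four numeric columns:
          # byte_offset  begin_LBA  end_LBA  sectors
          if len(fields) == 4 and all(tok.isdigit() for tok in fields):
              byte_offset, begin_lba, end_lba, sectors = map(int, fields)
              status = "OUTSIDE" if end_lba >= BIOS_VISIBLE_SECTORS else "ok"
              print(f"extent {begin_lba}..{end_lba} ({sectors} sectors): {status}")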

  
  [0] https://askubuntu.com/q/867047
  [1] https://askubuntu.com/q/416418

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1918948/+subscriptions



