[Bug 1036366] Re: software RAID arrays fail to start on boot
Doug Jones
djsdl at frombob.to
Fri Aug 24 04:52:48 UTC 2012
You're welcome, Dmitrijs.
Now that this system is finally behaving itself (for the first time in
the better part of a year!) I can look at this properly functioning
configuration and compare it with the previously broken one. It is
becoming clearer what happened.
(Note: all of the following are things I have deduced by reading a vast
amount of material from different sources [and giving myself many
headaches in the process], and some of it may not be an entirely
accurate description of what's really happening.)
The superblock contains a field called name. It's not like /dev/md1 or
/dev/md/1 or /dev/md1p1 or anything like that. On my system, it's more
like 5. As it happened, I had two very different arrays (different RAID
levels, sizes, etc.) that both had that name. When you run Disk Utility
and select a RAID array, this is the Name displayed in the right pane.
If the field is empty, the pane shows Name: - but on my system two
arrays showed Name: 5. I didn't choose this name; I think mdadm assigned
it because each array happened to be assembled as /dev/md5 at the time
it was created, and the two arrays were created at different times (of
course, these device names can change arbitrarily whenever you boot).
But of course it's more complicated than that. That's just part of the
name; the superblock actually contains a 'fully qualified name' that is
of the form hostname:name and Disk Utility only displays the last part
of it. The hostname part is just the hostname at the time the array is
created.
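To make the hostname:name layout concrete, here is a minimal Python
sketch that splits a fully qualified name into its two parts. The
function name split_md_name is made up for illustration, and splitting
at the first colon is an assumption (hostnames cannot contain a colon);
this is not how mdadm itself is implemented.

```python
def split_md_name(full_name):
    """Split 'hostname:name' into (hostname, name).

    Assumption: the hostname part never contains ':', so the first
    colon is the separator. A name with no colon has no hostname part.
    """
    host, sep, short = full_name.partition(":")
    if not sep:
        return None, full_name
    return host, short


# Disk Utility would show only the second element of this tuple.
print(split_md_name("precisetest:5"))
```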
My system has a long history. A year ago, its hostname was different,
and one of the arrays was created then. After the system became
unstable (when I upgraded to Oneiric and gained a particularly buggy
version of mdadm) I stopped using it and backed all the data off.
When Precise became available, I did a fresh install onto a non-RAID
partition and left all the existing RAID partitions in place for testing
purposes. Because I was no longer going to use this system as a file
server, but as a test machine, I gave it a different hostname
(precisetest). A bit later I added another array and mdadm assigned it
the name 5, presumably because it was sitting at /dev/md5 at the time.
I did not even notice this at first. Of course, the fully qualified
names stored in the superblocks were actually different, having
different hostnames on the front, so even though Disk Utility showed the
name 5 on both arrays, they really had different full names.
Although RAID was really messed up on this system, that was only a
problem at boot time. After booting, I could go into Disk Utility and
manually start all affected arrays. Once this was done, the system
worked great, until the next reboot. RAID was working; I could access
files on any array. I came to the (perhaps incorrect?) conclusion that
these two arrays having the same name was not a problem. After all, I
could look at mdadm.conf and see that the arrays really had different
names (the fully qualified names are shown there).
Now I am thinking that it is not sufficient that the fully qualified
names be unique. I think the part of the name after the : has to be
unique too, otherwise problems happen at boot, at least on Ubuntu
Precise. But I don't think mdadm upstream intended it to be that way.
So: some part of the boot process is getting hung up on these
(apparently) duplicate names, because it is looking at just the short
names instead of the fully qualified names. (In the udev scripts
perhaps?) If that code looked at hostname:name instead of just name,
perhaps this problem would disappear.
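A quick way to check a system for this situation is to scan the ARRAY
lines of mdadm.conf and group the arrays by the part of the name after
the colon. The following is only a sketch under assumptions: the
function name, the sample UUIDs, and the hostname oldhost are all
hypothetical (the real old hostname is not given here), and the parser
handles only simple whitespace-separated key=value words.

```python
from collections import defaultdict


def find_short_name_clashes(conf_text):
    """Return {short_name: [full names]} for short names used more than once."""
    by_short = defaultdict(list)
    for line in conf_text.splitlines():
        words = line.split()
        if not words or words[0].upper() != "ARRAY":
            continue  # skip comments, DEVICE lines, blank lines, etc.
        for word in words[1:]:
            key, sep, value = word.partition("=")
            if sep and key.lower() == "name":
                # Part after the first ':' or the whole value if no ':'.
                short = value.partition(":")[2] or value
                by_short[short].append(value)
    return {s: names for s, names in by_short.items() if len(names) > 1}


# Hypothetical mdadm.conf excerpt; UUIDs and 'oldhost' are placeholders.
SAMPLE = """\
ARRAY /dev/md/5 metadata=1.2 UUID=00000000:00000000:00000000:00000001 name=oldhost:5
ARRAY /dev/md/6 metadata=1.2 UUID=00000000:00000000:00000000:00000002 name=precisetest:5
ARRAY /dev/md/7 metadata=1.2 UUID=00000000:00000000:00000000:00000003 name=precisetest:7
"""

print(find_short_name_clashes(SAMPLE))
```

With the sample above, the two arrays whose full names differ only in
the hostname part are reported together under the short name 5, which
is the kind of clash described here.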
--
You received this bug notification because you are a member of Ubuntu
Foundations Bugs, which is subscribed to mdadm in Ubuntu.
https://bugs.launchpad.net/bugs/1036366
Title:
software RAID arrays fail to start on boot
Status in “mdadm” package in Ubuntu:
Invalid
Bug description:
Some software RAID arrays fail to start on boot. Exactly two of my
arrays (but not always the same two!) do not start, on every single
boot, and I have done 24 boots since I started taking detailed notes.
Have been running Ubuntu 12.04 with latest updates. Two days ago I
selectively upgraded mdadm to 3.2.5 from -proposed, as suggested in
bug #942106; that upgrade helped some other people, but not me. Over
the last few months, various updates in kernel and mdadm have resulted
in great improvement of symptoms, but no complete cure so far.
Note that the following symptoms once regularly occurred on this system, but have NOT occurred in the past few weeks:
- Having to wait for a degraded array to resync
- Having to manually re-attach a component (usually a spare) that had become detached
- Having to drop to the command line to zero a superblock before reattaching a component
- Having an array containing swap fail to start
- Having to use anything other than Disk Utility to get arrays running properly again
This system has six SATA drives on two controllers. It contains seven RAID arrays, including RAID 1, RAID 10, and RAID 6; all are listed in fstab. Some use 0.90.0 metadata and some use 1.2 metadata. The root filesystem is not on a RAID array (at least not any more; I got tired of that REAL fast) but everything else (including /boot and all swap) is on RAID. One array is used for /boot, two for swap, and the other four are just there for testing purposes.
BOOT_DEGRADED is set. All partitions are GPT. Not using LUKS or LVM.
  All drives are 2TB and by various manufacturers, and I suspect some
  have 512B physical sectors and some have 4KB sectors. This is an
  AMD64 system with 8GB RAM.
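
  The suspicion about mixed physical sector sizes can be checked rather
  than guessed, because the Linux kernel exposes the value per drive
  under /sys/block/<dev>/queue/physical_block_size. A minimal Python
  sketch follows; the function name and the sys_block parameter are
  illustrative choices, not part of any existing tool.

```python
import os


def physical_block_sizes(sys_block="/sys/block"):
    """Return {device: physical sector size in bytes} for readable devices."""
    sizes = {}
    for dev in sorted(os.listdir(sys_block)):
        path = os.path.join(sys_block, dev, "queue", "physical_block_size")
        try:
            with open(path) as f:
                sizes[dev] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # device without a readable or numeric size entry
    return sizes


if __name__ == "__main__":
    for dev, size in physical_block_sizes().items():
        print(dev, size)
```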
This system has had about four different versions of Ubuntu on it over the last few years, and has had multiple RAID arrays on it from the beginning. (This is why some of the arrays are still using 0.90.0 metadata, and why there are so many arrays; some arrays are old partitions containing root and home and such from earlier incarnations.) RAID worked fine until the system was upgraded to Oneiric early in 2012 (no, the problem did not start with Precise).
I have carefully tested the system every time an updated kernel or
mdadm has appeared, ever since the problem started. The behavior has
gradually improved over the last several months. This latest proposed
version of mdadm (3.2.5), thankfully, did not result in regressions,
but also did not result in significant improvement on this system;
have rebooted five times since then and the behavior is consistent.
When the problem first started, on Oneiric, I had the root file system on RAID. This was unpleasant. I stopped using the system for a while, as I had another one running Maverick, which was reliable.
When I noticed some discussion of possibly related bugs on the Linux
RAID list (I've been lurking there for years) I decided to test the
system some more. By then Precise was out, so I upgraded. That did
not help. Eventually I backed up all data onto another system and did
a clean install of Precise on a non-RAID partition, which made the
system tolerable. I left /boot on a RAID1 array (on all six drives),
but that does not prevent the system from booting even if /boot does
not start during Ubuntu startup (I assume because GRUB can find /boot
even if Ubuntu later can't).
I started taking detailed notes in May (seven cramped pages so far).
Have rebooted 24 times since then. On every boot, exactly two arrays
did not start. Which arrays they were, varied from boot to boot;
could be any of the arrays (but recently, swap arrays are not
affected). No apparent correlation with metadata type or RAID level.
ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: mdadm 3.2.5-1ubuntu0.2
ProcVersionSignature: Ubuntu 3.2.0-29.46-generic 3.2.24
Uname: Linux 3.2.0-29-generic x86_64
ApportVersion: 2.0.1-0ubuntu12
Architecture: amd64
Date: Mon Aug 13 12:10:36 2012
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
MDadmExamine.dev.sda:
/dev/sda:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sda1: Error: command ['/sbin/mdadm', '-E', '/dev/sda1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda1.
MDadmExamine.dev.sda11: Error: command ['/sbin/mdadm', '-E', '/dev/sda11'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda11.
MDadmExamine.dev.sda4: Error: command ['/sbin/mdadm', '-E', '/dev/sda4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda4.
MDadmExamine.dev.sda5: Error: command ['/sbin/mdadm', '-E', '/dev/sda5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda5.
MDadmExamine.dev.sda6: Error: command ['/sbin/mdadm', '-E', '/dev/sda6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda6.
MDadmExamine.dev.sda7: Error: command ['/sbin/mdadm', '-E', '/dev/sda7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sda7.
MDadmExamine.dev.sdb:
/dev/sdb:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sdb1: Error: command ['/sbin/mdadm', '-E', '/dev/sdb1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb1.
MDadmExamine.dev.sdb11: Error: command ['/sbin/mdadm', '-E', '/dev/sdb11'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb11.
MDadmExamine.dev.sdb4: Error: command ['/sbin/mdadm', '-E', '/dev/sdb4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb4.
MDadmExamine.dev.sdb5: Error: command ['/sbin/mdadm', '-E', '/dev/sdb5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb5.
MDadmExamine.dev.sdb6: Error: command ['/sbin/mdadm', '-E', '/dev/sdb6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb6.
MDadmExamine.dev.sdb7: Error: command ['/sbin/mdadm', '-E', '/dev/sdb7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdb7.
MDadmExamine.dev.sdc:
/dev/sdc:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sdc1: Error: command ['/sbin/mdadm', '-E', '/dev/sdc1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc1.
MDadmExamine.dev.sdc4: Error: command ['/sbin/mdadm', '-E', '/dev/sdc4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc4.
MDadmExamine.dev.sdc5: Error: command ['/sbin/mdadm', '-E', '/dev/sdc5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc5.
MDadmExamine.dev.sdc6: Error: command ['/sbin/mdadm', '-E', '/dev/sdc6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc6.
MDadmExamine.dev.sdc7: Error: command ['/sbin/mdadm', '-E', '/dev/sdc7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdc7.
MDadmExamine.dev.sdd:
/dev/sdd:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sdd1: Error: command ['/sbin/mdadm', '-E', '/dev/sdd1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdd1.
MDadmExamine.dev.sdd4: Error: command ['/sbin/mdadm', '-E', '/dev/sdd4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdd4.
MDadmExamine.dev.sdd5: Error: command ['/sbin/mdadm', '-E', '/dev/sdd5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdd5.
MDadmExamine.dev.sdd6: Error: command ['/sbin/mdadm', '-E', '/dev/sdd6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdd6.
MDadmExamine.dev.sdd7: Error: command ['/sbin/mdadm', '-E', '/dev/sdd7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdd7.
MDadmExamine.dev.sde: Error: command ['/sbin/mdadm', '-E', '/dev/sde'] failed with exit code 1: mdadm: cannot open /dev/sde: No medium found
MDadmExamine.dev.sdf:
/dev/sdf:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sdf1: Error: command ['/sbin/mdadm', '-E', '/dev/sdf1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf1.
MDadmExamine.dev.sdf11: Error: command ['/sbin/mdadm', '-E', '/dev/sdf11'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf11.
MDadmExamine.dev.sdf4: Error: command ['/sbin/mdadm', '-E', '/dev/sdf4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf4.
MDadmExamine.dev.sdf5: Error: command ['/sbin/mdadm', '-E', '/dev/sdf5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf5.
MDadmExamine.dev.sdf6: Error: command ['/sbin/mdadm', '-E', '/dev/sdf6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf6.
MDadmExamine.dev.sdf7: Error: command ['/sbin/mdadm', '-E', '/dev/sdf7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdf7.
MDadmExamine.dev.sdg:
/dev/sdg:
MBR Magic : aa55
Partition[0] : 3907029167 sectors at 1 (type ee)
MDadmExamine.dev.sdg1: Error: command ['/sbin/mdadm', '-E', '/dev/sdg1'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg1.
MDadmExamine.dev.sdg11: Error: command ['/sbin/mdadm', '-E', '/dev/sdg11'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg11.
MDadmExamine.dev.sdg4: Error: command ['/sbin/mdadm', '-E', '/dev/sdg4'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg4.
MDadmExamine.dev.sdg5: Error: command ['/sbin/mdadm', '-E', '/dev/sdg5'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg5.
MDadmExamine.dev.sdg6: Error: command ['/sbin/mdadm', '-E', '/dev/sdg6'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg6.
MDadmExamine.dev.sdg7: Error: command ['/sbin/mdadm', '-E', '/dev/sdg7'] failed with exit code 1: mdadm: No md superblock detected on /dev/sdg7.
MachineType: System manufacturer System Product Name
ProcEnviron:
TERM=xterm
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-29-generic root=UUID=9035fd0f-c11b-405f-82b1-875ecf527582 ro quiet splash vt.handoff=7
SourcePackage: mdadm
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/08/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2701
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M3A78-EM
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2701:bd10/08/2010:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM3A78-EM:rvrRevX.0x:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/1036366/+subscriptions