[Bug 214814] Re: BUG: soft lockup - CPU#0 stuck for 61s!

TJ ubuntu at tjworld.net
Wed Apr 9 21:39:18 UTC 2008


** Attachment added: "Gutsy i450NX Dell PowerEdge 6300 fix"
   http://launchpadlibrarian.net/13298582/ubuntu-gutsy-pci-i450nx-no-secondary-bus-scan-poweredge-6300.diff

** Changed in: linux (Ubuntu)
   Importance: Undecided => High
     Assignee: (unassigned) => TJ (intuitivenipple)
       Status: New => In Progress
       Target: None => ubuntu-8.04

** Bug watch added: Linux Kernel Bug Tracker #10396
   http://bugzilla.kernel.org/show_bug.cgi?id=10396

** Also affects: linux via
   http://bugzilla.kernel.org/show_bug.cgi?id=10396
   Importance: Unknown
       Status: Unknown

** Description changed:

  See also upstream bug:
  
  http://bugzilla.kernel.org/show_bug.cgi?id=10396
  
  Systems based on the Intel 450NX chipset may fail to recognise some
  devices, leading to driver failures, unhandled IRQs, and other serious
  boot failures. The cause is that this chipset has 3 PCI root buses. When
  it was first released some operating systems (read: Windows NT) didn't
  always correctly discover the 2nd and 3rd PCI buses. As a result the PCI
  BIOS tables were 'hacked' to have a fake bridge device on PCI bus 0 that
  points to the same bus number as the peer bus, so that those buses would
  be scanned correctly by the OS.
  
  $ lspci
  00:0a.0 PCI bridge: Intel Corporation 21154 PCI-to-PCI Bridge
  00:10.0 Host bridge: Intel Corporation 450NX - 82451NX Memory & I/O Controller (rev 03)
  00:12.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  00:13.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  00:14.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  
  As a result, in a well-behaved OS the 2nd and 3rd PCI buses would be
  scanned twice: once as secondaries of the 1st bus, and then as root
  buses in their own right. This caused problems with devices being
  discovered twice.
  
  A fix-up for all i450NX chipsets was introduced in
  arch/i386/pci/fixups.c::pci_fixup_i450nx(). Note: arch/i386 was
  subsequently refactored to arch/x86/. The fix-up checks the PCI config
  for the subsidiary buses and, if it finds them, scans them. This adds
  them to the root_pci_bus list. Later in the boot process the ACPI/PCI
  code reads the ACPI DSDT table, finds the PCI bus entries (PNP0A03) and
  tries to scan them. It fails when scanning the 2nd and 3rd buses with:
  
  [    0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
  [    0.912085] ACPI: Bus 0000:02 not present in PCI namespace
  [    0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
  [    0.920085] ACPI: Bus 0000:03 not present in PCI namespace
  
  Unfortunately the message is misleading: the real reason is that the bus
  is found to be already registered and is therefore ignored. The situation
  can be worked around by booting with "pci=noacpi".
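
To check whether another machine shows the same symptom, the two ACPI messages quoted above can be paired up in a saved kernel log. The following is only a rough sketch: the function name and the regular expressions are my own and are based solely on the four log lines shown in this report, so other kernel versions may word the messages differently.

```python
import re

# Patterns derived only from the log lines quoted in this report.
ROOT_BRIDGE = re.compile(
    r"ACPI: PCI Root Bridge \[(\w+)\] \(([0-9a-fA-F]{4}:[0-9a-fA-F]{2})\)")
NOT_PRESENT = re.compile(
    r"ACPI: Bus ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}) not present in PCI namespace")


def rejected_root_bridges(log_text):
    """Return (bridge_name, bus) pairs for root bridges ACPI reported as not present."""
    names = {}     # bus address -> most recently announced ACPI bridge name
    rejected = []
    for line in log_text.splitlines():
        m = ROOT_BRIDGE.search(line)
        if m:
            names[m.group(2)] = m.group(1)
            continue
        m = NOT_PRESENT.search(line)
        if m:
            bus = m.group(1)
            # "?" marks a bus whose root bridge line was not seen earlier.
            rejected.append((names.get(bus, "?"), bus))
    return rejected


SAMPLE = """\
[    0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
[    0.912085] ACPI: Bus 0000:02 not present in PCI namespace
[    0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
[    0.920085] ACPI: Bus 0000:03 not present in PCI namespace
"""

if __name__ == "__main__":
    for name, bus in rejected_root_bridges(SAMPLE):
        print(name, bus)
```

A non-empty result on an i450NX machine would be a hint, not proof, that the fix-up and the ACPI scan are colliding as described.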
  
  The solution is to make the pci_fixup_i450nx() code selected based on
  the DMI of the system. I've introduced a patch that does this. Initially
  the only DMI it will match is Dell PowerEdge 6300 but if other systems
  are found to be affected the output of "sudo dmidecode" should be
  captured and reported. Additional DMI_MATCH entries can then be added to
  the patch.
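
The idea of a DMI-based decision can be illustrated outside the kernel with a small model. This is not the attached patch: the table name, the field names and the DMI strings below are placeholders of mine, and I am assuming (from the attachment file name) that a match means the secondary bus scan is skipped. Matching is done by substring, which is how I understand DMI_MATCH to compare strings.

```python
# Hypothetical match table. Each entry maps a DMI field to a substring that
# must appear in it. The strings are placeholders; real values should be
# taken from "sudo dmidecode" on the affected machine.
SKIP_SCAN_TABLE = [
    {"sys_vendor": "Dell", "product_name": "PowerEdge 6300"},
]


def dmi_matches(system, entry):
    """True if every field in the entry occurs as a substring in the system's DMI data."""
    return all(needle in system.get(field, "") for field, needle in entry.items())


def should_skip_peer_bus_scan(system, table=SKIP_SCAN_TABLE):
    """True if any table entry matches, i.e. the fix-up scan would be skipped."""
    return any(dmi_matches(system, entry) for entry in table)


if __name__ == "__main__":
    # Made-up DMI data for illustration only.
    machine = {"sys_vendor": "Dell Computer Corporation",
               "product_name": "PowerEdge 6300/400"}
    print(should_skip_peer_bus_scan(machine))
```

Supporting another affected system in this model is a matter of appending one more entry to the table, which mirrors adding a DMI_MATCH entry to the patch.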
  
  I found this reference to the issue in AKM's 2.6.0 mm tree and the
  linux-scsi mailing list archive:
  
  "I can tell you what's going on here.  This is a 450NX based
  motherboard.  The 450NX chipset from Intel was the first chipset to have
  peer PCI busses.  For backwards compatibility, some machine makers
  hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that
  points to the same bus number as the peer bus.  This way if the OS
  didn't know about the peer bus registers it would still find the devices
  by scanning behind the bridge.  In this case we are scanning behind this
  fake bridge and then also scanning based upon the peer bus registers in
  the chipset, and as a result we are finding the device twice.  In order
  to fix this problem you need to change the peer bus quirk code for the
  450NX chipset to scan the list of bus 0 devices looking for a bridge
  that has the same config as the peer bus registers and if so delete the
  bridge from the list.  That will avoid double scanning and will avoid
  having the PCI code try and configure sub busses via a fake bridge when
  it should do all configurations via the 450NX peer bus registers.
  
  -- 
    Doug Ledford <dledford at redhat.com>"
  
  http://marc.info/?l=linux-scsi&m=106839680416899&w=2
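
For comparison, the alternative suggested in the quoted message (drop the fake bridge instead of skipping the peer bus scan) can be modelled as follows. This is a simplified sketch, not kernel code: the Device class is hypothetical, the bus numbers in the example are invented, and "same config as the peer bus registers" is reduced here to comparing the bridge's secondary bus number only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical stand-in for a device found on PCI bus 0."""
    slot: str
    is_bridge: bool = False
    secondary_bus: Optional[int] = None  # only meaningful for bridges


def drop_fake_bridges(bus0_devices: List[Device], peer_buses: List[int]) -> List[Device]:
    """Remove bus 0 bridges whose secondary bus duplicates a chipset peer bus."""
    peers = set(peer_buses)
    return [d for d in bus0_devices
            if not (d.is_bridge and d.secondary_bus in peers)]


if __name__ == "__main__":
    # Invented example data: one bridge shadows peer bus 2, another is genuine.
    bus0 = [
        Device("00:0a.0", is_bridge=True, secondary_bus=2),
        Device("00:0b.0", is_bridge=True, secondary_bus=5),
        Device("00:10.0"),
    ]
    kept = drop_fake_bridges(bus0, peer_buses=[2, 3])
    print([d.slot for d in kept])
```

With the fake bridge gone, each peer bus would be reached only through the chipset's peer bus registers, so nothing is scanned twice.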
  
  In this particular case a Dell PowerEdge 6300 with a PERC 2 RAID array
- controller (aacraid) fails to boot on any kernel after v.2.6.20
- (Feisty). Reports show:
+ controller (aacraid) fails to boot on any kernel after v2.6.20 (Feisty).
+ Reports show:
  
  [ 0.000000] Linux version 2.6.24-15-generic (root@PowerEdge6300) (gcc
  version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #1 SMP Fri Apr 4 09:18:39 BST
  2008 (Ubuntu 2.6.24-15.26-generic)
  
  [ 436.079664] Adaptec aacraid driver 1.1-5[2449]-ms
  
  [ 492.476969] BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]
  [ 492.483317]
  [ 492.484874] Pid: 1376, comm: modprobe Not tainted (2.6.24-15-generic #1)
  [ 492.491642] EIP: 0060:[<c0216641>] EFLAGS: 00000287 CPU: 2
  [ 492.497226] EIP is at delay_tsc+0x41/0x50
  [ 492.501302] EAX: 0000059e EBX: 0000003f ECX: 00000000 EDX: 0000003f
  [ 492.507640] ESI: 17c02b3e EDI: df84f278 EBP: 17c025a0 ESP: df9dfd4c
  [ 492.513972] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  [ 492.519443] CR0: 8005003b CR2: 0812574c CR3: 1f97b000 CR4: 00000690
  [ 492.525781] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  [ 492.532114] DR6: ffff0ff0 DR7: 00000400
  [ 492.536029] [<c02165c6>] __delay+0x6/0x10
  [ 492.540264] [<f89496aa>] aac_fib_send+0x21a/0x2d0 [aacraid]
  [ 492.546108] [<c012363a>] enqueue_task_fair+0x1a/0x30
  [ 492.551318] [<f8945a94>] aac_get_adapter_info+0x74/0x620 [aacraid]
  [ 492.557753] [<f8942f54>] aac_probe_one+0x224/0x450 [aacraid]
  [ 492.563642] [<f8949b80>] aac_command_thread+0x0/0x6d0 [aacraid]
  [ 492.569801] [<c0223136>] pci_device_probe+0x56/0x80
  [ 492.574903] [<c027e85e>] driver_probe_device+0x8e/0x190
  [ 492.580373] [<c027eace>] __driver_attach+0x9e/0xa0
  [ 492.585385] [<c027dc7b>] bus_for_each_dev+0x3b/0x60
  [ 492.590491] [<c027e6d6>] driver_attach+0x16/0x20
  [ 492.595330] [<c027ea30>] __driver_attach+0x0/0xa0
  [ 492.600259] [<c027e00a>] bus_add_driver+0x8a/0x1e0
  [ 492.605281] [<c02232e3>] __pci_register_driver+0x53/0xa0
  [ 492.610815] [<f8850033>] aac_init+0x33/0x74 [aacraid]
  [ 492.616098] [<c0151511>] sys_init_module+0x151/0x1990
  [ 492.621377] [<c01778fa>] __do_fault+0x21a/0x410
  [ 492.626170] [<c0166421>] handle_fasteoi_irq+0x91/0xf0
  [ 492.631465] [<c01053b2>] syscall_call+0x7/0xb
  [ 492.636066] =======================
  
  [   17.155571] irq 10: nobody cared (try booting with the "irqpoll" option)
  [   17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1
  [   17.155571]  [<c025ad74>] __report_bad_irq+0x24/0x80
  
  This was first thought to be part of bug #149071 "-server kernel variant
  fails to boot on PowerEdge 2650 with AACRAID timeouts" but it now
  appears likely that the two have different root causes.
  
  Attached here are patches for Gutsy and Hardy. An upstream patch for
  v2.6.25-rc8 is attached to the bugzilla report.

** Description changed:

  See also upstream bug:
  
  http://bugzilla.kernel.org/show_bug.cgi?id=10396
  
  Systems based on the Intel 450NX chipset may fail to recognise some
  devices, leading to driver failures, unhandled IRQs, and other serious
  boot failures. The cause is that this chipset has 3 PCI root buses. When
  it was first released some operating systems (read: Windows NT) didn't
  always correctly discover the 2nd and 3rd PCI buses. As a result the PCI
  BIOS tables were 'hacked' to have a fake bridge device on PCI bus 0 that
  points to the same bus number as the peer bus, so that those buses would
  be scanned correctly by the OS.
  
  $ lspci
  00:0a.0 PCI bridge: Intel Corporation 21154 PCI-to-PCI Bridge
  00:10.0 Host bridge: Intel Corporation 450NX - 82451NX Memory & I/O Controller (rev 03)
  00:12.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  00:13.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  00:14.0 Host bridge: Intel Corporation 450NX - 82454NX/84460GX PCI Expander Bridge (rev 04)
  
  As a result, in a well-behaved OS the 2nd and 3rd PCI buses would be
  scanned twice: once as secondaries of the 1st bus, and then as root
  buses in their own right. This caused problems with devices being
  discovered twice.
  
  A fix-up for all i450NX chipsets was introduced in
  arch/i386/pci/fixups.c::pci_fixup_i450nx(). Note: arch/i386 was
  subsequently refactored to arch/x86/. The fix-up checks the PCI config
  for the subsidiary buses and, if it finds them, scans them. This adds
  them to the root_pci_bus list. Later in the boot process the ACPI/PCI
  code reads the ACPI DSDT table, finds the PCI bus entries (PNP0A03) and
  tries to scan them. It fails when scanning the 2nd and 3rd buses with:
  
  [    0.910906] ACPI: PCI Root Bridge [PX0B] (0000:02)
  [    0.912085] ACPI: Bus 0000:02 not present in PCI namespace
  [    0.917111] ACPI: PCI Root Bridge [PX1A] (0000:03)
  [    0.920085] ACPI: Bus 0000:03 not present in PCI namespace
  
  Unfortunately the message is misleading: the real reason is that the bus
  is found to be already registered and is therefore ignored. The situation
  can be worked around by booting with "pci=noacpi".
  
- The solution is to make the pci_fixup_i450nx() code selected based on
+ The solution is to make the pci_fixup_i450nx() code selective based on
  the DMI of the system. I've introduced a patch that does this. Initially
  the only DMI it will match is Dell PowerEdge 6300 but if other systems
  are found to be affected the output of "sudo dmidecode" should be
  captured and reported. Additional DMI_MATCH entries can then be added to
  the patch.
  
  I found this reference to the issue in AKM's 2.6.0 mm tree and the
  linux-scsi mailing list archive:
  
  "I can tell you what's going on here.  This is a 450NX based
  motherboard.  The 450NX chipset from Intel was the first chipset to have
  peer PCI busses.  For backwards compatibility, some machine makers
  hacked their PCI BIOS to have a fake bridge device on PCI bus 0 that
  points to the same bus number as the peer bus.  This way if the OS
  didn't know about the peer bus registers it would still find the devices
  by scanning behind the bridge.  In this case we are scanning behind this
  fake bridge and then also scanning based upon the peer bus registers in
  the chipset, and as a result we are finding the device twice.  In order
  to fix this problem you need to change the peer bus quirk code for the
  450NX chipset to scan the list of bus 0 devices looking for a bridge
  that has the same config as the peer bus registers and if so delete the
  bridge from the list.  That will avoid double scanning and will avoid
  having the PCI code try and configure sub busses via a fake bridge when
  it should do all configurations via the 450NX peer bus registers.
  
  -- 
    Doug Ledford <dledford at redhat.com>"
  
  http://marc.info/?l=linux-scsi&m=106839680416899&w=2
  
  In this particular case a Dell PowerEdge 6300 with a PERC 2 RAID array
  controller (aacraid) fails to boot on any kernel after v2.6.20 (Feisty).
  Reports show:
  
  [ 0.000000] Linux version 2.6.24-15-generic (root@PowerEdge6300) (gcc
  version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #1 SMP Fri Apr 4 09:18:39 BST
  2008 (Ubuntu 2.6.24-15.26-generic)
  
  [ 436.079664] Adaptec aacraid driver 1.1-5[2449]-ms
  
  [ 492.476969] BUG: soft lockup - CPU#2 stuck for 11s! [modprobe:1376]
  [ 492.483317]
  [ 492.484874] Pid: 1376, comm: modprobe Not tainted (2.6.24-15-generic #1)
  [ 492.491642] EIP: 0060:[<c0216641>] EFLAGS: 00000287 CPU: 2
  [ 492.497226] EIP is at delay_tsc+0x41/0x50
  [ 492.501302] EAX: 0000059e EBX: 0000003f ECX: 00000000 EDX: 0000003f
  [ 492.507640] ESI: 17c02b3e EDI: df84f278 EBP: 17c025a0 ESP: df9dfd4c
  [ 492.513972] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  [ 492.519443] CR0: 8005003b CR2: 0812574c CR3: 1f97b000 CR4: 00000690
  [ 492.525781] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
  [ 492.532114] DR6: ffff0ff0 DR7: 00000400
  [ 492.536029] [<c02165c6>] __delay+0x6/0x10
  [ 492.540264] [<f89496aa>] aac_fib_send+0x21a/0x2d0 [aacraid]
  [ 492.546108] [<c012363a>] enqueue_task_fair+0x1a/0x30
  [ 492.551318] [<f8945a94>] aac_get_adapter_info+0x74/0x620 [aacraid]
  [ 492.557753] [<f8942f54>] aac_probe_one+0x224/0x450 [aacraid]
  [ 492.563642] [<f8949b80>] aac_command_thread+0x0/0x6d0 [aacraid]
  [ 492.569801] [<c0223136>] pci_device_probe+0x56/0x80
  [ 492.574903] [<c027e85e>] driver_probe_device+0x8e/0x190
  [ 492.580373] [<c027eace>] __driver_attach+0x9e/0xa0
  [ 492.585385] [<c027dc7b>] bus_for_each_dev+0x3b/0x60
  [ 492.590491] [<c027e6d6>] driver_attach+0x16/0x20
  [ 492.595330] [<c027ea30>] __driver_attach+0x0/0xa0
  [ 492.600259] [<c027e00a>] bus_add_driver+0x8a/0x1e0
  [ 492.605281] [<c02232e3>] __pci_register_driver+0x53/0xa0
  [ 492.610815] [<f8850033>] aac_init+0x33/0x74 [aacraid]
  [ 492.616098] [<c0151511>] sys_init_module+0x151/0x1990
  [ 492.621377] [<c01778fa>] __do_fault+0x21a/0x410
  [ 492.626170] [<c0166421>] handle_fasteoi_irq+0x91/0xf0
  [ 492.631465] [<c01053b2>] syscall_call+0x7/0xb
  [ 492.636066] =======================
  
  [   17.155571] irq 10: nobody cared (try booting with the "irqpoll" option)
  [   17.155571] Pid: 0, comm: swapper Not tainted 2.6.25-rc8-custom #1
  [   17.155571]  [<c025ad74>] __report_bad_irq+0x24/0x80
  
  This was first thought to be part of bug #149071 "-server kernel variant
  fails to boot on PowerEdge 2650 with AACRAID timeouts" but it now
  appears likely that the two have different root causes.
  
  Attached here are patches for Gutsy and Hardy. An upstream patch for
  v2.6.25-rc8 is attached to the bugzilla report.

-- 
BUG: soft lockup - CPU#0 stuck for 61s!
https://bugs.launchpad.net/bugs/214814
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to Linux.



