[Bug 65631] Re: skge driver broken: invalid call to spin_unlock causes system crash

Thu Oct 12 16:54:54 UTC 2006

I just verified that this defect in the skge driver is also present in
the current git tree for edgy (URL given below). It would be good if it
could be fixed in time: releasing edgy with this bug would prevent
installation on affected machines (without special workarounds),
because accessing the network makes the kernel unstable due to a
wrong preempt_count value.

http://git.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-edgy.git;a=blob;h=a724fea2ad52a6cb6cdbdce13f22efca88d0e7d8;hb=f23a2bfe4a30cf528c1874957f51ab86e5bb7e27;f=drivers/net/skge.c

** Description changed:

  After upgrading to kernel 2.6.15-27.48, which includes version 1.5 of
  the skge driver, we experienced network related lockups on some of our
  machines, beginning with some "badness" warnings, followed by
  "scheduling while atomic" errors and finally a system crash (no response
- to pings, locale console dead). We finally found that the skge gigabit
+ to pings, local console dead). We finally found that the skge gigabit
  ethernet driver was the cause. A comparison between the 1.5 version
  included in 2.6.15-27.48 in dapper/security and the 1.5 version in
  vanilla 2.6.17 (where it is supposed to be taken from) showed that the
  dapper version contains additional calls to spin_unlock for the hw_lock
  of the skge device (and *no* spin_lock calls for hw_lock at all!),
  whereas the 2.6.17 vanilla version seems to have eliminated hw_lock
  completely. I therefore think that something went wrong when
  "transplanting" version 1.5 to 2.6.15-27.48, and the removal of hw_lock
  was not done in all places.

  To verify this analysis, we are currently running a skge.ko module
  compiled from a modified source where we eliminated hw_lock and all
  calls to spin_* corresponding to this lock (basically the 2.6.17 version
  of the driver, but with the pci_device_id patches from dapper). We have
  not yet seen lock-ups from this modified driver.

  Why does the lockup occur only on a subset of our machines? A quick
  glance at the 1.5 code in dapper shows another locking-related coding
  error: The spin_unlock is only called in the second branch of the
  interrupt service routine skge_intr that seems to handle transmission
  errors, while the main branch, handling data I/O, seems to be correct.
  Therefore, the error becomes visible only when the bad branch in the ISR
  is executed, which seems to depend on the cabling to the machine and the
  network load (and fortunately our server machines have better cabling
  and were therefore unaffected by this bug!).

  So, when verifying this bug report, don't be surprised if you can't
  reproduce it in many configurations, but just have a look at the source
  and compare it to the version in 2.6.17. The difference (and the fact
  that the locking in 2.6.15-27.48 is broken) is obvious. *So* obvious in
  fact that I really wonder how this defective driver made its way into
  dapper security... (not to mention the question of whether it is really
  necessary to deliver non-security-related (and obviously not thoroughly
  tested) driver updates in a security update to an LTS version targeted at
  server use at all!)
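
To illustrate the failure mode described above, here is a minimal
user-space sketch in C. It is not the skge source. It assumes that the
lock primitives adjust a preemption counter, as the kernel's spinlocks do
on preemption-enabled configurations, and it models an interrupt handler
whose error branch calls an unlock that has no matching lock. All names
(toy_spin_lock, toy_spin_unlock, toy_intr_broken, toy_intr_fixed,
toy_run) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's preempt_count; NOT the kernel API. */
int toy_preempt_count = 0;

/* Assumption: taking a lock raises the counter, releasing it lowers it. */
void toy_spin_lock(void)   { toy_preempt_count++; }
void toy_spin_unlock(void) { toy_preempt_count--; }

enum toy_irq { TOY_IRQ_DATA, TOY_IRQ_ERROR };

/* Models the reported defect: only the error branch touches the lock,
 * and it unlocks without ever having locked. */
void toy_intr_broken(enum toy_irq cause)
{
    if (cause == TOY_IRQ_DATA) {
        /* Main branch: no lock operations, so the counter stays balanced. */
    } else {
        /* Error branch: stray unlock with no matching lock. */
        toy_spin_unlock();
    }
}

/* Models the workaround from the report: the lock is removed entirely. */
void toy_intr_fixed(enum toy_irq cause)
{
    (void)cause; /* neither branch touches the counter */
}

/* Deliver the given numbers of data and error interrupts to a handler
 * and return the resulting counter value (0 means balanced). */
int toy_run(void (*handler)(enum toy_irq), int data_irqs, int error_irqs)
{
    int i;

    assert(handler != NULL);
    toy_preempt_count = 0;
    for (i = 0; i < data_irqs; i++)
        handler(TOY_IRQ_DATA);
    for (i = 0; i < error_irqs; i++)
        handler(TOY_IRQ_ERROR);
    return toy_preempt_count;
}
```

In this model, toy_run(toy_intr_broken, 1000, 0) returns 0, while
toy_run(toy_intr_broken, 1000, 3) returns -3: the counter drifts only
when the error branch runs. That is consistent with the observation that
machines with few transmission errors appeared unaffected.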

-- 
skge driver broken: invalid call to spin_unlock causes system crash
https://launchpad.net/bugs/65631