[Bug 65631] skge driver broken: invalid call to spin_unlock causes system crash
Alexander Schulze
schulze at mathematik.uni-kl.de
Thu Oct 12 08:32:35 UTC 2006
Public bug reported:
After upgrading to kernel 2.6.15-27.48, which includes version 1.5 of
the skge driver, we experienced network-related lockups on some of our
machines, beginning with some "badness" warnings, followed by
"scheduling while atomic" errors and finally a system crash (no response
to pings, local console dead). We finally found that the skge gigabit
ethernet driver was the cause. A comparison between the 1.5 version
included in 2.6.15-27.48 in dapper/security and the 1.5 version in
vanilla 2.6.17 (where it is supposed to be taken from) showed that the
dapper version contains additional calls to spin_unlock for the hw_lock
of the skge device (and *no* spin_lock calls for hw_lock at all!),
whereas the 2.6.17 vanilla version seems to have eliminated hw_lock
completely. I therefore think that something went wrong when
"transplanting" version 1.5 to 2.6.15-27.48, and the removal of hw_lock
was not done in all places.
To verify this analysis, we are currently running a skge.ko module
compiled from a modified source where we eliminated hw_lock and all
calls to spin_* corresponding to this lock (basically the 2.6.17 version
of the driver, but with the pci_device_id patches from dapper). We have
not yet seen lockups from this modified driver.
Why does the lockup occur only on a subset of our machines? A quick
glance at the 1.5 code in dapper shows another locking-related coding
error: The spin_unlock is only called in the second branch of the
interrupt service routine skge_intr that seems to handle transmission
errors, while the main branch, handling data I/O, seems to be correct.
Therefore, the error becomes visible only when the bad branch in the ISR
is executed, which seems to depend on the cabling to the machine and the
network load (and fortunately our server machines have better cabling
and were therefore unaffected by this bug!).
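To make the pattern easier to see, here is a rough userspace sketch of
it. This is a toy model and not the skge source: the lock is reduced to
a depth counter, and every toy_* name and TOY_IRQ_* bit is made up for
this illustration. It only shows why an unlock without a matching lock
in one ISR branch stays invisible until that branch actually runs.

```c
/* Toy userspace model, NOT kernel code. The "lock" is just a depth
 * counter: +1 per acquire, -1 per release, 0 means balanced. */
struct toy_lock {
    int depth;
};

static void toy_lock_release(struct toy_lock *l) { l->depth--; }

/* Hypothetical interrupt status bits, invented for this sketch. */
enum { TOY_IRQ_DATA = 1 << 0, TOY_IRQ_ERROR = 1 << 1 };

/* Shape of the defect: the data branch never touches the lock, but the
 * error branch releases a lock that was never acquired. */
static void toy_intr_broken(struct toy_lock *hw_lock, unsigned status)
{
    if (status & TOY_IRQ_DATA) {
        /* normal data I/O, no locking involved */
    }
    if (status & TOY_IRQ_ERROR) {
        /* error handling would go here */
        toy_lock_release(hw_lock);   /* stray unlock, no matching lock */
    }
}

/* Shape of the fix: all uses of the lock are gone. The parameter is
 * kept only so both handlers share one signature in this sketch. */
static void toy_intr_fixed(struct toy_lock *hw_lock, unsigned status)
{
    (void)hw_lock;
    if (status & TOY_IRQ_DATA) {
        /* normal data I/O */
    }
    if (status & TOY_IRQ_ERROR) {
        /* error handling, no unlock */
    }
}

/* Deliver the same interrupt status `times` times and return the final
 * lock depth. Anything other than 0 means lock/unlock are unbalanced. */
static int toy_run(void (*isr)(struct toy_lock *, unsigned),
                   unsigned status, int times)
{
    struct toy_lock lock = { 0 };
    for (int i = 0; i < times; i++)
        isr(&lock, status);
    return lock.depth;
}
```

With only data interrupts, toy_run() on the broken handler returns 0
however often it is called, which matches the machines that never
showed the problem. Each error interrupt pushes the depth one step
further below zero, while the fixed handler stays at 0 in both cases.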
So, when verifying this bug report, don't be surprised if you can't
reproduce it in many configurations, but just have a look at the source
and compare it to the version in 2.6.17. The difference (and the fact
that the locking in 2.6.15-27.48 is broken) is obvious. *So* obvious in
fact that I really wonder how this defective driver made its way into
dapper security... (leaving aside the question of whether it is really
necessary to ship non-security-related (and obviously not thoroughly
tested) driver updates in a security update to an LTS release targeted
at server use at all!)
** Affects: linux-source-2.6.15 (Ubuntu)
Importance: Undecided
Status: Unconfirmed
--
skge driver broken: invalid call to spin_unlock causes system crash
https://launchpad.net/bugs/65631