[Bug 65631] Re: skge driver broken: invalid call to spin_unlock causes system crash
Alexander Schulze
schulze at mathematik.uni-kl.de
Thu Oct 12 16:54:54 UTC 2006
I just verified that this defect in the skge driver is also present in
the current git tree for edgy (URL given below). It would be good if it
could be fixed before the release: shipping edgy with this bug would
prevent installation of edgy on affected machines (without special
workarounds), because accessing the network makes the kernel unstable
due to a wrong preempt_count value.
http://git.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-edgy.git;a=blob;h=a724fea2ad52a6cb6cdbdce13f22efca88d0e7d8;hb=f23a2bfe4a30cf528c1874957f51ab86e5bb7e27;f=drivers/net/skge.c
** Description changed:
After upgrading to kernel 2.6.15-27.48, which includes version 1.5 of
the skge driver, we experienced network related lockups on some of our
machines, beginning with some "badness" warnings, followed by
"scheduling while atomic" errors and finally a system crash (no response
- to pings, locale console dead). We finally found that the skge gigabit
+ to pings, local console dead). We finally found that the skge gigabit
ethernet driver was the cause. A comparison between the 1.5 version
included in 2.6.15-27.48 in dapper/security and the 1.5 version in
vanilla 2.6.17 (where it is supposed to be taken from) showed that the
dapper version contains additional calls to spin_unlock for the hw_lock
of the skge device (and *no* spin_lock calls for hw_lock at all!),
whereas the 2.6.17 vanilla version seems to have eliminated hw_lock
completely. I therefore think that something went wrong when
"transplanting" version 1.5 to 2.6.15-27.48, and the removal of hw_lock
was not done in all places.
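
To illustrate why an unlock without a matching lock is harmful, here is a
minimal userspace sketch. It is a toy model, not the driver code: every
name in it (toy_cpu, toy_spin_lock, toy_stray_unlock_path, ...) is
hypothetical, and it assumes a kernel built with preemption support,
where taking a spinlock raises preempt_count and releasing it lowers
preempt_count again.

```c
#include <stdio.h>

/* Toy userspace model, NOT kernel code; all names are hypothetical. */
struct toy_cpu {
    int preempt_count;   /* 0 means preemptible, nonzero means "atomic" */
};

/* Assumption: lock raises the count, unlock lowers it (preemptible kernel). */
static void toy_spin_lock(struct toy_cpu *cpu)   { cpu->preempt_count++; }
static void toy_spin_unlock(struct toy_cpu *cpu) { cpu->preempt_count--; }

/* Correctly paired critical section: the count returns to its old value. */
void toy_balanced_path(struct toy_cpu *cpu)
{
    toy_spin_lock(cpu);
    /* ... work done under the lock ... */
    toy_spin_unlock(cpu);
}

/* The pattern described above: an unlock with no matching lock. */
void toy_stray_unlock_path(struct toy_cpu *cpu)
{
    /* no toy_spin_lock() on this path */
    toy_spin_unlock(cpu);
}

/* Simplified stand-in for the kernel's "are we in atomic context?" check. */
int toy_in_atomic(const struct toy_cpu *cpu)
{
    return cpu->preempt_count != 0;
}

/* Print the state so the drift is visible when run by hand. */
void toy_report(const char *label, const struct toy_cpu *cpu)
{
    printf("%s: preempt_count=%d atomic=%d\n",
           label, cpu->preempt_count, toy_in_atomic(cpu));
}
```

Calling toy_balanced_path() on a zeroed toy_cpu leaves the count at 0,
while a single call to toy_stray_unlock_path() leaves it at -1. In this
simplified model the count is then nonzero for good, which would be
consistent with the "scheduling while atomic" errors we saw.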
To verify this analysis, we are currently running a skge.ko module
compiled from a modified source where we eliminated hw_lock and all
calls to spin_* corresponding to this lock (basically the 2.6.17 version
of the driver, but with the pci_device_id patches from dapper). We have
not yet seen lock-ups from this modified driver.
Why does the lockup occur only on a subset of our machines? A quick
glance at the 1.5 code in dapper shows another locking-related coding
error: The spin_unlock is only called in the second branch of the
interrupt service routine skge_intr that seems to handle transmission
errors, while the main branch, handling data I/O, seems to be correct.
Therefore, the error becomes visible only when the bad branch in the ISR
is executed, which seems to depend on the cabling to the machine and the
network load (and fortunately our server machines have better cabling
and were therefore unaffected by this bug!).
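
The dependence on cabling can be sketched with another small userspace
model. Again, this is only an illustration under stated assumptions: the
status bits TOY_IRQ_DATA and TOY_IRQ_ERROR, the toy_dev structure and
the toy_intr function are hypothetical and do not reproduce the real
skge_intr code. The only property modelled is that the error branch
performs an unmatched unlock while the data branch does not.

```c
#include <stdint.h>

/* Hypothetical status bits; the real skge interrupt bits are not used here. */
#define TOY_IRQ_DATA  0x1u
#define TOY_IRQ_ERROR 0x2u

struct toy_dev {
    int preempt_count;            /* modelled lock/preemption balance */
    unsigned long data_handled;   /* interrupts that took the data branch */
    unsigned long errors_handled; /* interrupts that took the error branch */
};

/* Toy interrupt handler: only the error branch disturbs the balance. */
void toy_intr(struct toy_dev *dev, uint32_t status)
{
    if (status & TOY_IRQ_DATA) {
        dev->data_handled++;      /* main branch: no lock calls at all */
    }
    if (status & TOY_IRQ_ERROR) {
        dev->errors_handled++;
        dev->preempt_count--;     /* stray unlock: no matching lock */
    }
}

/* Feed a sequence of interrupt status words and return the final balance. */
int toy_run(struct toy_dev *dev, const uint32_t *events, int n)
{
    for (int i = 0; i < n; i++)
        toy_intr(dev, events[i]);
    return dev->preempt_count;
}
```

A machine that only ever sees TOY_IRQ_DATA keeps a balance of 0 no
matter how much traffic it handles, whereas every event with
TOY_IRQ_ERROR set lowers the balance by one. That matches our
observation that well-cabled machines never showed the problem.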
So, when verifying this bug report, don't be surprised if you can't
reproduce it in many configurations, but just have a look at the source
and compare it to the version in 2.6.17. The difference (and the fact
that the locking in 2.6.15-27.48 is broken) is obvious. *So* obvious in
fact that I really wonder how this defective driver made its way into
dapper security... (not to mention the question of whether it is really
necessary to ship driver updates that are not security-related, and
obviously not thoroughly tested, in a security update to an LTS version
targeted at server use at all!)
--
skge driver broken: invalid call to spin_unlock causes system crash
https://launchpad.net/bugs/65631