[Bug 568616] Re: random silent corruption of TCP data
Bogdan Butnaru
bogdanb+launchpad at gmail.com
Thu Apr 22 19:05:42 UTC 2010
** Description changed:
- bug bug bug
+ Hello! I’m having a very strange problem.
+
+ I’m the proud reporter of bug #554749, and I think I found something
+ that might explain it. The short of that bug is that I’m using SSHFS to
+ mount some shares from my server on my desktop; randomly (a few times
+ each day) something goes wrong, and every program using that mount-point
+ freezes. (I have to do a complex evil ritual to re-mount it without
+ rebooting the computer.) While trying to debug it I discovered some
+ occasional “Corrupted MAC on input” errors. I googled a bit for it,
+ without much success; anyway, a post somewhere suggested I check for
+ network corruption with netcat.
+
+ So, I cat’ed together two movie files, obtaining a 1.4 GB file filled
+ with mostly random data. And I started shuttling it between the two
+ computers, using netcat (via the default TCP). I did a dozen transfers,
+ and exactly one of them was corrupted (the second, actually).
+ Interestingly, the corruption was exactly 128 bytes long; the replaced
+ data doesn’t have any obvious relationship to what was there originally.
+
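The comparison step described above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used for the report: the function name `diff_ranges` and the file names in the usage note are hypothetical, and both inputs are assumed to be regular files. It reads two copies of a file in chunks and reports each run of differing bytes as an (offset, length) pair, which is how a corrupted span like the 128-byte one would show up.

```python
import os

CHUNK = 1 << 20  # read 1 MiB at a time so a 1.4 GB file never sits in memory


def diff_ranges(path_a, path_b, chunk=CHUNK):
    """Return a list of (offset, length) runs where two regular files differ.

    Hypothetical helper for illustration; a size mismatch is reported as one
    differing run covering the tail of the longer file.
    """
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    common = min(size_a, size_b)

    ranges = []
    start = None  # offset where the current run of differing bytes began
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while offset < common:
            want = min(chunk, common - offset)
            a = fa.read(want)
            b = fb.read(want)
            if a == b:
                # Fast path: identical chunk closes any run still open.
                if start is not None:
                    ranges.append((start, offset - start))
                    start = None
            else:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        ranges.append((start, offset + i - start))
                        start = None
            offset += want

    end = max(size_a, size_b)
    if end > common and start is None:
        start = common  # the extra tail of the longer file counts as a difference
    if start is not None:
        ranges.append((start, end - start))
    return ranges
```

For example, `diff_ranges("sent.bin", "received.bin")` on a transfer corrupted as described would be expected to return one pair, or a few neighbouring pairs, spanning 128 bytes; replacement bytes that happen to equal the originals split a run.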
+ According to ifconfig,
+
+ bogdanb@mabelode:~/tests$ ifconfig eth0 | grep errors
+     RX packets:9487952 errors:0 dropped:0 overruns:0 frame:0
+     TX packets:6132714 errors:0 dropped:0 overruns:0 carrier:2
+ bogdanb@tanelorn:~/tests$ ifconfig eth0 | grep errors
+     RX packets:149100044 errors:0 dropped:0 overruns:0 frame:0
+     TX packets:135620981 errors:0 dropped:0 overruns:0 carrier:0
+
+ there haven’t been any transmission errors, so it is _really_ unlikely
+ that this is line noise that happened to slip undetected past the TCP
+ checksum. The length of the corrupted run, exactly 128 bytes, is also suspicious.
+
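To make the checksum argument concrete, here is a minimal Python sketch of the 16-bit ones'-complement Internet checksum described in RFC 1071, which is the algorithm TCP uses. The function name `internet_checksum` is hypothetical, and the sketch covers only a payload: a real TCP checksum also covers the TCP header and a pseudo-header. Because the result is only 16 bits wide, a random replacement of data has roughly a 1 in 65,536 chance of leaving it unchanged, and some structured changes, such as swapping two aligned 16-bit words, never change it.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum as described in RFC 1071.

    Illustration only: covers just the bytes passed in, not the TCP
    header or pseudo-header that a real TCP checksum includes.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes(range(128))

# Swapping two aligned 16-bit words changes the data but not the sum,
# because addition does not depend on the order of the words.
swapped = payload[2:4] + payload[0:2] + payload[4:]

# Changing a single bit changes the sum, so this corruption is detected.
flipped = bytes([payload[0] ^ 0x01]) + payload[1:]
```

Comparing `internet_checksum(swapped)` and `internet_checksum(flipped)` with `internet_checksum(payload)` shows the first matching and the second differing.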
+ I suspect a small bug in one of the routines that move data between
+ the NIC’s buffer and the application’s. I have no idea how to debug this
+ further; please help!
+
+
+ A few more notes:
+ *) all this happens via Ethernet; the two computers are both linked to a switch with short cables. Anyway, given the above, it doesn’t look like line errors.
+ *) the server runs Karmic, the desktop runs Lucid.
+ *) I’ve had similar (but not identical) problems with SSHFS ever since I’ve had these two computers (around Feisty, I think); it’s likely that whatever is causing the corruption has been there since the beginning, but the way SSHFS handles occurrences of the bug has changed.
+ *) whatever it is, it’s very random. As the test showed, I got a single error after 2 GB, then no other error for the next 15 GB of transferred files. However, the SSHFS error (which I’m pretty sure is caused by this) sometimes happens after 15 minutes, and sometimes I have no problems for a full day.
+ *) I tried reporting this with ubuntu-bug, but Launchpad timed out on me several times in a row. Please tell me whatever information you think I should add.
--
random silent corruption of TCP data
https://bugs.launchpad.net/bugs/568616
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to linux in ubuntu.