[Bug 568616] Re: random silent corruption of TCP data
Bogdan Butnaru
bogdanb+launchpad at gmail.com
Wed Apr 28 22:20:25 UTC 2010
apport information
** Tags added: apport-collected
** Description changed:
Hello! I’m having a very strange problem.
I’m the proud reporter of bug #554749, and I think I found something
that might explain it. The short of that bug is that I’m using SSHFS to
mount some shares from my server on my desktop; randomly (a few times
each day) something goes wrong, and every program using that mount-point
freezes. (I have to do a complex evil ritual to re-mount it without
rebooting the computer.) While trying to debug it I discovered some
occasional “Corrupted MAC on input” errors. I googled a bit for it,
without much success; anyway, a post somewhere suggested I check for
network corruption with netcat.
So, I cat’ed together two movie files, obtaining a 1.4 GB file filled
with mostly random data. And I started shuttling it between the two
computers, using netcat (via the default TCP). I did a dozen transfers,
and exactly one of them was corrupted (the second, actually).
Interestingly, the corruption was exactly 128 bytes long; the replaced
data doesn’t have any obvious relationship to what was there originally.
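For anyone repeating the test, the sent and received copies can be compared to find the offset and length of each corrupted run. The following is only a minimal sketch under assumptions: the function name diff_runs and the chunk size are made up for illustration, and it is not necessarily how the 128-byte figure above was measured.

```python
def diff_runs(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (offset, length) runs where the two files differ.

    Hypothetical helper: compares only up to the length of the shorter file.
    """
    runs = []
    start = None  # offset where the current differing run began, if any
    offset = 0    # offset of the current chunk within the files
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            n = min(len(a), len(b))
            if a[:n] != b[:n]:
                # Slow path: walk the chunk byte by byte to find run edges.
                for i in range(n):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        runs.append((start, offset + i - start))
                        start = None
            elif start is not None:
                # Identical chunk: a run that reached the previous chunk's
                # end stops exactly at this chunk boundary.
                runs.append((start, offset - start))
                start = None
            offset += n
            if len(a) != len(b):
                break  # files differ in length; stop at the shorter one
    if start is not None:
        runs.append((start, offset - start))
    return runs
```

Calling diff_runs("sent.bin", "received.bin") (both file names are placeholders) returns a list such as [(offset, 128)] for a single 128-byte replacement. If a replaced byte happens to equal the original byte, the run is reported as two shorter runs, so adjacent runs should be read together.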
According to ifconfig,
bogdanb@mabelode:~/tests$ ifconfig eth0 |grep errors
RX packets:9487952 errors:0 dropped:0 overruns:0 frame:0
TX packets:6132714 errors:0 dropped:0 overruns:0 carrier:2
bogdanb@tanelorn:~/tests$ ifconfig eth0|grep errors
RX packets:149100044 errors:0 dropped:0 overruns:0 frame:0
TX packets:135620981 errors:0 dropped:0 overruns:0 carrier:0
there haven’t been any transmission errors, so this being just something
that randomly passed undetected through the TCP checksum is _really_
unlikely. There’s also the suspicious length of the error.
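To illustrate why a chance pass is unlikely: the TCP checksum is a 16-bit one's-complement sum, so a run of unrelated replacement bytes would match it only about once in 65536 tries. The following is a minimal sketch in the style of RFC 1071; it is a simplification, since the real TCP checksum also covers a pseudo-header and the TCP header, and the function name is an assumption made for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data, in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Worked example from RFC 1071: the words 0001 f203 f4f5 f6f7.
sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)

# Reordering whole 16-bit words does not change the sum, which is one
# kind of corruption this checksum cannot detect.
reordered = bytes.fromhex("f2030001f4f5f6f7")
same_after_reorder = internet_checksum(reordered) == checksum
```

The checksum is weak against structured errors such as reordered words, but 128 bytes replaced by unrelated data is not that kind of error, which is why corruption after the checksum was verified (or before it was computed) seems the more plausible explanation.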
I suspect a small bug in one of the routines that shuttle data between
the NIC’s buffer and the application’s. I’ve no idea how to debug this
further; please help!
A few more notes:
*) all this happens via Ethernet; the two computers are both linked to a switch with short cables. Anyway, given the above, it doesn’t look like line errors.
*) the server runs Karmic, the desktop runs Lucid.
*) I’ve had similar (but not identical) problems with SSHFS ever since I’ve had these two computers (around Feisty, I think); it’s likely that whatever is causing the corruption has been there since the beginning, and only the way SSHFS handles occurrences of the bug has changed.
*) whatever it is, it’s very random. As the test showed, I got a single error after 2 GB, then no other error for the next 15 GB of transferred files. However, the SSHFS error (which I’m pretty sure is caused by this) sometimes happens after 15 minutes, and sometimes I have no problems for a full day.
*) I tried reporting this with ubuntu-bug, but Launchpad timed out on me several times in a row. Please tell me whatever information you think I should add.
+
+
+
+ ---
+ AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
+ Architecture: amd64
+ AudioDevicesInUse:
+ Cannot stat file /proc/19634/fd/3: Transport endpoint is not connected
+ USER PID ACCESS COMMAND
+ /dev/snd/controlC1: bogdanb 1604 F.... pulseaudio
+ /dev/snd/controlC0: bogdanb 1604 F.... pulseaudio
+ /dev/snd/pcmC0D0p: bogdanb 1604 F...m pulseaudio
+ CRDA: Error: [Errno 2] No such file or directory
+ Card0.Amixer.info:
+ Card hw:0 'Intel'/'HDA Intel at 0xf9ff8000 irq 22'
+ Mixer name : 'Realtek ALC1200'
+ Components : 'HDA:10ec0888,104382fe,00100101'
+ Controls : 40
+ Simple ctrls : 22
+ Card1.Amixer.info:
+ Card hw:1 'Headset'/'Logitech Logitech Wireless Headset at usb-0000:00:1d.0-2, full speed'
+ Mixer name : 'USB Mixer'
+ Components : 'USB046d:0a12'
+ Controls : 4
+ Simple ctrls : 2
+ DistroRelease: Ubuntu 10.04
+ EcryptfsInUse: Yes
+ Frequency: Once a day.
+ HibernationDevice: RESUME=/dev/sdb2
+ IwConfig:
+ lo no wireless extensions.
+
+ eth0 no wireless extensions.
+ MachineType: System manufacturer P5Q-PRO
+ NonfreeKernelModules: nvidia
+ Package: linux (not installed)
+ ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-21-generic root=/dev/sda1 ro nomodeset
+ ProcEnviron:
+ LANGUAGE=en_US:en
+ PATH=(custom, user)
+ LANG=en_US.UTF-8
+ SHELL=/bin/bash
+ ProcVersionSignature: Ubuntu 2.6.32-21.32-generic 2.6.32.11+drm33.2
+ Regression: Yes
+ RelatedPackageVersions: linux-firmware 1.34
+ Reproducible: No
+ RfKill:
+
+ Tags: lucid networking regression-potential needs-upstream-testing
+ Uname: Linux 2.6.32-21-generic x86_64
+ UserAsoundrc:
+ # ALSA library configuration file
+
+ # Include settings that are under the control of asoundconf(1).
+ # (To disable these settings, comment out this line.)
+ </home/bogdanb/.asoundrc.asoundconf>
+ UserGroups: adm admin audio cdrom dialout floppy fuse lpadmin netdev plugdev sambashare scanner staff video
+ WpaSupplicantLog:
+
+ dmi.bios.date: 11/04/2008
+ dmi.bios.vendor: American Megatrends Inc.
+ dmi.bios.version: 1501
+ dmi.board.asset.tag: To Be Filled By O.E.M.
+ dmi.board.name: P5Q-PRO
+ dmi.board.vendor: ASUSTeK Computer INC.
+ dmi.board.version: Rev 1.xx
+ dmi.chassis.asset.tag: Asset-1234567890
+ dmi.chassis.type: 3
+ dmi.chassis.vendor: Chassis Manufacture
+ dmi.chassis.version: Chassis Version
+ dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1501:bd11/04/2008:svnSystemmanufacturer:pnP5Q-PRO:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP5Q-PRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
+ dmi.product.name: P5Q-PRO
+ dmi.product.version: System Version
+ dmi.sys.vendor: System manufacturer
** Attachment added: "AlsaDevices.txt"
http://launchpadlibrarian.net/46071511/AlsaDevices.txt
--
random silent corruption of TCP data
https://bugs.launchpad.net/bugs/568616
You received this bug notification because you are a member of Kernel
Bugs, which is subscribed to linux in ubuntu.