[Bug 1155556] Re: HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp server. DL380 G6 succeeds.

Nick Moffitt nick.moffitt at canonical.com
Fri Mar 15 17:42:40 UTC 2013


Right, so after a day spent with Daviey and a bunch of 30MB pcap files,
we think we've figured this out.

The crucial exchange, the one that failed, happens here:


 7418	112.051626	10.55.200.99	10.55.200.1	TFTP	Read Request, File: amd64/generic/quantal/commissioning/initrd.gz, Transfer type: octet, tsize\000=0\000, blksize\000=1408\000
 7419	112.053444	10.55.200.1	10.55.200.99	TFTP	Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7420	113.053489	10.55.200.1	10.55.200.99	TFTP	Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7423	116.053542	10.55.200.1	10.55.200.99	TFTP	Option Acknowledgement, tsize\000=18988167\000, blksize\000=1400\000
 7425	116.832761	10.55.200.99	10.55.200.1	TFTP	Acknowledgement, Block: 0

The client requests the initrd, but something in the firmware or
pxelinux itself hangs for almost five seconds.  During that time, the
maas tftpd sends three OACKs (option acknowledgements) and then times
out.  By the time the client sends the ACK-0 to start the data transfer,
the session state has been discarded, and the tftpd just logs the
exception as an OOPS and waits for the next session to start.
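
For anyone who wants to check the timing without opening the pcap, the
gaps can be worked out from the five packets quoted above.  This is only
arithmetic on those timestamps; the variable names are mine and nothing
here comes from the tftpd code.

```python
# Timestamps (seconds) copied from the capture excerpt above.
rrq = 112.051626                               # client's Read Request
oacks = [112.053444, 113.053489, 116.053542]   # server's three OACKs
ack0 = 116.832761                              # client's ACK for block 0

# Gap between each OACK and the next one.
gaps = [round(b - a, 3) for a, b in zip(oacks, oacks[1:])]
# How long the client took to answer its own request.
stall = round(ack0 - rrq, 3)
# How long after the last OACK the client's ACK finally showed up.
late_by = round(ack0 - oacks[-1], 3)

print(gaps)     # [1.0, 3.0]
print(stall)    # 4.781
print(late_by)  # 0.779
```

So the OACKs go out one and then three seconds apart, and the client's
ACK-0 arrives about 0.78 seconds after the last of them.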

Incidentally, we spent a lot of time correlating requested and actual
block sizes between this tftpd and the HPA tftpd.  That turned out to be
a red herring, of course, but it seemed like a compelling lead for a
while.  The solution did come from a comparison to tftpd-hpa, though.

In a few places in tftp/bootstrap.py and tftp/session.py there are
timeout tuples set to (1, 3, 7).  The iterable is consumed by the
watchdog code every time a packet is sent out, and once the iterable is
empty the watchdog tells the state machine to give up on the request.
We never dug too far into the units or where in the conversation these
values are read, but the fact that there are three entries in the tuple
and that the daemon gave up after three OACKs is a compelling
coïncidence.

The tftpd-hpa code tries six times, waiting one second each:

    <Daviey> Spads: #define TIMEOUT 1000000         /* Default timeout (us) */
    <Daviey> #define TRIES   6               /* Number of attempts to send each packet */
    <Daviey> #define TIMEOUT_LIMIT ((1 << TRIES)-1)

Extending the tuple at line 346 of bootstrap.py solved this situation
for us, and the maas tftpd succeeded just as tftpd-hpa did.  In the end
we settled on:

    class RemoteOriginReadSession(TFTPBootstrap):
        """Bootstraps a L{ReadSession}, that was started remotely, - we've received
        a RRQ.

        """
        timeout = (1, 1, 1, 1, 1, 1)

...as this more closely mimics what Daviey found in the tftpd-hpa
source.
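
To make the difference between the two tuples concrete, here is a rough
model of the retransmission schedule.  It is a guess fitted to the
capture, not the tx-tftp watchdog code: it assumes one send per tuple
entry, with each entry except the last being the wait in seconds before
the next send.  send_offsets and still_retrying are hypothetical helper
names, and 4.78 is the delay between the first OACK and the client's
ACK-0 in the capture.

```python
from itertools import accumulate


def send_offsets(timeouts):
    """Offsets in seconds at which the packet goes out, first send at 0.

    ASSUMPTION: one send per tuple entry, and every entry but the last
    is the wait before the next retransmission.
    """
    return list(accumulate((0,) + tuple(timeouts[:-1])))


def still_retrying(timeouts, reply_at):
    """True if a retransmission is still scheduled after reply_at."""
    return any(offset > reply_at for offset in send_offsets(timeouts))


reply_at = 4.78  # ACK-0 arrived roughly this long after the first OACK

for timeouts in [(1, 3, 7), (1, 1, 1, 1, 1, 1)]:
    print(timeouts, send_offsets(timeouts), still_retrying(timeouts, reply_at))
```

Under that assumption the old tuple sends at 0, 1 and 4 seconds, which
matches the capture, and has nothing left to send when the ACK-0 turns
up.  The new tuple still has a retransmission due at 5 seconds, so the
session has to be alive when the client finally answers.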

This timeout tuple appears in a few places, so any adjustments to this
code should probably be made to all of the timeout iterables in
bootstrap.py and session.py.
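
A quick way to list them all before editing is something like the
following.  It is a sketch that assumes every one of them is written as
a plain "timeout = (...)" assignment of integers, as in the class above;
find_timeouts is a made-up helper and the sample text only stands in for
the real files.

```python
import re

# Matches assignments such as "timeout = (1, 3, 7)" at the start of a line.
TIMEOUT_RE = re.compile(r"^\s*timeout\s*=\s*\(([^)]*)\)", re.MULTILINE)


def find_timeouts(source):
    """Return (line_number, values) for each timeout tuple in source."""
    found = []
    for match in TIMEOUT_RE.finditer(source):
        # Count newlines up to the tuple body to get a 1-based line number.
        line_no = source.count("\n", 0, match.start(1)) + 1
        values = tuple(int(v) for v in match.group(1).split(",") if v.strip())
        found.append((line_no, values))
    return found


# Stand-in for the contents of bootstrap.py or session.py.
sample = '''class A:
    timeout = (1, 3, 7)

class B:
    timeout = (1, 3, 7)
'''

result = find_timeouts(sample)
print(result)
```

Running find_timeouts over the text of bootstrap.py and session.py in
turn should give the line numbers that need the same change.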

Finally, while it's true that this seems to be a workaround for a fault
on the client side (whether the fault is in firmware or in pxelinux.0 I
can't say), I believe it is also a regression against the maas in
precise, which used cobbler.

-- 
You received this bug notification because you are a member of Ubuntu
Server Team, which is subscribed to python-tx-tftp in Ubuntu.
https://bugs.launchpad.net/bugs/1155556

Title:
  HP ProLiant DL380 G7 tftps kernel, but initrd tracebacks in tftp
  server.  DL380 G6 succeeds.

To manage notifications about this bug go to:
https://bugs.launchpad.net/maas/+bug/1155556/+subscriptions


