[RFC][BUG #226769] strace test side-effect hangs selftest
Vincent Ladeuil
v.ladeuil+lp at free.fr
Mon May 5 08:15:37 BST 2008
While I was submitting a merge request to pqm, selftest hung.
It turned out the submission revealed a hidden bug (probably
related to bug #103133).
I was able to reproduce it on gutsy, with bzr dev (and the patch
proposed at
http://bundlebuggy.aaronbentley.com/request/%3Cm24p9ml4l3.fsf@free.fr%3E
applied), with:
  python2.4 bzr selftest \
    bzrlib.tests.test_strace.TestStrace.test_strace_callable_is_called \
    bzrlib.tests.test_strace.TestStrace.test_strace_callable_result \
    bzrlib.tests.blackbox.test_serve.TestBzrServe.test_bzr_connect_to_bzr_ssh
If any of the three tests is omitted, the hang does not occur. IMHO,
that's sufficient to demonstrate the correlation between a test
(bzrlib.tests.blackbox.test_serve.TestBzrServe.test_bzr_connect_to_bzr_ssh)
that leaves a sub-process and probably one or two threads alive after
completion, and an invocation of strace on a process which is
itself the parent of the strace process, said process being
killed.
Researching strace and ptrace man pages (and bug #103133)
suggests that we may be trying to use strace in a configuration
that was not taken into account in its design.
Several ways to address the initial problem come to mind:
1) Disable strace tests for python2.4 (I wasn't able to reproduce
it with python 2.5 but it occurred on the pqm machine with
2.4.2 and I can reproduce it with 2.4.4)
2) Fix TestBzrServe.test_bzr_connect_to_bzr_ssh so that the
tearDown method does a better job of shutting down its server,
3) Implement a protection mechanism in selftest as proposed in
https://bugs.launchpad.net/bzr/+bug/69978/comments/2
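To make option 1 concrete, here is a minimal sketch of a version-based
skip. It is only an illustration: affected_by_strace_hang and
TestStraceSketch are hypothetical names, and unittest.skipIf comes from
the modern standard library, which python2.4 itself does not have, so
the real suite would need to use its own skip mechanism instead.

```python
import sys
import unittest


def affected_by_strace_hang(version_info=None):
    """Return True for the interpreter series where the hang was seen (2.4.x).

    Hypothetical helper: the 2.4-only condition reflects what was observed
    in this report (2.4.2 and 2.4.4 hang, 2.5 does not).
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == (2, 4)


class TestStraceSketch(unittest.TestCase):
    """Stand-in for the real strace tests, not the bzrlib test class."""

    @unittest.skipIf(affected_by_strace_hang(),
                     "strace tests can hang selftest under python2.4")
    def test_strace_callable_is_called(self):
        # The real test body would invoke strace here.
        pass
```

The check is kept in a separate function so that the condition can be
tightened later (for instance to specific 2.4 releases) without touching
each test.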
I was tempted to propose getting rid of strace entirely from bzr
core, maybe keeping it as a plugin, until strace itself is fixed. I'm
now convinced that the bug is in strace and that keeping these
tests in the suite exposes us to further hangs in the future.
Note that, once in a while, running the three tests above doesn't
hang the test suite.
The only plausible scenario I came up with to explain that is that
the ssh server sometimes shuts down before strace is called (yes,
the strace tests are run *after* the ssh tests), which means that
no process or threads are left for strace.
So, my proposal is:
a) Skip the tests if we run under python2.4
b) Fix 2 above (filing a new bug first),
c) Implement 3, although knowing that strace traps signals makes me
think that using alarm() may not be the best option. Maybe a
watchdog thread, handled by selftest itself, killing the test
if it runs longer than a pre-defined value (does 15 minutes
seem appropriate for a default? I'd hate false positives
here). Now, if sleep is actually implemented on top of SIGALRM...
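The watchdog idea in c) could look roughly like the sketch below. This
is an assumption-laden illustration, not selftest code: Watchdog and
run_with_watchdog are hypothetical names, it relies on threading.Timer
rather than alarm(), and the default action aborts the whole process
with os._exit() because a thread cannot cleanly kill a single test
stuck in the main thread. Whether the timer thread still gets to run
while the process is being traced by strace is untested.

```python
import os
import threading


class Watchdog(object):
    """Call on_timeout unless stop() is called within `timeout` seconds."""

    def __init__(self, timeout, on_timeout=None):
        if on_timeout is None:
            on_timeout = self._abort_process
        self._timer = threading.Timer(timeout, on_timeout)
        # Daemon thread: a pending watchdog never keeps the process alive.
        self._timer.daemon = True

    @staticmethod
    def _abort_process():
        # Assumed default: exit immediately, skipping cleanup handlers,
        # since the main thread is presumed to be hung.
        os._exit(3)

    def start(self):
        self._timer.start()

    def stop(self):
        self._timer.cancel()


def run_with_watchdog(test_callable, timeout=15 * 60, on_timeout=None):
    """Run test_callable, firing the watchdog if it exceeds `timeout`."""
    watchdog = Watchdog(timeout, on_timeout)
    watchdog.start()
    try:
        return test_callable()
    finally:
        watchdog.stop()
```

The 15 minute default mirrors the value floated above. Passing
on_timeout lets the runner log the name of the offending test before
giving up, instead of exiting silently.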
Feedback is much appreciated; this bug is blocking merges for:
http://bundlebuggy.aaronbentley.com/request/%3Cm24p9ml4l3.fsf@free.fr%3E
and
http://bundlebuggy.aaronbentley.com/request/%3Cm24p9rjm8z.fsf@free.fr%3E
Vincent