Notes on fdatasync for the test suite

John Arbash Meinel john at arbash-meinel.com
Fri Aug 26 14:06:48 UTC 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I've been feeling the pain of how slow our test suite has been lately,
and I figured I'd poke around a bit if I could.

Basically, our test suite is now running about 3.5hrs per run, where
back in May it was running 45min per run. I don't have concrete
evidence of the cause. It seems to be something in 2.4, but it may/may
not be the fdatasync change. I see some runs before fdatasync landed
that were 2hrs, though that could be human-submission delay.

I'm planning on looking into this a bit more, since it is actually a
4.5x slowdown. And it has started showing up significantly in our
process.



First off, we got a nice backlog of tests the last couple of days, so
it can be clear how long the test suite takes by looking at the
date-stamps in my email.



08-24 16:16 6039
08-24 19:13 6094 3h
08-24 22:33 6095 3h20m
08-25 01:55 6096 3h20m
08-25 05:21 6097 3h25m
08-25 07:42 6098 2h20m
08-25 09:51 6099 2h09m
08-25 12:33 6100 2h40m
08-25 16:04 6101 3h30m
08-25 23:10 6102 7h06m   probably 2 test runs (first failed)
08-26 02:51 6103 3h40m
08-26 11:54 6104 9h      probably not backlogged


The first thing that strikes me is the large variation. We go from
3+hrs per test run at night, down to 2+hrs/test run in the morning.
The revnos indicate that they were all done on 'trunk', so it isn't a
2.3vs2.4 thing.

Some more times from other days
08-19 14:36 4770         # 2.0 test suite, IIRC runs 2 times for LANG=C
08-19 17:46 6083 3h10m
08-19 22:27 4882 4h45m   # 2.1 test, confirmed not to have LANG=C tests
08-20 01:44 6084 3h15m
08-20 05:12 6085 3h30m
08-20 07:22 6086 2h10m
08-20 11:02 5135 3h40m   # 2.2 test suite
08-20 13:49 6087 2h30m
08-20 14:31 5660 0h42m   # 2.3 test suite ??????, confirmed by the
			 # Revision.timestamps
08-20 16:41 6088 2h10m
08-20 22:36 6036 3h50m   # 2.4
08-21 00:42 6089 2h06m

08-21 13:35 6090
08-21 15:40 6037 2h05m   # 2.4
08-21 19:21 6091 3h40m

The almost scary one is that the 2.3 test suite completed in 42min. I
really don't know what was going on there.

My mailbox only goes back to May 2011, as I've otherwise been deleting
these messages. But back then we have
05-04 02:49 5817
05-04 04:29 5818 2h40m
05-04 11:41 5819 7h12m   # 7h/2 = 3h36m
05-04 14:10 5820 2h30m
05-04 18:17 5821 4h07m
05-04 19:38 5822 1h21m    ?
05-04 21:07 5823 1h29m
05-04 21:55 5824 0h48m
05-04 22:44 5825 0h49m
05-05 00:17 5826 1h33m
05-05 01:04 5827 0h47m
05-05 01:53 5828 0h49m
05-05 02:41 5829 0h48m
05-05 05:14 5830 2h33m   # jelmer was staying up way too late
			 # or there were 2 previous failures


The huge thing I see here is a test suite that is running in 50
minutes instead of more than 3 hours. You can potentially discount the
3-hour runs earlier in the day as not being a backlog, and just being
the delay between submissions. Which would also be consistent with the
time of day. 14UTC is about 4pm in Europe.



I thought that maybe Martin's fdatasync patch could have contributed.
I did a controlled test of just a subset
"bt.per_repository.test_commit". The numbers were:

319.8s	comment out config lookup and force flags to False
322.9s  change osutils.fdatasync to just return
338.5s  unchanged bzr.dev
377.0s  add
  config.GlobalStack.set('repository.fdatasync', False)
  config.GlobalStack.set('dirstate.fdatasync', False)
  In overrideEnvironmentForTesting
  My guess is that the overhead of locking 2 times, and having all
  tests that read files actually have to read and parse the config is
  what slowed this down.

Note that in bulk, the fdatasync seems to account for 339-320/339 =
~5%. That is certainly nothing like 2-3x that we are seeing. Martin's
change landed around July 27th.

However, notice that the original times I'm reporting are also for
bzr-2.1/2/3, which *don't* have Martin's patch.  Though we did have a
40min run of a 2.3 branch.


Here are some runs from July:

07-14 15:11 5655        # 2.3
07-14 15:52 6025 0h41m  # 2.4
07-14 19:14 5656
07-14 20:16 5657 1h02m  # 2.3.5
07-15 11:16 6018        # 2.4
07-15 12:35 6027 1h19m
07-15 13:42 6019 1h07m  # 2.4
07-15 16:13 6028 2h30m  # failed?
07-15 17:08 6029 0h53m

07-25 05:36 6043
07-25 08:53 6022 3h17m  # 2.4
07-25 13:08 6044
07-25 16:01 6045 2h53m

07-27 05:03 6046 1h02m
07-27 07:05 6023 2h02m  # 2.4, this is the fdatasync patch ###

08-01 07:47 5659        # 2.3
08-01 10:36 6025 2h45m	# 2.4, merging up 5659, probably human delay

08-05 18:51 6052
08-05 21:18 6053 2h27m
08-06 03:29 6026
08-06 05:52 6054 2h23m

08-09 11:04 6058
08-09 14:34 6059 3h34m
08-09 19:04 6060 4h30m
08-09 22:17 6027 3h21m # 2.4

08-11 14:20 6030
08-11 17:22 6031 3h02m
08-11 20:50 6061 3h28m


I haven't been able to go through all the logs, and commit/email
timestamps are have a fuzzy relationship with test suite times. (It
can never be faster than the test suite, but it can lag because of
human actions, etc.)

However, I haven't been able to find a confirmed 2.4/trunk run since
Martin's patch landed that is faster than 2hr. And often 3hr+. Which
is a very far cry from the 45min it used to be. However, also note
that 2 days before Martin's patch landed, we already had a couple
2-3hr runs.

Maybe it was something else, like ConfigStack causing us to read files
2-3x, or the Inventory.filter using set() that caused the regression.

I'm running an SSD now, so it could be fdatasync but it only has a big
effect when you have a spinning disk.

Anyway, I plan to try to do some test-suite time bisection on Monday
if I can.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5XqHgACgkQJdeBCYSNAAO5ywCfS1iB6ZLQmG9jUlhyLtO+y3nT
QO0AmQEPpTc0g6+OskIks494ADbVmU7N
=Qfa1
-----END PGP SIGNATURE-----



More information about the bazaar mailing list