this week in ~canonical-bazaar
Martin Pool
mbp at canonical.com
Wed Oct 26 08:57:38 UTC 2011
On 25 October 2011 21:46, John Arbash Meinel <john at arbash-meinel.com> wrote:
>> poolie: * went to the codecon camp; was pretty interesting; have an
>> idea for a tiny performance testing program in
>> <https://launchpad.net/judge>
>
> One interesting thought if you want to work on the stats. If you get a
> run that isn't statistically significant after 10 runs, you could
> estimate the 'power' of your test, and determine what N it would take
> to be able to determine the statistical significance.
> http://en.wikipedia.org/wiki/Sample_size
I think that could be interesting, though the idea of looking at the early
data to decide whether to do more tests rings a bit of a 'statistically
dangerous' alarm bell for me: if you just happen to get certain patterns in
the early runs, you may stop there when really you should keep taking more.
Another idea we could borrow from timeit is to say we can tolerate the
measurement running for, say, a minute or an hour, and then estimate how
many runs are reasonable in that time.
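As a rough sketch of that time-budget idea (this is not anything judge does
today; runs_within_budget and run_once are just hypothetical names):

import time

def runs_within_budget(run_once, budget_seconds, warmup=3):
    # Time a few warm-up runs, then divide the remaining budget by the
    # mean observed cost per run to estimate how many runs will fit.
    start = time.time()
    for _ in range(warmup):
        run_once()
    per_run = max((time.time() - start) / warmup, 1e-9)
    remaining = budget_seconds - per_run * warmup
    return max(1, int(remaining / per_run))
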
> Basically, given the observed averages and the observed standard
> deviation, how many samples would you need to get statistical
> significance.
>
> The actual wiki link doesn't give quite the right inversion of your
> ttest_ind, but maybe the stats package would have it?
On the whole I think it's enough to just give the actual difference in
means, if we believe it's significant. I see that's the same approach, and
I think the same formula, that phk's ministat uses, which gives me some
confidence.
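For what it's worth, a minimal sketch of that approach using scipy's
two-sample t-test (this is just the shape of it, not judge's actual code;
the message format is made up):

from scipy import stats

def compare(times_a, times_b, alpha=0.05):
    # Report the difference in means only when the t-test calls it
    # significant; units follow whatever the inputs are in (ms here).
    t_stat, p = stats.ttest_ind(times_a, times_b)
    mean_a = sum(times_a) / len(times_a)
    mean_b = sum(times_b) / len(times_b)
    if p < alpha:
        return "difference of %+.3fms is significant (p=%.3f)" % (mean_b - mean_a, p)
    return "difference is probably not significant (p=%.3f)" % p
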
> As a very quick example, you would like the 95% confidence interval of
> the mean to not include the other mean. Which uses:
>
> n = 16σ^2/W^2
>
> So if program A takes 10s, and program B takes 12s, and you have a
> standard deviation of 2s (for these purposes assume the variation is
> the same for both programs.) Then you want W to be <=2, and your σ=2,
> meaning you need 16*4/4 = 16 samples.
So the question is, what would we like to set W to be?
One option would be to make it, say, 2% of the mean of the old program, so
that we think the sample mean will be within 2% of the population mean. But
when the programs are both quite variable and quite different, this
produces an unnecessarily large suggested number of samples.
Another option would be to say we want the confidence interval to be some
fraction of the difference between the sample means, perhaps 0.25.
So, for instance, for two programs with similar performance it suggests a
moderately large sample size:
  n  mean      sd    min    max   cmd
 50  171.6ms   6.2  158.6  184.5  bzr2.4 st /home/mbp/bzr/trunk
 50  173.4ms   6.1  156.5  183.3  bzr2.5 st /home/mbp/bzr/trunk
     +1.814ms +1.1%
difference is probably not significant at 95.0% confidence (p=0.121)
based on these samples, suggested sample size is n>=2946 to have a 0.45ms confidence interval
and with two programs that are quite different we should need fewer samples:
  n  mean      sd    min    max   cmd
 50  256.2ms  34.9  231.8  495.0  bzr2.3 st /home/mbp/bzr/trunk
 50  173.1ms   8.5  157.6  210.3  bzr2.5 st /home/mbp/bzr/trunk
     -83.074ms -32.4%
difference is significant at 95.0% confidence (p=0.000):
based on these samples, suggested sample size is n>=11 to have a 20.77ms confidence interval
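To make the W arithmetic concrete, here is a rough sketch (not judge's
actual code; in particular the way the two standard deviations are pooled
below is an assumption, so the suggested n won't exactly match the output
above):

import math

def suggested_n(mean_a, sd_a, mean_b, sd_b, fraction=0.25):
    # n = 16*sigma^2/W^2 for the two choices of W discussed above.
    sigma_sq = (sd_a ** 2 + sd_b ** 2) / 2.0        # pooling is an assumption
    w_from_diff = fraction * abs(mean_b - mean_a)   # fraction of the difference
    w_from_mean = 0.02 * mean_a                     # 2% of the old program's mean
    n_diff = math.ceil(16 * sigma_sq / w_from_diff ** 2)
    n_mean = math.ceil(16 * sigma_sq / w_from_mean ** 2)
    return n_diff, n_mean

# Figures from the first run above (bzr2.4 vs bzr2.5 'st', in ms):
print(suggested_n(171.6, 6.2, 173.4, 6.1))
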
I'm not sure this is really useful though: it's probably enough to just
know there's no difference in the first case, and I don't care that much
about the time saved by doing only 11 runs rather than 50.
> The variation in p is pretty surprising to me. From 0.018 which is
> almost VERY SIGNIFICANT down to 0.967 very-very-not-significant. This
> is using nrounds = 16.
>
> Even with 50 rounds, I could get p=0.88, though most of the time it
> was between 0.01 and 0.10.
Reading a bit more about this, it seems that we are better off having a
fixed significance level, and primarily reporting whether the results are
significant or not. You can report p to the user, but you don't want to
encourage people to think that low p values are particularly reliable,
because of course p varies quite a lot depending on the particular
samples.
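You can see that variability with a quick standalone experiment (nothing to
do with judge itself), drawing repeated samples of 16 from two
distributions shaped like the 'st' figures above and watching p bounce
around:

import random
from scipy import stats

random.seed(0)
for _ in range(5):
    # Two synthetic "programs" with means and sds like the first run above.
    a = [random.gauss(171.6, 6.2) for _ in range(16)]
    b = [random.gauss(173.4, 6.1) for _ in range(16)]
    t, p = stats.ttest_ind(a, b)
    print("p=%.3f" % p)
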
m