judge (was Re: this week in ~canonical-bazaar)
Martin Pool
mbp at canonical.com
Wed Oct 26 00:09:26 UTC 2011
On 25 October 2011 23:41, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> I must say I'm pleased to see practitioners actually using statistics
> in their work.
>
> But you might want to get a pro to help you with it. :-)
Yes, I was going to look for one once I thought I had it basically
working. Are you a stats pro?
I am aware it is the kind of thing where it's very easy to get
something that seems valid but is actually not well founded. For
example, <http://www.badscience.net/2011/10/what-if-academics-were-as-dumb-as-quacks-with-statistics/>
reports that half the papers in prestigious neuroscience journals make
a particular statistical mistake.
Part of my idea here is that if we can get a bit of mechanized wisdom
about how to validly compare two series of timing data, that would be
more reliable than people either just eyeballing it, or making their
own guess about what technique to use.
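To make that concrete, something along these lines (a rough sketch, not
the code actually in judge; the timings and the choice of Welch's t-test
are just for illustration):

from scipy import stats

# Two made-up series of wall-clock timings for the same operation.
times_a = [1.92, 2.10, 1.98, 2.05, 2.01, 1.95, 2.08, 2.00]  # old code
times_b = [1.71, 1.80, 1.76, 1.69, 1.83, 1.74, 1.78, 1.72]  # new code

# Welch's t-test does not assume the two series have equal variance.
t_stat, p_value = stats.ttest_ind(times_a, times_b, equal_var=False)
print("t = %.3f, p = %.4f" % (t_stat, p_value))
if p_value < 0.05:
    print("the difference in mean run time is unlikely to be chance")
else:
    print("no confident difference detected at the 5% level")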
Beyond that, the currently pushed code is very early and may well still
have some quite silly implementation flaws.
> > The variation in p is pretty surprising to me.
>
> It doesn't surprise me. With a standard deviation of 4 in the
> population and 16 iid observations in each sample, the standard error
> of the mean of each sample is ~1. With a difference of 2 in the
> population means, the mean of command A should be *greater* than the
> mean of command B a significant fraction of the time (surely > 3%).[1]
> So A and B should be statistically close, with a p value near 1,
> fairly frequently.[2]
Right, I'm not surprised to see it vary. However, I think John is
broadly on the right track in wanting repeated sets of trials of the
same two commands to give fairly stable results; if they don't, then
judge is not much use for saying whether there is really a difference
or not.
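As a quick check of that, a simulation along the lines Stephen
describes (the concrete numbers are my assumption: means differing by
2, standard deviation 4, 16 observations per sample, so the standard
error of each sample mean is 4/sqrt(16) = 1, and a two-sample t-test
just as an example) does show the p value jumping around between
repeated sets of trials:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(20):                      # 20 repeated sets of trials
    a = rng.normal(10.0, 4.0, size=16)   # command A
    b = rng.normal(12.0, 4.0, size=16)   # command B, slower by 2 on average
    p_values.append(stats.ttest_ind(a, b, equal_var=False).pvalue)

# Typically spread from under 0.05 up past 0.5 across the 20 sets.
print(["%.3f" % p for p in p_values])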
> > Anyway, just a thought, using Power to indicate the confidence in your
> > confidence can be useful.
>
> Yeah, it can, but power is controversial even among statisticians,
> because you need to make strong assumptions that you have no way of
> justifying to compute it at all in most cases.
>
> If you want to compute power, probably the most plausible strategy is
> to use the observed means. But then interpretation is slippery:
>
> Statistician: "Although we could not reject the hypothesis that
> the means are the same, the power of the test is small because
> even if the means are different, the difference is very small."
>
> User: "If the difference is that small, I don't care, anyway."
Right, and this suggests that perhaps we should invert it: rather than
giving the probability that the difference in means would not arise by
chance, give the percentage difference in mean value at a given
confidence level, e.g. 99% sure that the new program is >5% faster.
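Roughly, something like this (a sketch only; the bootstrap approach,
the 99% level and the timings are illustrative, not what judge does
today):

import numpy as np

def percent_speedup_ci(old, new, level=0.99, n_boot=10000, seed=0):
    # Bootstrap confidence interval for the percentage reduction in
    # mean run time going from 'old' to 'new'.
    rng = np.random.default_rng(seed)
    old = np.asarray(old, dtype=float)
    new = np.asarray(new, dtype=float)
    speedups = []
    for _ in range(n_boot):
        o = rng.choice(old, size=old.size, replace=True)
        n = rng.choice(new, size=new.size, replace=True)
        speedups.append(100.0 * (o.mean() - n.mean()) / o.mean())
    lo, hi = np.percentile(speedups, [(1 - level) / 2 * 100,
                                      (1 + level) / 2 * 100])
    return lo, hi

old_times = [2.05, 1.98, 2.10, 2.02, 1.95, 2.07, 2.01, 2.04]
new_times = [1.84, 1.79, 1.90, 1.82, 1.77, 1.88, 1.80, 1.85]
lo, hi = percent_speedup_ci(old_times, new_times)
print("99%% confident the new program is %.1f%% to %.1f%% faster" % (lo, hi))

If the whole interval sits above some threshold like 5%, that reads
directly as "99% sure the new program is more than 5% faster".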
m