I was thinking about the phenomenon of p-hacking (or data dredging), and how it's much more likely to produce apparently significant results than proper hypothesis testing would.
In particular, I'm considering the case where a researcher keeps increasing the sample size until statistical significance is reached. (As in, "Oh we ran the test with 20 participants, and the data are suggestive but not significant, so let's test another person," repeated until significance is reached.)
My question is, if this process is allowed to continue indefinitely, what's the probability of eventually hitting a statistically significant result (at some predefined level of significance)?
For example, when I modeled it as a thousand runs of 100 coin flips, 54 had few enough total heads for p<0.05, while 194 reached "too few heads" significance at some point during the run (i.e. 140 random walks stumbled into and then back out of significance). When I did a thousand runs of a thousand coin flips, 47 had "too few" heads overall while 346 reached significance at some point, meaning an additional 150ish of the random walks that never stumbled into significance in the first 100 steps managed to do so at least once in the subsequent 900.
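A minimal sketch of that kind of simulation, in case anyone wants to reproduce it. The post doesn't say which test was used, so this assumes an exact one-sided binomial test ("too few heads", p < 0.05) checked after every single flip; the function names `critical_values` and `simulate` are made up for illustration, and the counts will differ from the ones quoted above depending on the seed and the test.

```python
import math
import random


def critical_values(max_n, alpha=0.05):
    """crit[n] = largest head count h with P(X <= h) < alpha
    for X ~ Binomial(n, 1/2), or -1 if no head count is significant."""
    crit = [-1] * (max_n + 1)
    for n in range(1, max_n + 1):
        total = 2 ** n          # number of equally likely flip sequences
        cum = 0                 # number of sequences with <= k heads
        for k in range(n + 1):
            cum += math.comb(n, k)
            if cum / total >= alpha:
                break
            crit[n] = k
    return crit


def simulate(num_runs=1000, num_flips=100, alpha=0.05, seed=0):
    """Return (runs significant at the end, runs significant at any point)."""
    rng = random.Random(seed)
    crit = critical_values(num_flips, alpha)
    significant_at_end = 0
    significant_ever = 0
    for _ in range(num_runs):
        heads = 0
        ever = False
        for n in range(1, num_flips + 1):
            heads += rng.random() < 0.5      # one fair coin flip
            if heads <= crit[n]:             # "too few heads" right now?
                ever = True
        significant_ever += ever
        significant_at_end += heads <= crit[num_flips]
    return significant_at_end, significant_ever


if __name__ == "__main__":
    print(simulate(1000, 100))
```

Every run that is significant at the end is also counted as significant at some point, so the second number can never be smaller than the first.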
Is "eventually stumbling into significance" the kind of tail event that will almost surely happen as the runs are allowed to get arbitrarily long, or is there some limit strictly less than 1?
Also, is there a known expression for the probability of stumbling into significance at some point on a walk of length N (i.e. an expression which would give something near 19.4% for N=100 and something near 34.6% for N=1000)?
Mathematics of p-hacking: random walks and significance

 Posts: 220
 Joined: Tue Jun 17, 2008 11:04 pm UTC
Re: Mathematics of p-hacking: random walks and significance
If the underlying process is exactly zero mean, or fair with respect to what you're testing (i.e. fair coin flips), then stumbling into significance will almost surely happen if you are allowed to keep going arbitrarily long, but a multiplicative decrease in the number of runs that have not yet hit significance requires a multiplicative increase in how long you keep going. I.e. the inverse of the proportion of runs that have not yet hit significance should be asymptotically polynomial-ish in the run length, with the exponent depending on your threshold on p.
Basic intuition:
 Do 100 flips. Did you ever hit significance?
 No? Okay, do 10000 more flips, which is so much more data that it should completely wash out and make negligible the result of the 100 flips and give you another almost independent "chance" to find p < 0.05. Did you ever hit significance?
 No? Okay, do 1000000 more flips, which is so much more data that it should completely wash out and make negligible the result of the 10000 flips and give you another almost independent "chance" to find p < 0.05. Did you ever hit significance?
 ... etc
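A rough way to see why this intuition gives polynomial-ish decay rather than anything faster. This is only a heuristic sketch: it assumes each new block gives an independent chance q of hitting significance and that each block multiplies the total number of flips by r (r = 100 in the steps above). Both q and r are illustrative symbols, not values derived in this thread.

```latex
\begin{align*}
N_k &\approx N_0\, r^{k}
    && \text{total flips after $k$ further blocks} \\
\Pr(\text{no significance by } N_k) &\approx (1-q)^{k}
    && \text{treating the blocks as independent} \\
\Pr(\text{no significance by } N) &\approx \left(\frac{N}{N_0}\right)^{-c},
    \quad c = \frac{-\ln(1-q)}{\ln r} > 0
    && \text{substituting } k \approx \frac{\ln(N/N_0)}{\ln r}
\end{align*}
```

Under those assumptions, shrinking the never-significant fraction by a fixed factor takes a fixed multiplicative increase in N, and a stricter threshold on p (smaller q) gives a smaller exponent c.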
I don't know an exact expression though.
On the other hand, if the underlying process is slightly biased rather than exactly fair, such as coins biased in favor of heads, then you don't get this sort of long asymptotic tail. Roughly, stumbling into significance on the "correct" side will happen with rapidly increasing chance as you start reaching the amount of data needed to distinguish the bias from noise, whereas doing so on the "wrong" side will only happen with some total probability strictly less than 1.
As a result, I think the "increase your sample size until significance" issue alone, unlike things like publication bias and most other experimenter degrees of freedom, can be dealt with if you just pay attention to confidence intervals on effect magnitudes rather than only the sign of the result. If someone really has no other degrees of freedom and is obligated to publish everything, and the process underneath is truly zero mean, then repeatedly running studies that continue until significance eventually forces the researcher to publish results where they had to collect arbitrarily much data before hitting significance, and therefore to publish confidence intervals on effect magnitudes that are arbitrarily narrow and close to zero. And if the process wasn't truly zero mean, then you can only reach the "wrong" conclusion with bounded probability, and eventually you will have to publish studies that, taken together, make you strongly confident in the effect in the correct direction with the correct magnitude.
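To make the confidence interval point concrete, here is a small sketch using the normal-approximation (Wald) interval for the proportion of heads. The choice of interval, the helper name `wald_interval`, and the two example data sets are all assumptions picked for illustration, not results from any run in this thread.

```python
import math


def wald_interval(heads, n, z=1.96):
    """Approximate 95% confidence interval for P(heads),
    using the normal approximation to the binomial."""
    p_hat = heads / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    # A small study and a huge one, both "significantly" short of heads.
    for heads, n in [(40, 100), (499_000, 1_000_000)]:
        lo, hi = wald_interval(heads, n)
        print(f"n={n}: interval for P(heads) is [{lo:.5f}, {hi:.5f}]")
```

Both intervals exclude 0.5, but the second one only admits effects of a fraction of a percentage point, which is the sense in which a result that needed a huge sample to reach significance mostly advertises how small the effect must be.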
Alternatively, you can also consider things like SPRT.
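For reference, a minimal sketch of Wald's SPRT for a coin, which is designed to be checked after every flip. The alternative hypothesis p1 = 0.6 and the error rates are hypothetical choices for illustration, and `sprt_coin` is a made-up helper name.

```python
import math


def sprt_coin(flips, p0=0.5, p1=0.6, alpha=0.05, beta=0.05):
    """Sequential probability ratio test of H0: P(heads) = p0
    against H1: P(heads) = p1.

    `flips` is an iterable of booleans (True = heads).
    Returns (decision, number of flips consumed)."""
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, is_head in enumerate(flips, start=1):
        if is_head:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n


if __name__ == "__main__":
    print(sprt_coin([True] * 100))
    print(sprt_coin([False] * 100))
```

Unlike "test, peek, add one more participant", the stopping thresholds here are fixed in advance from the error rates you are willing to accept, so looking after every flip is part of the design.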
 gmalivuk
 GNU Terry Pratchett
 Posts: 25519
 Joined: Wed Feb 28, 2007 6:02 pm UTC
 Location: Here and There
Re: Mathematics of p-hacking: random walks and significance
Yeah, I know there are ways to deal with sequential testing, just as there are ways to adjust for multiple comparisons. I was just curious about the probabilities when things are being done improperly, either through naivete or dishonesty. (For example, as you increase the number of separate ("sufficiently" independent, whatever that may turn out to mean) questions you ask about totally random data, the likelihood of hitting p<0.05 for one of them goes to 1.)
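For the fully independent version of that parenthetical, the arithmetic is short enough to sketch. This assumes every test is exactly independent and every null hypothesis is true, and `chance_of_any_hit` is a made-up helper name.

```python
def chance_of_any_hit(num_tests, alpha=0.05):
    """Probability that at least one of `num_tests` independent tests
    on pure noise comes out with p < alpha."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    for k in (1, 5, 14, 50, 100):
        print(k, round(chance_of_any_hit(k), 4))
```

With alpha = 0.05 the chance of at least one hit passes one half at 14 independent questions and keeps climbing toward 1 from there.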