In the second part of this post we will see how one can reject an incomplete (from Popper’s perspective) scientific narrative using significance tests without making an appeal to the frequency interpretation of probability. Actually there are at least two paths, both of which can be seen as numerical approximations to a Bayesian selection procedure.
In the first, less general path, one would need to impose additional structure on the scientific theory being examined. In particular, the scientific theory/narrative should not only specify which data are incompatible with the hypothesis, but it should also provide that these are arranged in some order: if $x$ is impossible under the theory, so is any $x' > x$. In such a case, a scientist performing an experiment can say that if $x > x_c$ is obtained, then the theory will be falsified. Such a statement can be put back in a solid Bayesian framework if the experiment is reinterpreted as a setup yielding data that either exceed (and thus are improbable) or fall below (and thus are certain) the cutoff $x_c$. In other words, we are now viewing the experiment not as one generating specific data points (e.g. $x_1, x_2$ etc.) but as one generating the two outcomes $x > x_c$ and $x \le x_c$. The likelihood of the null hypothesis under this interpretation of the experiment, and for such a scientific theory, is thus:

$$P(x \le x_c \mid H_0) = 1, \qquad P(x > x_c \mid H_0) = 0$$
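To make the dichotomisation concrete, here is a minimal Python sketch. The standard normal null distribution, the cutoff value and all function names are illustrative assumptions of mine, not part of the argument above: the point is only that once the experiment is recorded as "exceeded the cutoff or not", the likelihood of the null is the probability it assigns to whichever of the two outcomes occurred.

```python
import math

def normal_sf(x, mu=0.0, sigma=1.0):
    """Survival function P(X > x) for a normal null distribution."""
    z = (x - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

def dichotomised_likelihood(x_obs, x_crit, mu=0.0, sigma=1.0):
    """Likelihood of the null once the experiment is reinterpreted as
    yielding only the two outcomes x > x_crit or x <= x_crit."""
    if x_obs > x_crit:
        return normal_sf(x_crit, mu, sigma)        # P(x > x_crit | H0)
    return 1.0 - normal_sf(x_crit, mu, sigma)      # P(x <= x_crit | H0)

# Illustrative cutoff at 1.96 standard deviations:
print(dichotomised_likelihood(x_obs=2.5, x_crit=1.96))   # small: ~0.025
print(dichotomised_likelihood(x_obs=0.3, x_crit=1.96))   # large: ~0.975
```

For a theory of the idealised Popperian kind, $P(x > x_{crit} \mid H_0)$ would be exactly zero rather than merely small; the normal tail here stands in for the approximate case discussed next.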
Most of the hypotheses that Fisher evaluated in his applied agricultural science work (or for that matter almost all the hypotheses we evaluate in medical research, political science, biology or economics) are not as developed as the physical theories that Popper had in mind: they do not specify regions of potentially observable data with exactly zero likelihood. Hence, if one is to apply the Popperian framework, the best one can hope for is an approximation, i.e. instead of falsifying when

$$P(x > x_c \mid H_0) = 0$$

one falsifies when

$$P(x > x_c \mid H_0) \le \epsilon$$

with $\epsilon$ a small number close to zero.
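The approximate falsification rule can be sketched in a few lines of Python. Again the normal null and the names are my own illustrative assumptions; the rule itself is just "reject $H_0$ when the observed data land in a region whose probability under $H_0$ is at most $\epsilon$":

```python
import math

def p_value_upper(x_obs, mu=0.0, sigma=1.0):
    """One-sided tail probability P(X >= x_obs | H0) under a normal null."""
    z = (x_obs - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)

def falsify(x_obs, eps=0.05, mu=0.0, sigma=1.0):
    """Approximate Popperian falsification: the theory cannot rule the
    tail out exactly, so we reject when its probability is at most eps."""
    return p_value_upper(x_obs, mu, sigma) <= eps

print(falsify(2.5))   # tail probability ~0.006 <= 0.05 -> True
print(falsify(1.0))   # tail probability ~0.159 >  0.05 -> False
```

Note that the choice of $\epsilon$ is exactly the conventional significance level, smuggled in as the degree to which "nearly impossible" is allowed to substitute for "impossible".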
The second, more general path also requires that we view the experiment as generating ranges of data rather than individual data points, but without requiring the monotone behaviour of the likelihood. More specifically, we partition the sample space, i.e. the potential measurements, into regions of low (e.g. $< 5\%$; the critical region $C$) and high (everything else) probability. In this approach, the likelihood of the individual data can be non-monotone and even multi-peaked: all that is required is that we split the potential outcomes of the measurement process into regions in which the area under the curve of the likelihood is (arbitrarily) small. By doing so, we are deliberately throwing away precision, as all data points falling in the same region of the likelihood are considered on an equal footing. Then, once the experiment is performed, we note whether the actual data (or rather the value of the statistic computed from the experimental data) fall in the rejection region in order to falsify.
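For a discrete sample space, this partition can be built greedily: collect the least likely outcomes until their total mass reaches the chosen bound. The two-peaked probability mass function below is an invented example (the function name and the specific numbers are mine); it shows that nothing about the construction needs the likelihood to be monotone.

```python
def rejection_region(pmf, eps=0.05):
    """Split a discrete sample space into a low-probability critical
    region (total mass <= eps) and its high-probability complement,
    by greedily collecting the least likely outcomes under H0."""
    region, mass = set(), 0.0
    for outcome, p in sorted(pmf.items(), key=lambda kv: kv[1]):
        if mass + p > eps:
            break
        region.add(outcome)
        mass += p
    return region, mass

# A deliberately two-peaked null distribution over 7 outcomes:
pmf = {0: 0.01, 1: 0.10, 2: 0.30, 3: 0.02, 4: 0.30, 5: 0.24, 6: 0.03}
region, mass = rejection_region(pmf)
print(region, mass)   # outcomes 0 and 3 form the critical region (mass 0.03)

x_obs = 3
print(x_obs in region)   # falls in the rejection region -> falsify
```

All outcomes inside the region are treated identically, which is precisely the loss of precision mentioned above: the procedure records only "in $C$" or "not in $C$".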
Obviously, either interpretation of Fisher’s approach to p-values as an approximate Bayesian evaluation of alternatives to $H_0$ is a personal one. Nevertheless, the derivation above goes some way towards showing the value of close scrutiny of the “inner workings” and the mathematical basis of such procedures. The take-home points are the following:
- there are no free lunches in the choice-among-narratives world. Only some very developed theories can be rejected in isolation, and these are very unlikely to be found in the “soft” sciences. For all other fields (medicine in particular), the “alternative-free” modus operandi of p-values is a dangerous illusion, both in terms of false positives and false negatives
- for such under-developed theories, the magnitude of the p-value may be used as a measure of the implausibility of the null when p-values are viewed as approximations to the Bayesian way of choosing among narratives. Note that this interpretation is specifically forbidden when one sticks with the frequentist interpretation of p-values.