In defence of inferential statistics

Letters respond to an APA journal banning the reporting of null hypothesis significance testing procedures and confidence intervals.

20 April 2015

In February Basic and Applied Social Psychology (an American Psychological Association journal) announced that it was banning the reporting of null hypothesis significance testing procedures (NHSTP) and confidence intervals (CI) (Trafimow & Marks, 2015). We are writing to express our hope that the journals published by the British Psychological Society will not be lured into similarly banning CIs and distancing psychology from medical research in which CIs are routinely employed. We believe that CIs offer an as yet undeveloped but potentially very valuable tool for psychologists to interpret their data (see e.g. Smith & Morris, in press). Any ban that involves throwing out the CI baby with the NHSTP bathwater should be avoided.

Trafimow and Marks (2015) condemn CIs because, they say, ‘A 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval. Rather, it means merely that if an infinite number of samples were taken and confidence intervals computed, 95% of the confidence intervals would capture the population parameter.’

It is true, as Cumming (2012, p.79) points out, that the 95 per cent refers to the whole process of taking a sample and calculating a CI: across repeated samples, 95 per cent of the intervals so calculated will capture the population mean. However, it follows that the 95 per cent CIs that you calculate will most likely capture the population parameter.
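The repeated-sampling reading of a 95 per cent CI can be checked with a short simulation. The sketch below is illustrative only and is not taken from Trafimow and Marks or from Cumming: it assumes normally distributed scores with a known population standard deviation, and the function name `coverage` and all parameter values are hypothetical choices made for the example.

```python
import random
import statistics


def coverage(n_samples=10_000, n=20, mu=100.0, sigma=15.0, seed=1):
    """Fraction of 95% CIs (z interval, sigma assumed known) that contain mu."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n and build its interval around the sample mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    print(f"Proportion of intervals capturing mu: {coverage():.3f}")
```

Under these assumptions the proportion of intervals that capture the population mean should sit close to 0.95, which is the sense in which an interval calculated from a single sample will "most likely" contain the parameter.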

The great value of CIs is that they provide probabilistic information about the true location of the population mean. NHSTP deals with the normally uninteresting null hypothesis: the probability of the data if the difference or relationship is zero, or some other specific value. CIs help us conceptualise the plausible locations of the parameter (e.g. population mean or effect size), and the variability or precision of that estimate. As Smith and Morris (in press) point out, when we know both an effect size and its CI we can make a much more useful interpretation of the results of our research than when we have an effect size alone. We know of no alternative to standard errors in some form, such as CIs, for describing the likely variability in our effect size if we repeat our research. Given the relatively small sample sizes of much psychology research, the CIs of the effect sizes can be disconcertingly large and remind researchers that a simple effect size, or other point estimate, can suggest a precision that is not justified. Failure to report this variability does not make it go away but does expose those following up the research to dangers of misinterpretation.
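To illustrate how wide the CI of an effect size can be with small samples, the following sketch computes an approximate interval for Cohen's d. It assumes two independent groups and uses the common large-sample approximation to the standard error of d with a normal critical value; the function name `cohen_d_ci` and the example values (d = 0.5, various group sizes) are hypothetical and do not come from the letter.

```python
import math
from statistics import NormalDist


def cohen_d_ci(d, n1, n2, level=0.95):
    """Approximate CI for Cohen's d (large-sample standard error, normal quantile)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se


if __name__ == "__main__":
    # Same observed effect size, increasingly large groups.
    for n in (20, 80, 320):
        low, high = cohen_d_ci(0.5, n, n)
        print(f"n per group = {n:3d}: d = 0.50, 95% CI [{low:.2f}, {high:.2f}]")
```

With 20 participants per group, this approximation gives an interval running from roughly -0.13 to 1.13 around an observed d of 0.5, so the data are compatible with anything from no effect to a very large one.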

Trafimow and Marks’s (2015) proposed replacement for the banned NHSTP and CIs is to require bigger sample sizes and the reporting of descriptive statistics with frequency and distributional data. In general, such information is welcome. However, the reason for the original development of NHSTP was that it is always necessary to decide whether or not to act in the future as if a real effect is likely. CIs of effect sizes give good guidance to such decisions, but it is not clear upon what evidence these fundamental decisions will be based if CIs and NHSTP are banned.

Another issue with demands for larger samples is that psychology researchers are inevitably faced with limitations through cost and time upon the number of participants that they can test. Resources devoted to doubling sample sizes for one study are not then available for new research questions. If the original sample size was, in fact, sufficient, there is a serious ethical and practical question of whether an unnecessary increase in sample sizes will do more harm than good to the future of psychology. How will one decide if the sample is large enough? Given that the purpose of larger samples is to increase the precision of the estimates, reporting that precision should be required, rather than forbidden. Until there are alternative and generally accepted means of answering the question ‘Could the effects have arisen by chance?’, we recommend reporting CIs and, where researchers find them helpful, NHSTP.
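The link between sample size and precision can be made concrete. The sketch below assumes a normal-approximation CI for a single mean with a standard deviation treated as known; the function names `half_width` and `n_for_half_width` and the numerical values are hypothetical, chosen only to illustrate the cost of extra precision.

```python
import math
from statistics import NormalDist


def half_width(sd, n, level=0.95):
    """Half-width of a normal-approximation CI for a mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / math.sqrt(n)


def n_for_half_width(sd, target, level=0.95):
    """Smallest n whose CI half-width is at most the target."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((z * sd / target) ** 2)


if __name__ == "__main__":
    for n in (40, 80, 160):
        print(f"n = {n:3d}: half-width = {half_width(15.0, n):.2f}")
    print("n needed for half-width 5:", n_for_half_width(15.0, 5.0))
```

Because the half-width shrinks with the square root of n, doubling the sample narrows the interval by only about 29 per cent. Deciding whether that gain justifies the cost requires exactly the precision information that a CI reports.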

Peter Morris, Catherine Fritz, Graham Smith, Amar Cherchar, Robin Crockett, Chris Roe, Roz Collings, Kimberley Hill, David Saunders, Martin Anderson and Lucy Atkinson
University of Northampton

References
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Smith, G.D. & Morris, P.E. (in press). Building confidence in confidence intervals. The Psychologist.
Trafimow, D. & Marks, M. (2015). [Editorial]. Basic and Applied Social Psychology, 37(1), 1–2. doi:10.1080/01973533.2015.1012991

Editor’s note: Graham Smith and Peter Morris’s article ‘Building confidence in confidence intervals’ is scheduled to appear in The Psychologist in June.

I’ve recently read of the abolition of p-values by the journal Basic and Applied Social Psychology (BASP) (Woolston, 2015). Whilst there are clearly issues with the misrepresentation or misunderstanding of what p-values mean, it seems a little radical to eradicate them altogether. Unfortunately, even within the intellectual arena, where we are encouraged to apply more scope and think less in terms of black and white, there still exists a bivalent division of opinion on null hypothesis significance testing (NHST). The fact is that NHST is not a bivalent issue, and therefore the energy expended on arguing only one way or the other is wasted.

Contrary to that which is often implied – at least within the social sciences – the p-value doesn’t exist as an instruction to accept or reject the null hypothesis, but rather advises us on how seriously to take the data that we have analysed. Yes, the p-value tells us how likely our data is to occur under the null hypothesis, but it is not statistically strong enough to stand as a lone witness to the alternative hypothesis – it stands or falls conditionally on associated variables (e.g. effect size, sample size). Elimination of the p-value from BASP is a prime example of bivalent, simplistic thinking. Although I feel that I’m merely stating the obvious here, would it not be far better to insist that articles must feature sampling statistics, effect sizes and confidence intervals alongside p-values; moreover, why not insist on a lower alpha value, say < .01?
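The dependence of the p-value on sample size can be shown with a small calculation. This is a sketch under simplifying assumptions: a two-sided z-test comparing two equal-sized independent groups, with the standardised difference d treated as if the population variance were known. The function name `two_sided_p` and the example values are hypothetical.

```python
import math
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Two-sided p-value for a standardised mean difference d (normal approximation)."""
    z = d * math.sqrt(n_per_group / 2)  # test statistic for two equal groups
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # The same effect size evaluated at different sample sizes.
    for n in (20, 50, 200):
        print(f"d = 0.30, n per group = {n:3d}: p = {two_sided_p(0.3, n):.4f}")
```

An identical effect size of 0.3 is far from significant with 20 participants per group yet falls below .01 with 200, which is why a p-value reported without the effect size and sample size says little on its own.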

A ‘smear campaign’ against the p-value implies that use of the statistic has been noted as problematic and the knee-jerk reaction is ‘Let’s get as far away from this as possible’. This isn’t the logical, measured approach that we should expect from those we rely on to publish our studies and review our submissions, but more comparable to the spin-doctor response that is often so glaringly obvious within the political arena! Given that most brain-related research seems to suggest that the most efficient and powerful result is derived from a combination of several elements working together toward a common goal, it seems surprising that we have missed this analogic lesson when addressing our use of statistical analyses – why not argue for all ways, used together, correctly?
Lee Barber
University of Reading

Reference
Woolston, C. (2015). Psychology journal bans P values. Nature, 519(7541), 9.