Failed replications only to be expected

Alexander Bird with a different view on psychology's 'crisis'.

It is widely held that some fields of science – social psychology and clinical medicine in particular – are suffering from a crisis of reproducibility. Too often results that were taken to prove some interesting effect cannot be replicated. Later studies either find a much smaller effect or even no effect at all. Following the 2010 publication of ‘Power posing: Brief nonverbal displays affect neuroendocrine levels and risk tolerance’, one of its authors, Amy Cuddy, achieved a degree of fame unusual among academic psychologists when she publicised its results in a popular TED talk and book. Yet the fame turned sour when subsequent research failed to reproduce key results of the 2010 paper.

A degree of rancour surrounded the subsequent exchange between those who rejected Cuddy’s research and those who supported her. The reproducibility of research is regarded as a key feature of good science. Consequently, when research cannot be reproduced, it is tempting to blame the researcher. Even if not stated explicitly, the suspicion hangs in the air that a non-reproducible result is the outcome of poor scientific practice or worse: a badly designed or conducted experiment, confirmation bias, p-hacking, some other questionable research practice, publication bias – or even downright fraud, as the Diederik Stapel case reminds us.

In such a context it is important to be aware that even research carried out to a high standard might fail to be replicated. For we ought to expect a high proportion of ‘successful’, well-conducted studies to report effects that are not in fact real – and so to fail when replication is attempted.

The reason for this is revealed by the fallacy of base-rate neglect. In a well-known example, medical students at Harvard were told that a screening programme had been introduced to test for a disease that affects 1 in 1000 in the population. The test is 95 per cent reliable: it returns a false positive 5 per cent of the time. If someone tests positive, how likely is it that they in fact have the disease? A small minority of the students answered correctly: the chance of having the disease is only 2 per cent. The fallacy committed by those who thought the disease to be highly likely involves ignoring the very low background rate of disease in the population. In a group of 1000 people, 999 do not have the disease, and 5 per cent of these – about 50 individuals – will test positive. A positive result is therefore far more likely to come from one of those 50 or so false positives than from the single genuine case.
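The arithmetic behind the 2 per cent answer can be checked with a short calculation. This is a sketch under two assumptions: the ‘95 per cent reliable’ figure is read as a 5 per cent false positive rate, and the test is taken to detect every genuine case.

```python
# Bayes' theorem applied to the screening example.
# Assumptions: prevalence 1 in 1000, a 5 per cent false positive
# rate, and (assumed) no false negatives.
prevalence = 1 / 1000
false_positive_rate = 0.05
sensitivity = 1.0  # assumed: the test catches every genuine case

# Overall probability of a positive test, by total probability.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Probability of disease given a positive test (positive predictive value).
ppv = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {ppv:.1%}")  # roughly 2 per cent
```

The answer depends almost entirely on the prevalence: make the disease ten times rarer and the predictive value of a positive test falls roughly tenfold too.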

Some areas of science will be like the screening programme. It is not easy to come up with hypotheses that are true, and it is much easier to generate ideas that, though interesting, are in fact false. The true hypotheses will be outnumbered by the false just as the diseased individuals are outnumbered by the disease-free. Null hypothesis significance testing is like the screening test. When we reject a null hypothesis because the p-value is less than 5 per cent, we are using a ‘screening test’ for truth that is 95 per cent reliable (as regards false positives, i.e. type I errors). If true hypotheses were as rare as 1 in 1000, then only 2 per cent of positive results would in fact reveal the truth – 98 per cent would be false positives. So low a rate of truth amongst tested hypotheses is hugely implausible; a rate of 10 per cent is a more reasonable estimate. With that background rate of true hypotheses, standard, well-conducted significance testing will produce one false positive result for every two true positives. And so a third of good science of this type might be in error.
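The same calculation can be run for hypothesis testing. The sketch below uses the article's 10 per cent base rate, and additionally assumes studies with 90 per cent statistical power – a figure not stated in the text, but one that reproduces its one-in-three estimate:

```python
# False discovery rate under null hypothesis significance testing.
# Assumptions: alpha = 0.05; statistical power of 0.9 (an assumption
# needed to recover the one-in-three figure); base rate of 10 per cent
# true hypotheses among those tested.
alpha = 0.05       # type I error rate (false positive rate)
power = 0.9        # assumed probability of detecting a true effect
base_rate = 0.10   # estimated proportion of tested hypotheses that are true

true_positives = power * base_rate          # 0.09 of all tests
false_positives = alpha * (1 - base_rate)   # 0.045 of all tests

fdr = false_positives / (true_positives + false_positives)
print(f"Share of positive results that are false: {fdr:.0%}")  # one in three
```

On these assumptions there are two true positives for every false one, so a third of the ‘significant’ results are spurious – without any questionable research practice at all.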

What should be done about this? Bayesians will reject the significance testing approach. Frequentists might call for the p-value threshold (‘alpha’) to be reduced, from 5 per cent to, say, 0.5 per cent. While I favour this approach for clinical medicine, it might be difficult for psychology, with fewer resources, to achieve. That said, adopting the medical model of multicentre research would allow for larger studies and so for smaller alpha.
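Lowering alpha changes the arithmetic considerably. A sketch, reusing the assumed 90 per cent power and 10 per cent base rate from the screening analogy above:

```python
base_rate = 0.10   # assumed proportion of true hypotheses
power = 0.9        # assumed statistical power

def false_discovery_rate(alpha):
    """Fraction of 'significant' results that are false positives."""
    false_pos = alpha * (1 - base_rate)
    true_pos = power * base_rate
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: {false_discovery_rate(alpha):.1%} of positives false")
```

On these assumptions, cutting alpha from 5 per cent to 0.5 per cent reduces the share of false positives from about a third to under 5 per cent – though, as the text notes, the larger samples this demands may be beyond the resources of much psychological research.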

In due course we might hope that theory will increase the base-rate of truth in the hypotheses we generate. But a well-founded theory will itself require reliable empirical results. In the meantime, we could adopt the ‘quietist’ approach to science. Rather than reforming science, we could just be aware that, for the reasons given, science in new and difficult fields is highly fallible. For that reason replication studies are important. If science is to be self-correcting, we should deprecate journals that will not publish such research. At the same time we should also recognise that a failed replication may just represent bad luck rather than bad science.

Professor Alexander Bird
Department of Philosophy
King’s College London

See also ‘Understanding the Replication Crisis as a Base Rate Fallacy’ (British Journal for the Philosophy of Science, 2018).
