Massive collaboration

Jon Brock goes inside the Psychological Science Accelerator.

At school, Neil Lewis Jr was always the ‘smart Black kid’. When he was nine, his family emigrated from his birthplace in Jamaica to Florida, and he soon learned that his new classmates had low expectations of Black students. ‘They had a stereotype that Black people are not smart’, he explains, ‘so it surprised them that I did so well’. That sense of being judged in the light of racial stereotypes, he adds, has never really gone away. Through high school, university, and even now, as an Assistant Professor at Cornell University, he admits to a constant, nagging concern that any slip-ups on his part would only confirm other people’s preconceptions. ‘For me it’s still a regular experience being an academic where I’m often the only Black person in the room.’

As an undergraduate studying social psychology, Lewis learned about a study that resonated with his own experiences of racial stereotypes. The researchers, Claude Steele and Joshua Aronson, gave students at Stanford University a brief test. Some were told that it was measuring their academic aptitude. Others were told that it was a puzzle to be solved. For White students, these instructions made no difference to their performance. But amongst Black students, being told that their academic ability was being assessed led to poorer performance. In a further experiment, Black students performed worse if, prior to the test, they had been exposed to negative stereotypes about Black people. Again, White students were unaffected.

Steele and Aronson named this phenomenon ‘stereotype threat’. They argued that the distraction and anxiety caused by stereotypes can lead to poorer cognitive performance, becoming, in effect, a self-fulfilling prophecy. ‘I found it fascinating that scientists had actually studied this’, Lewis says. ‘It was one of the theories that really got me interested in becoming a social scientist in the first place.’

Since it was published in 1995, Steele and Aronson’s original study has been cited by over 9500 other research papers. It has also had real-world impact, prompting colleges and universities to adopt programs aimed at minimising stereotype threat and improving educational outcomes. In 2013, the concept of stereotype threat reached the US Supreme Court when the University of Texas at Austin was forced to defend its racial diversity policy for student admissions. It’s also been applied to other stereotypes – that girls aren’t cut out for maths, for example, or that elderly people necessarily have poor memory.

Lately, however, the evidence for stereotype threat has started to come undone. Attempts to replicate key findings have failed. Meta-analyses that pool the results of many individual studies suggest that the effects are smaller or more variable than originally thought – if they exist at all. Like others in the field, Lewis has begun to have doubts. ‘The phenomenon of being concerned that you might be judged in the light of these negative stereotypes, I think that’s real’, he says. What remains unclear is the extent to which those experiences actually affect performance. ‘Right now’, he admits, ‘I don’t know that we have a good sense of that’.

To try and answer this question, Lewis has turned to the Psychological Science Accelerator, a worldwide network of researchers prepared to take part in large-scale collaborative psychology studies.

Core features
The plan, described in a preprint co-authored by Lewis and 43 other Accelerator members, is to conduct the largest-ever test of racial stereotype threat, with 2650 students recruited from 27 different sites around the United States. The locations are diverse in terms of the demographics and history of the area, the racial mix of students, and the nature of the educational institution, ranging from elite universities to community colleges. This, as Lewis explains, is by design. ‘We’re really trying to figure out what’s going on’, he says. ‘How robust is the phenomenon? How heterogeneous is the phenomenon?’

By pooling their data, the team can investigate whether the stereotype threat effect varies systematically. Theory predicts, for example, that it should be stronger in more prestigious institutions where the pressure for academic success is greatest and where under-representation of minorities is most apparent. This could perhaps explain why some researchers can find the effect and others cannot. But most importantly, Lewis argues, understanding this variability could help predict where interventions are most likely to be effective. ‘The Accelerator provides a platform to tell us pretty precisely where these things are likely to work and where they’re not going to work’, Lewis says.

The stereotype threat project is one of six currently being run through the Psychological Science Accelerator. The studies ask different questions and are at different stages of completion. But each project shares core features – large and diverse samples collected in multiple locations and, correspondingly, a large and diverse team of research collaborators. It’s a new – or at least a non-traditional – way of conducting psychology research.

Until recently, experimental psychologists typically worked in small teams. They tested a few dozen research participants, often students at the university or members of the local community. And they inferred from those experiments general principles of how the human mind works. This approach has been productive in terms of generating interesting new observations, ideas, and theories. But the past decade has seen growing concerns about the trustworthiness and validity of many of these findings.

It’s not only stereotype threat that is being questioned. Other phenomena such as age priming, power posing, and ego depletion were also widely accepted, featuring prominently in textbooks and introductory lectures, but they are now under a cloud of suspicion, with independent researchers unable to replicate the core findings. Even when the findings can be replicated, the small-scale, parochial nature of psychology research means that it’s often unclear whether they generalise to other cultures or demographics.

The Accelerator is an attempt to address these issues. As the name suggests, it takes its inspiration from physics, where massive collaboration has become the norm. Chris Chartier, the Accelerator’s founder, admits that it’s something of a rhetorical device, ‘a bit tongue-in-cheek’. In physics, he notes, international research consortia have emerged from the need to share million (and sometimes billion) dollar infrastructure – CERN, LIGO, the Square Kilometre Array, and suchlike. There are no comparable research facilities in psychology. But the idea of physicists around the world working together to solve fundamental problems like gravitational waves and the Higgs boson provides ‘big picture inspiration’ for what psychologists might be able to achieve.

Improving psychological science
Chartier is an Associate Professor at Ashland University, a small teaching university in rural Ohio, midway between Cleveland and Columbus. When he moved to Ashland in 2013, he imagined his research career winding down. ‘I thought I was going to have this quiet life in the country’, he says, ‘teach my students, raise my kids’. He realised that the best way to make an impact with less time and fewer resources was to join a larger collaboration. As a first step, he and his students signed up to two collaborative projects – the Reproducibility Project: Psychology and Many Labs 3 – that were attempting mass systematic replications. That led, in 2016, to his participation in the first meeting of the Society for the Improvement of Psychological Science (SIPS).

SIPS, Chartier says, is unlike a normal academic conference. Instead of the traditional keynote presentations and poster sessions, the program is full of workshops and hackathons where people discuss problems and brainstorm solutions. In one of those workshops, he came up with the idea for Study Swap, a website where researchers can post study ideas or resources they need to pull off a study. It launched in March 2017 and has already spawned a number of collaborations, allowing small labs to join forces to increase their sample sizes or replicate each other’s findings. But by the time of the second SIPS meeting, he was already starting to think bigger.

The inspiration for the Accelerator came a few weeks later, Chartier says, with the August 2017 solar eclipse. ‘I guess the eclipse itself puts you in the right frame of mind’, he tells me. ‘But I was also moved by the concept that we – I mean the global human we – could so confidently predict such an event so far in advance. At one point, how baffling it was that this thing would occasionally happen and now we know when it’s going to happen to the second. And obviously that’s a cumulative contribution of many, many people over many years.’

That weekend on a long bike ride, the concept of a ‘CERN for psychological science’ was born. ‘I came home, asked my wife if she would continue to solo parent for another hour, hammered out the blogpost, tweeted it out. And then it just took off from there.’

That initial blogpost envisaged a network of hundreds of data collection laboratories around the world working collaboratively on shared projects. There would be a commitment to the highest levels of transparency throughout the life cycle of a study. Projects would not just be replication attempts but would include original investigations of new ideas. And critically, there would be a democratic process for the selection of studies. The idea, Chartier admits, was the easy part. The real challenge has been sorting out the nuts and bolts. ‘It’s been two years of “That’s cool, we should do that. But how the hell do we do that?”’

The experience with the Reproducibility Project and Many Labs replication projects has been important, Chartier says. It’s shown how some of those practical hurdles can be overcome. But with those and other similar projects – there have now been five Many Labs projects, not to mention Many Babies, Many EEGs, and Many Analysts – each was a one-off. A central team would set the agenda, propose the project, and put out a new call for collaborators. The vision for the Accelerator was for it to be more of an ongoing, sustained, and self-organising institution. That required a new process for making decisions.

Members of the Accelerator are able to put forward project proposals for consideration by the collective. ‘It’s similar to journal peer review’, Chartier explains. ‘We do an initial check. At a base level, does this seem up to snuff? Is it something we should even look at? Why is it that this question really needs the Accelerator? Is it feasible within the network and resources we have?’ If it passes that initial screen, the proposal is then sent to around 10 peer reviewers, including members of the Accelerator and external experts. It’s also shared with the full membership who are asked to rate the proposal. Is it a strong project? Is it a good fit for the Accelerator? Would they run the study in their lab? ‘The committee then takes those and essentially acts as an editor’, Chartier says. ‘Collates those. Makes a decision.’ Thus far, six projects have been accepted. Five (including the stereotype threat study) are in progress. But the first has now been completed.

A model of the whole world
Lisa De Bruine is a professor at the University of Glasgow and a pioneer of online psychology experiments. With her colleagues Ben Jones and Jessica Flake, she has been leading the Accelerator’s first study, investigating the judgements we all make when we encounter new faces. ‘When we look at somebody, we make these social judgements immediately’, De Bruine explains. ‘We make category judgements like what sex they are, what age they are. But also, do they look nice? Do they look dominant? Do they look like they want to hurt me?’ These social judgements, it transpires, are far from accurate. They’re not reflected in the person’s actual personality. And that, essentially, is the point of the research. ‘They’re extremely not reliable’, De Bruine says. ‘But we’re all making them all the time. And they influence our behaviour in consistent and important ways.’

Facial appearance, she notes, can affect whether you are hired for a job, how long you’re sentenced for a crime, and, of course, whether someone wants to be in a romantic relationship with you. Once we get to know someone, their appearance matters less, but those first visual impressions influence who is given that chance. ‘Many of these biases are totally unfair’, De Bruine says. ‘So figuring out how we get around them means we need to understand them better.’

Like the stereotype threat project, the face perception study is based on a classic experiment. In 2008, Nikolaas Oosterhof and Alexander Todorov at Princeton University in the United States asked undergraduate students to rate a series of faces for 13 different traits – their attractiveness, confidence, intelligence, for example. The researchers’ analyses indicated that faces were judged along two main dimensions – how trustworthy the person looked and how dominant. When De Bruine and her colleagues in Scotland replicated the experiment they found the same result. But then her student Hongyi Wang ran the study in China and found a subtly different pattern. Faces were judged for trustworthiness but not dominance. ‘It was more like intelligence or competence was the other dimension’, De Bruine says. Independently, a group of researchers at the University of York found similar results with Chinese participants. It appeared that those initial results in Western countries might not be representative of all cultures. ‘We really need a model of the whole world’, De Bruine says.

To achieve that goal, she and 242 other members of the Accelerator have replicated Oosterhof and Todorov’s experiment on a global scale, testing almost 11,500 participants from 41 different countries. The results, accepted for publication in Nature Human Behaviour, are remarkably consistent over much of the world. Not only in the USA and Canada, Australia and New Zealand, and every region of Europe, but also in Central America and Mexico, South America, the Middle East, and Africa, people rate faces according to their trustworthiness and dominance – the pattern identified in Oosterhof and Todorov’s original study. This was also true in Asia, which would appear to contradict those earlier findings from China. However, as De Bruine notes, the Asia sample includes data from a number of countries other than China.

The results, though, are perhaps less important than the practical lessons the team has learned. As the first Accelerator project, De Bruine’s study has run up against numerous challenges inherent in large-scale distributed data collection. For example, the team had to ensure that data were collected in a consistent way across the many different labs. To this end, each participating lab was required to video themselves running a fake participant. The leadership team would then inspect the videos before giving the lab the go-ahead to collect data. Then there was the fact that the experimental materials all needed translating into different languages. That alone took dozens of translators and involved extensive translation back and forth from English to ensure that the different language versions were comparable.

Most challenging, she says, was nailing down the details of the study. Consistent with the Accelerator’s open science principles, the team preregistered the study, describing in detail the plan for recruiting participants and collecting the data, the criteria for excluding participants who were not performing the task properly, and the protocols for analysing the data. Thinking ahead like that can be difficult at the best of times but massive collaboration throws up new challenges. How should participants be recruited to ensure meaningful comparison across labs? How should the data from the different labs and different countries be combined?

‘Putting together data from hundreds of individual experiments in a reproducible way is very challenging’, De Bruine says. ‘I thought it would be straightforward to just run the code I wrote on the data, but things that seem minor when you’re running a small study in your own lab, like excluding the test run you did at 9am the day before you started data collection, become much more difficult to document when you’re dealing with dozens of labs, thousands of participants, and hundreds of thousands of data points.’

Having started later, the stereotype threat project is at an earlier stage but has also encountered challenges. The most significant so far has been negotiating ethical approval at all the participating institutions. Black students are considered a special research population, Lewis explains, and ethics review boards differ in what they consider best practice for recruiting and working with special populations. The challenge, he adds, is not only gaining approval for the study but thinking about how differences in recruitment strategy might affect results from different labs and how the data should be analysed to take that into account. ‘There are all these things that come up when you’re doing this large scale collaborative research’, Lewis says. ‘Ultimately I think it moves us towards a more generalisable understanding of the phenomena. But it takes a lot more logistical work up front.’

Next steps
From a blogpost and a tweet less than three years ago, the Psychological Science Accelerator has grown into an organisation with 760 researchers from 548 laboratories in 72 different countries. Although all the continents (with the exception of Antarctica) are now represented, it remains the case that most of those labs are in North America and Europe, with only one of the six projects – a study led by Sau Chin Chen of Tzu-Chi University in Taiwan – having a leadership team outside of those Western countries. Making psychology research truly global and representative is one of the key objectives of the Accelerator, De Bruine says. But it’s important for the ideas as well as the participants to come from other parts of the world. ‘We don’t want to be an organisation that is exploiting labs in underrepresented countries to collect data for us’, she says. ‘We’re committed to keep looking at ourselves to make sure it’s not just a Western driven organisation.’

Now that they have completed the first Accelerator project, De Bruine and her colleagues are already thinking about the next step. ‘Our project was specifically designed to test out the capacity of the Accelerator’, she explains. ‘We wanted to design a straightforward study that would clearly benefit from a cross-cultural sample of the size that we could achieve with the Accelerator. But I think we can do even more exciting things now.’ One idea she is keen to explore is having teams of researchers devise different methods for testing the same hypothesis to see whether they converge on the same conclusion. ‘I think the power of the Accelerator to do really informative research on a large scale is unmatched by anything else in social psychology right now’, she says.

Lewis’s ambition for the Accelerator is to eventually move beyond computerised lab-based experiments. ‘Right now’, he says, ‘we’re really set up to do web-based experiments where you can give everyone the same link and then they can run that on their campus. But so much of human life is beyond computers.’ In his work outside of the Accelerator, Lewis is looking at doctor-patient interactions in health clinics, but it’s unclear, he says, how well his findings generalise. ‘I really don’t know if what I’m finding is just about New Yorkers’, he admits. ‘I would eventually want to do that research in multiple settings. So again you can learn how much of our findings are about this particular clinic, this city, urban versus rural settings, all these other dimensions that in our gut we know matter. Finding ways to build these large-scale collaborative networks is the only way to test that.’

The ultimate goal, he argues, is to create research infrastructure that allows predictions to be made about how different individuals will think, feel, and behave in different contexts. ‘We can develop interventions and we can know precisely where those interventions would work and where they would not’, Lewis says. ‘This would not only advance our practical knowledge… it would really refine our theories.’

For Chartier, the most pressing concern is funding. Unlike its namesakes in physics, the Psychological Science Accelerator is currently run on a shoestring budget. A Patreon crowd-funding account raises a few hundred dollars a month to help labs in underrepresented countries pay participants. But otherwise, the entire enterprise is run on the voluntary efforts of its members with expenses covered by the participating labs. In Chartier’s ‘Utopian view’ the Accelerator would integrate the entire scientific workflow. Members would submit a research proposal and, if approved, the Accelerator team would then seek funding for the project together. The Accelerator could also have its own journal for peer review and publication of its projects and archiving of the data.

‘Right now’, Chartier says, ‘we’re figuring out processes, getting all the hiccups along the way. These are like little fledgling studies. But eventually I think we could get to the point with the Accelerator where the studies we conduct end up being the biggest difference-makers, the highest impact studies where we consider them the highest evidential value, getting truly global evidence. I think we’re going to look back and think that these 12,000 participant studies are quaint, somehow small.’

-  Dr Jon Brock is a Freelance Science Writer

Illustration: Sofia Sita
