
'If we don’t solve the incentives problem, we will become a very narrow discipline'

Tomasz Witkowski meets Brian Nosek, in an edited extract from Tomasz's new book of interviews.

26 October 2020

This is an edited extract from ‘Chapter 13, Brian A. Nosek: Open Science and Reproducibility Projects’ in Shaping Psychology: Perspectives on Legacy, Controversy and the Future of the Field by Tomasz Witkowski (Palgrave Macmillan, 2020).

Prof. Nosek, you’re one of the world’s most well-known advocates of the open science movement, engaged in improving transparency in science. What led you to go beyond your research on implicit cognition and engage to such a significant degree in improving research methods and collaboration between scientists?

My interest in these issues started in graduate school, when I took Alan Kazdin’s research methods course. This was around 1997, and he had us reading papers from the 1960s and 1970s by Bob Rosenthal, Tony Greenwald, Jacob Cohen and Paul Meehl, where they articulated challenges like publication bias, lack of replication and low power, and outlined solutions: for example, let’s try to do more replications, let’s increase the power of studies. Even preregistration was mentioned in some of these articles. It was shocking as a grad student in the late 90s to realize that methodologists had been outlining these problems and solutions for 30 years, yet nothing had changed. Why is that? So, like many graduate student cohorts, we would have discussions after lectures or meet at the bar and talk about how we would change the system if we could. And, of course, we didn’t have the opportunity to make any substantial changes at the time, but the interest in these issues remained at the core of how I thought about the research we did in our laboratory.

So we started by trying to address the power of our designs. We created a website for collecting data about implicit biases, which became very popular, and we achieved very high-powered tests of the kinds of questions we were investigating. When I became a faculty member, we made it a routine practice to share as much of our materials and data as we could on our personal websites. And when services came up, we tried to adopt things that would improve our sharing. In the mid-2000s I started to write grants to the National Institutes of Health and the National Science Foundation to create what we called at the time an open source science framework. We had been a technical lab for a long time, we operated this website for collecting data, and we thought it would be useful to have a service that would make it easier to share that data and for others to use it as well. But we couldn’t get it funded then, because the reviews were so mixed. Some said this was very important, a necessary change; others said that people don’t like sharing data. It just wasn’t the right time. But as a laboratory we had a general interest in improving the process of our own work, building tools for the technical portion of what we do, and making it easier for others to do it too.

Then, in 2011, a lot of these methodology issues became of broad interest to the research community, because of the Diederik Stapel fraud and because of very surprising results being published in leading journals, like Daryl Bem’s ESP work. And then a paper in Psychological Science, “False-Positive Psychology” (Simmons et al., 2011), really crystallized for many people how we have practices whose implications for the reliability of our results we don’t really understand. The authors provided a rhetorical tour de force of how that happens, which helped people to understand those behaviors and their consequences.

So all of that was happening around the same time. Part of it was failures of replication: initial studies became public, and people failed to replicate them. As a laboratory, we had been thinking and talking about a lot of these issues for a long time, but we didn’t have anything concrete, no database of evidence on the replicability of findings. We had individual studies that we failed to replicate, but that happens all the time, and you can’t really tell if there’s anything systematic. We decided to think about how to start a collaborative project, to see if we could replicate a meaningful sample of studies in the field. So we had this replication project, which turned into the Reproducibility Project: Psychology. Lots of people got interested very quickly, and it became a big project.

Here’s the background on starting the Center for Open Science: Jeff Spies was a grad student in the laboratory with a history as a software developer, and we were looking for ideas for his dissertation. We kept coming back to this old project idea of an open source science framework. And while it was unusual, he decided it was very much in line with his long-standing interests, so he started working on it as a dissertation project. So we had this replication project and this technical project, both of which we conceived of as lab projects, and both received some broad attention. We were self-funded, without grants or anything, but then they came to the public’s attention, and some funders started calling. Very rapidly we moved from interesting but small projects to thinking about doing things in a big way. Within a matter of two months we went from a small lab to launching the COS as a non-profit.

Why is it that only in the middle of the second decade of the twenty-first century have we openly started discussing the crisis in psychology, and in such a way that outside observers could conclude that its discovery came as a great surprise to scholars?

That is a great question, and I would love a good historical analysis to determine which of many possible reasons is the right one. In one sense, we’ve known about so many of these issues for a long time. The kindling for the reform movement has been there, accumulating for many years, so it had to happen sometime. Let’s say 2011 is the starting point. It could be an accident of history that this was the year it finally started. Another possibility is that all of that was accumulating, and some stimulating, singular events occurred that made it a lot easier to confront this at the scale we’re doing now: the Stapel case, Bem’s paper, the Simmons, Nelson and Simonsohn “False-Positive Psychology” paper. These individual cases captured a great deal of attention beyond just the people who care about methodology and had been having this conversation since the 1960s. They provided some stimulus to really ask “what are we going to do about this?” or “what is this, really?” The fact that they happened close together in time may also be a factor. Instead of just one event, they piled on top of one another, and it’s like a dam broke.

Another factor is that this issue is salient not only in psychology; it has been accumulating attention across other fields. Maybe, with the internet and social media, rather than just thinking that in my little field we have this problem, it is a lot easier to see outside our disciplinary silos and say, “my goodness, biomedicine, they’re having this problem too. Oh, economists, they’re having this problem. Hang on a second. Who isn’t having this problem?” Connecting those communities of people who care about these issues across disciplines may have facilitated collective action that made it a lot easier for the movement to become better, faster and more impactful. I suspect that it’s a complicated mix of many causes, but it’s fascinating that even though those things have been known for a long time, these events congealed into a real movement this time.

Despite all the problems we are talking about, some scientists are full of optimism. Are you yourself one of those optimists who believe we are at the threshold of a better tomorrow in our field?

Yes. I am optimistic. And the reason I am is that a sustainable culture change is growing. But this doesn’t mean it’s happening fast. There are a lot of people in this movement who are very discouraged. They think, “we’ve worked this out already, why are people still doing it wrong? Let’s just get it done.” But culture change is hard; changing people’s behavior is hard. We have a great literature showing how hard it is to make these changes. Let’s pay attention to the parts of our literature that we can trust and apply them as effectively as we can to the movement as it is occurring. There is good reason to be optimistic, because the core challenges are becoming very well known as problems, and it’s hard to be in our field and not be aware of them. That’s a big first step. Whether people will change their behavior or not remains to be seen.

The second reason for optimism is that training is changing. Methodologists care about these things fundamentally, and many are changing and updating their statistics and methods courses. That matters, because the training will stick with each generation as it comes through, and so the change spreads gradually.

The third reason for optimism is that the stakeholders (funders, publishers, societies, and institutions) are paying attention. Maybe not quickly, but they’re all changing their policies, their norms, and their incentives. Policy change is the best way to have sustainable change in the long term. The general shift from not requiring any transparency to encouraging or requiring openness fundamentally changes what comes through a journal. Incentives for preregistration, and badges that make it visible that other people are doing these behaviors, can have long-term, accumulated consequences in forming new norms. Finally, Registered Reports are now offered by more than 200 journals. The Registered Reports format eliminates publication bias by conducting peer review and committing to publication before the outcomes are observed. That is a fundamental change to the publication workflow, and it will have a lasting impact if it achieves broad adoption.

All of the critiques saying that the movement is not changing very fast are correct, but I don’t know if it can change faster than it is. The groundwork is being laid. The changes we’ve seen are not superficial. It’s not just that one researcher and one team did something, tried to replicate some findings, and that’s the end of it. Changes are happening in the structure of how we do our work, and that’s really what will help sustain it in the long term.

The seriousness of the crisis in psychology is often diminished by describing it as a replication crisis. We know, however, that it consists of many more problems than just the lack of replication. Which of them can be solved by the open science movement, and which should be overcome some other way?

Replication is the low-hanging fruit of improving research practices – here is a finding, here is a methodology, let’s try to see how reliable that methodology is. From this point of view, the replication part of the movement helps stimulate attention to the broader issues of how we can make research progress as quickly as possible. Replication doesn’t solve problems like construct validity, attentiveness to how our theories get refined and formalized so that they are testable, or connecting our theories with our operationalizations and inferential tests. There are big challenges in how we reason, how we accumulate evidence, and how we combine that evidence into theories in science. Open science doesn’t solve these, but open science is an enabler of pursuing solutions to these challenges.

We have a lot of maturing to do in how we develop theories and how we make connections. But we can’t do that work without open evidence, without transparency about how we arrive at our claims, without better sharing, without materials being open and accessible, and without some replication to test our theoretical positions in a more formalized way. To me, the movement, at least in psychology, is finishing phase one, where the major theme was replication. I think it’s entering, or is midway into, phase two, which is about generalizability. It’s not just about replicating a finding in one context, but about the breadth of where we can see that finding: across different operationalizations, across different samples, across different conditions of the experimental context. Next, I think, it’s going to enter a phase focused on measurement and theory: now that we see what is reproducible among our basic claims, how can we better connect this evidence to more theoretical claims?

I see one more problem plaguing contemporary psychology, one that probably cannot be solved by the open science movement: the shift from direct observation of behavior, widely regarded as an advance in the development of scientific methodology, to introspection. This was demonstrated in an outstanding 2007 article by Baumeister et al., and recently confirmed by Doliński (2018), who replicated Baumeister’s investigations. Both articles show that over the last few decades, studies of behavior have become a rarity among psychologists. This issue is brought into sharper relief by the fact that the first of the two articles was published in the middle of the decade that the APA announced with great pomp as the Decade of Behavior. What are your thoughts on this issue?

This is interesting, and I agree that it can’t be thought about separately from what the open science movement is trying to solve. We have to consider the incentives that shape individual researchers’ behaviors. If the focus of my attention in research is to publish as frequently as I can, in the most prestigious outlets that I can, then I have to make practical decisions about the kinds of questions I investigate. Measuring real behavior, that’s hard. Sending out a survey, that’s easy. I can generate more publishable units doing that.

Or using Mechanical Turk.

Exactly. Researchers are already stressed about publication demands, and now the open science movement is increasing the pressure for higher power. That makes it even harder to do those behavior studies – now, instead of 50 people you need 500. This is a real issue: elements of the open science movement (increasing power, increasing transparency, increasing rigor) could be at odds with the goal of getting the research community less focused on self-report surveys and more diverse in how it measures human behavior. If we don’t solve the incentives problem, then we will become a very narrow discipline. In part, that may reflect how hard it is to study the things that we study. We have to recognize where we are as a discipline in terms of our tools and instrumentation, and what kinds of science they allow us to do. Then, we have to be realistic about what science can be done effectively with the available resources.

In some ways, we have tried to study questions that we just don’t have the technology, the ability, or the power to study effectively. If we take seriously the question of what we can productively investigate with the resources we have available, then we may recognize that a lot of the questions we want to study cannot be studied effectively the way we do science now. If those questions are really important, then we need to change how we do things. This is where another element of the open science movement offers part of the solution: collaboration.

Right now, we are rewarded for being a vertically integrated system: I or my small team come up with an idea, design a study, collect the data, and write the report, all on our own. This requires lots of resources devoted to small groups. If we move to a more horizontal distribution, where many different people can contribute to a project, we can study many questions that we are presently not able to study well. There might be one team designing the study, and 15 teams contributing data. With that kind of model, if we get the reward system worked out, we could study some questions more productively and with adequate power to actually make progress on the problem.

The reform movement needs to attend to the effects of each change on the complex system of incentives and rewards: What happens when we increase expectations for power? How will that change what people study? How do we solve that? These systems are complex, and getting all the incentives aligned to maximize progress is hard.

There is another problem troubling psychology, one that you very diplomatically described in one interview as “conceptual redundancy”. We know that it is basically about cluttering our field with needless, often duplicate theoretical constructs, about unnecessarily creating and publishing new concepts for previously known and described phenomena. This conceptual redundancy is increasing at a rather alarming rate. What is your opinion about it?

I think it’s another illustration of a different part of the reward system. In psychology we value the idea that each person has their own theory, and their own conceptual domain that they study, linked to them and their identity. The consequence is that this incentivizes a splitting approach, where the same concept is studied by five different groups, each giving it a different label. Early in research, this can be useful: when we’re in a generative phase, we have no idea what the problem space is like, and we need various approaches to explore it. What we don’t have is the consolidation phase, where we say, “OK, there are seven different groups that have what seems to be the same kind of idea.” How do we figure out what actually is the same and where the differences are? That lack of consolidation leads to a very fractured and not very cumulative discipline.

The social challenge we need to address is that we’re individually tied to the words we use to describe psychological phenomena – I don’t want my perspective on self-enhancement to be combined with somebody else’s idea that is the same but uses a different word to describe it. The methodological challenge is to create occasions for similar ideas to be confronted with one another. And there is a great example, in development now, that I think will be useful as a prototype, and that I am hoping will become ordinary practice. The Templeton World Charity Foundation organized a group of neuroscientists who all have different theories of consciousness. The scientists were told: “you’re going to sit together in a room for three days, and you’re going to come up with experiments where at least two theories have different expectations of the outcome.” And of course, it took them two days of yelling at each other to even figure out what the differences in their theoretical expectations were. But if we can stay scholarly, this process can be very productive in providing some clarity about the actual similarities and differences between these theoretical perspectives. They did come up with two experiments for which the theories make different predictions, and now they have funding to run them. That is a great exercise, so I am hoping that initial prototypes like this will be widely disseminated and generate lots of attention for how a confrontation-and-consolidation process can counterweight the splitting process you described.

Is there any particular type of criticism aimed at the open science movement that you feel is particularly serious and justified?

I think it’s all serious and justified, unless it’s personal. The claims of the open science movement need to survive critique, just like all of the findings that are being critiqued in replications need to survive their critique. Let me give a couple of examples. There is a critique that preregistration will reduce creativity, make research more boring. I don’t have any evidence to show you that it will not do that. And we can generate plausible stories for how it could. For example, if I have to pre-commit, I might get more conservative, because pre-committing to crazy ideas might be embarrassing, particularly if they are wrong – which most will be – they are crazy ideas after all. But, some crazy ideas are totally transformative. Preregistration could induce risk aversion. If I don’t have to pre-commit and I can just do it, then maybe I’d be more willing to take risks because who cares if I am wrong.

Whenever we’re pursuing culture change, there is potential for unintended consequences. In fact, unintended consequences are functionally inevitable, because we don’t know all the consequences of our actions. If the open science movement did not have a skeptical audience constantly evaluating what happens when we make these changes, we would end up doing some things that are counterproductive for research progress.

So, I am very glad that there is positive engagement of skepticism with these changes. What I don’t like is when it gets personal, like the “methodological terrorists” kind of remark. That is completely unproductive. It is also unproductive when the skepticism is so strong that people don’t even want to try. The whole purpose of research is to try something and see what happens, and if we are so ideological that we say, “I cannot share the data because it will screw things up”, so I don’t even try to share, or “pre-registration isn’t even worth trying, because I am sure that it will reduce creativity”, that isn’t really engaging in research. Competing perspectives are valuable as long as we are studying them, learning something, and then figuring out how to do it better.

Fortunately, apart from problems, our field also has a lot of achievements. Which of the existing psychological discoveries do you consider to be the most significant breakthrough?

I can’t say confidently, because I’d have to review all of the literature to say which of those things are the most important. But what comes to mind as you ask the question is the astonishing progress that has been made in understanding visual perception, compared with what we understood about how the visual system works in 1900. This is a massive transformation. We’ve learned so much, and so much of that work has been applied to computer vision and other kinds of research applications, and to our practical understanding of the visual system in animals and in humans. I am a total outsider to it, but I love what I know of the work.

Kahneman and Tversky’s heuristics and biases is, I think, the most directly impactful on questions in the social-cognitive domain of judgment. That’s an obvious one to mention, but the degree to which we now understand motivated reasoning in the big picture, and the particular biases that occur in our reasoning, is incredibly important. What we need from that field is to grasp how we deal with those biases in everyday judgment and decision making. We need systems and solutions to address unwanted biases where they occur. It would be transformative for human behavior if we could solve these questions.

The last would be the areas where we have effective treatments for some mental health problems. The fact that we can address many types of phobias in a single session, or in eight hours of cognitive-behavioral treatment, is astonishing.

Each of these examples shows that basic psychological knowledge can be translated into an understanding of how we improve human behavior. 

Which projects are absorbing you at the moment and what are your plans for the future?

The main research project dominating our attention is called the SCORE project, funded by DARPA. We’re one of multiple teams involved, across three technical areas: TA1, which is us; TA2, which has two teams; and TA3, which has three. The goal of the project relates to artificial intelligence: to see if we can create automatic indicators of the credibility of research claims. When you open a paper, each of the claims in it could have a score next to it: this one has 72, this one has 15, that one has 99. The machines would give an initial calibration of the confidence we can have in the credibility of the claim. This is a high aspiration, but pieces of evidence suggest that there is information we could extract from papers, and from the research at large, to help us assess the credibility of different findings as an initial heuristic. How much other evidence supports these findings? How do they fit with other claims? What else bears on that particular claim or that particular study? It’s an exciting problem to try to solve, and the actual work that we’re doing in our team is extracting claims from the literature.

We took 60 journals and extracted a sample of 50 articles per journal for each year from 2009 to 2018, creating a database of 30,000 articles. We then took 10% of those as a stratified random sample and extracted a claim from each article, tracing the claim from the abstract to a statistical inference in the paper supporting it. That gave us a database of 3,000 claims. TA2 teams evaluate the credibility of those claims with expert judgment and prediction markets, giving each claim a score. TA3 teams are applying machines to try to assess the credibility of those claims; the machines give them scores based on whatever information they can gather.
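
The sampling arithmetic here is simple: 60 journals × 50 articles × 10 years gives 30,000 articles, and a 10% stratified sample of those yields the 3,000 articles from which one claim each is extracted. Purely as a minimal, hypothetical sketch of that stratified sampling step (not the SCORE project’s actual pipeline; the file name, column names and fixed seed are all assumptions), in Python:

```python
# Hypothetical illustration of a stratified 10% sample by journal and year.
# Assumes a CSV of article records with 'journal', 'year' and 'title' columns;
# this is NOT the SCORE project's actual code.
import csv
import random
from collections import defaultdict

SAMPLE_FRACTION = 0.10  # 10% stratified random sample, as described above
random.seed(42)         # fixed seed so the draw is reproducible

# Group articles by (journal, year) so every stratum is sampled at the same rate.
strata = defaultdict(list)
with open("articles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        strata[(row["journal"], row["year"])].append(row)

# Draw roughly 10% from each journal-year cell (at least one article per cell).
sample = []
for (journal, year), articles in strata.items():
    k = max(1, round(len(articles) * SAMPLE_FRACTION))
    sample.extend(random.sample(articles, k))

print(f"{len(sample)} articles sampled from {len(strata)} journal-year strata")
```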

While all that is happening, we organize a thousand people in a massive collaboration to do replications of the substance of those claims as the ground truth. We shall see if the people in TA2 and the machines in TA3 can predict successful replication or not. This project will generate very useful data to study many questions in metascience and replicability. It’s an extremely generative project, while simultaneously having a clear structure. And, the problem we are studying is an exciting problem.

The other thing we’re really interested in studying is whether the various interventions we have introduced to improve the research process are actually working. We are running studies, for example, on Registered Reports, to assess whether the format is meeting the promise that we theorize, and whether there are costs, like the problems of creativity or conservatism that might emerge. Those would be very useful data to help refine and improve the reforms – let’s keep doing the things that are working, and let’s change or stop the things that aren’t. These are my main areas of focus for the next few years.

While working on this book I asked my readers to submit one question they would like to ask an eminent living psychologist. I received 30 of them. Would you agree to draw and answer one?

Sure, give me number 15.

What psychological idea is ready for retirement?

I’d say the idea that psychology is so different from other scientific disciplines that we can’t use practices from other disciplines to inform how we do ours better. I think this is partly a conceit, and partly low collective self-esteem: “The reason people see us differently, and don’t respect us, is that we are totally different.” But we aren’t totally different. We are studying hard problems, perhaps the hardest among the sciences. But to become so defensive that we can’t look to other sciences to identify ideas and new practices to do ours better is a stance that I think needs to die.

References

365 days: Nature’s 10. Ten people who mattered this year. (2015, December 24). Nature, 528, 459–467.

Baumeister, R. F., Vohs, K. D., & Funder, D. C. (2007). Psychology as the science of self-reports and finger movements: Whatever happened to actual behavior? Perspectives on Psychological Science, 2(4), 396–403.

Cohen, J. (1962). The statistical power of abnormal-social psychological research. Journal of Abnormal and Social Psychology, 65(3), 145–153.

Doliński, D. (2018). Is psychology still a science of behaviour? Social Psychological Bulletin, 13(2). Retrieved from https://doi.org/10.5964/spb.v13i2.25025

Hunter, J. (2000). The desperate need for replication. Journal of Consumer Research, 28, 149–158.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Wolins, L. (1962). Responsibility for raw data. American Psychologist, 17(9), 657–658.