How are mass collaboration and open data changing the ways we do social science? While we’re used to thinking about data science as a major impact of computation on the study of human behavior, changes in scientific collaboration may be driving even more fundamental shifts in how that study gets done. In recent years, for example, the Reproducibility Project: Psychology coordinated over 270 authors to attempt replications of published findings in psychology. The working method they developed could dramatically improve the quality and rate of experimental knowledge, even as it revealed serious weaknesses in the slower single-study approach that has been more common.
One leader in this transformation of the social sciences is Calvin Lai (@CalvinKLai), a social psychologist and post-doctoral fellow at Harvard’s Safra Center, where he takes “big science” approaches to answering important questions about psychology. Calvin joined The Berkman Center’s Cooperation Working Group this April to talk about his big science approach. I was excited to learn from Calvin, since my own PhD focuses on creating open, replicated knowledge on social behavior online, with an emphasis on replicating studies on the effects of community moderation. Here are my notes from our conversation:
Calvin opens by pointing out that as science has progressed, our questions have become more complex. People are specializing in specific questions that require high levels of expertise, or they’re realizing that there are big problems that a single lab at a single university can’t tackle, and that even a few labs working together can’t. How can we tackle these problems, asks Calvin? One solution is to bring scientists together to pool resources and creativity to do things they couldn’t do alone.
When people talk about big science, we often think about the Large Hadron Collider at CERN, where literally over 10,000 scientists from hundreds of universities are all working to learn something about the nature of matter. But that kind of collaboration is uncommon outside the so-called hard sciences, Calvin tells us.
Big Science Versus Big Data: The Value of Carefully Designed Studies
How might the big science approach help us answer questions about how people think and act, and what they do? Within the social sciences, we often talk about big data, where a small group of researchers looks at a lot of data, and a lot of that research happens within companies and on social media sites. But one limitation is that big data is very noisy, and its variables are often only indirect proxies for what researchers care about. Answering questions that depend on precisely-measured variables requires a different approach.
Finding the Best Ways to Reduce Racial Bias with Big Science
Calvin shares two examples of big science. The first is a research contest to change people’s responses to race, a study of implicit bias. What is implicit bias? These are thoughts swirling around our minds, beyond our explicit control, that influence how we behave. We might experience one as a gut feeling, or as a moment when we catch ourselves before saying something. Over the past 30 years, psychologists have tried to figure out how to capture those processes in a bottle. We can’t just ask people what they think; we need to assess it indirectly through other measures. The most popular measure is the Implicit Association Test (IAT), in which you play a sorting game. The speed at which you sort things tells researchers how those things are linked in your mind.
Concepts that are more closely linked in memory are easier to pair, so people tend to be faster and more accurate at sorting them. That allows Lai and colleagues to study the associations that people make with race. Most whites and Asians show a pro-white bias. For example, in one study, 72% of white Americans who visited the Project Implicit site showed a pro-white bias. The Project Implicit site was started in 1998, and over the last 17 years it has collected data on a wide range of implicit biases from over 50 countries. These biases seem to be predictive of discriminatory behavior, whatever people might explicitly believe.
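To make the scoring concrete, here is a minimal sketch of the D-score idea behind the IAT: the gap in mean sorting latencies between the two pairings, scaled by the variability of all responses. Everything below is a hypothetical illustration, not Project Implicit’s actual pipeline, which also handles error trials, outlier latencies, and multiple test blocks.

```python
from statistics import mean, stdev

def iat_d_score(compatible_ms, incompatible_ms):
    """Simplified D-score: difference in mean latencies between the two
    sorting conditions, divided by the standard deviation of all trials.
    (Illustrative only; the published scoring algorithm also treats
    error trials, outlier latencies, and multiple blocks.)"""
    all_trials = list(compatible_ms) + list(incompatible_ms)
    return (mean(incompatible_ms) - mean(compatible_ms)) / stdev(all_trials)

# Hypothetical sorting latencies (milliseconds) for one participant
compatible = [612, 580, 655, 601, 590]    # association-consistent pairing
incompatible = [734, 790, 701, 762, 745]  # association-inconsistent pairing
print(f"D = {iat_d_score(compatible, incompatible):.2f}")  # larger = stronger association
```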
Normatively, we might want to try to change people’s implicit racial biases, and researchers have tested hundreds of ways to do so. We know it can be done. But if a policymaker came to researchers and asked, “how can we do this?”, it wouldn’t be possible to give a straight answer: individual researchers follow their own paths, and the effectiveness of their different interventions can’t be directly compared. Calvin wanted to know: what’s the best way to reduce implicit racial bias?
Coordinating Mass Collaboration with Research Competitions
To find out, Calvin created a research competition. He wrote to researchers around the world, asking them to suggest their best ideas for reducing implicit bias; 24 researchers suggested 18 different interventions, with the idea of testing them all at the same time. Why a research challenge? Calvin pointed to the Ansari X Prize, which organized engineers to create a private spacecraft, an initiative that kickstarted the private spaceflight industry. One reason it was so successful was that it identified a single problem that a lot of people knew was important, pooled resources, offered an incentive, and asked people to work on that problem. Another example is the Netflix Prize, which asked the public to create the best algorithm for matching people to movies.
Calvin wanted to apply that to the important public issue of racial bias. The prize was author order: Calvin would be first author, and the rest of the authors would be ordered by the effectiveness of their interventions. They ran the experiments with roughly 17,000 non-white participants on the Project Implicit site, giving teams a chance to revise their experiments between rounds.
Researchers tried four overall approaches, all of which had to fit within 5 minutes. The first was to offer counter-stereotypes. The second was to change how people take control over their biases, for example by giving them a strategy for overriding those biases. A third asked people to reflect on their values. The fourth asked people to take the perspective of a black person. In a final intervention, they taught people how to fake the test, as a benchmark for gauging the other interventions. In other words, they collected a lot of great ideas that sounded plausible, all of which had some evidence behind them. By pooling researchers, they were able to show what actually works.
Calvin shows us a chart of the results. Half of the interventions were effective, and the other half were not. That was shocking, says Calvin, since they had asked researchers to suggest their best ideas. The first run of the experiment was even worse: only one in five worked. After further iteration, teams refined their interventions to the point where half were effective.
Overall, interventions that involved perspective-taking or reflecting on one’s values tended not to work at all. In contrast, interventions that involved controlling bias or exposure to counter-stereotypes were the most effective. There was also a lot of variability between the weakest and most powerful of the effective interventions, which reduced bias by anywhere from a quarter to a half.
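As a rough illustration of how “effectiveness” can be compared across interventions, here is a sketch that ranks conditions by Cohen’s d against a control group, the same kind of ranking that determined authorship order in the contest. The condition names echo the approaches above, but every number is invented.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(intervention, control):
    """Standardized difference in mean IAT scores between two groups."""
    n1, n2 = len(intervention), len(control)
    pooled = sqrt(((n1 - 1) * stdev(intervention) ** 2 +
                   (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2))
    return (mean(control) - mean(intervention)) / pooled  # positive = bias reduced

# Hypothetical post-intervention D-scores (lower = less implicit bias)
control = [0.62, 0.55, 0.71, 0.58, 0.66, 0.60]
conditions = {
    "counter-stereotypes": [0.31, 0.28, 0.40, 0.35, 0.30, 0.33],
    "perspective-taking":  [0.60, 0.57, 0.69, 0.55, 0.64, 0.61],
}
for name in sorted(conditions, key=lambda k: cohens_d(conditions[k], control),
                   reverse=True):
    print(f"{name}: d = {cohens_d(conditions[name], control):.2f}")
```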
Do Bias-Reduction Efforts Last?
The second phase of the contest asked whether the effects of the 9 successful interventions lasted after a 24-hour delay. To test this, they convinced 18 universities to participate; the more participants a university contributed, the higher its researchers appeared in the authorship order. The answer is: no, the effects don’t last at all. They dissipate very quickly. This has been nerve-wracking, says Calvin. They took 100 interventions from the literature, selected the top 18, found that only half of them worked, and then found that none of those had any effect after one day.
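A toy version of that persistence check: compare the treatment-control gap immediately after the intervention and again after the delay. The numbers below are invented to mirror the pattern Calvin describes, where an initially large reduction all but vanishes within a day.

```python
from statistics import mean

# Hypothetical mean IAT D-scores by condition and time of measurement
scores = {
    ("control", "immediate"):      [0.64, 0.58, 0.70, 0.61],
    ("intervention", "immediate"): [0.33, 0.29, 0.38, 0.31],
    ("control", "after 24h"):      [0.63, 0.60, 0.68, 0.62],
    ("intervention", "after 24h"): [0.59, 0.61, 0.64, 0.57],
}

for when in ("immediate", "after 24h"):
    gap = mean(scores[("control", when)]) - mean(scores[("intervention", when)])
    print(f"{when}: bias reduction vs. control = {gap:.2f}")
# With this toy data, the immediate gap is large (0.30) but the
# 24-hour gap has nearly vanished (0.03).
```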
Mass-Replication of Research
Next, Calvin talks about the Reproducibility Project. One of the foundations of science is reproducibility: if we do a study, we expect that someone else who does it again should get the same effects. That’s the hope of generalizable knowledge. Unfortunately, there are roadblocks to reproducibility, and it doesn’t happen very often. Incentive structures discourage replication by individuals, and publishing incentives can reduce replicability, since journals are often looking for novelty. Basically, the lack of replications is a cooperation failure: individually, it makes sense for each researcher not to invest resources in replication, even though the field as a whole needs it.
The Reproducibility Project sampled 100 studies from three top psychology journals and tested whether doing them again produced the same effects. Instead of harnessing scientific resources to create new ideas or pool participants, they were repeating what came before. Doing 100 replications is no small task. Calvin ran two of the replications, and students in a class at Stanford each took on a study. They ended up with around 270 authors. The goal was to get a representative sample of the field.
Calvin shows us the effects from the original studies. Of the 100 original studies, 97% had a statistically-significant effect. When the replications were done, only 37% had a significant effect, even though the replication samples were often much larger than the originals. The effect sizes were also much smaller on average. Perhaps the foundations of psychological science are not as firm as we thought, Calvin wonders: what’s in the published literature may not be representative of what’s robust and reliable. Now another group is running a reproducibility project for cancer biology.
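To see where headline numbers like these come from, here is a sketch that tallies significance rates and effect-size shrinkage across (original, replication) result pairs. The four pairs below are invented; the real project ran this kind of comparison across 100 pre-registered replications.

```python
# Hypothetical (original, replication) result pairs
studies = [
    # (orig_p, orig_effect, repl_p, repl_effect)
    (0.010, 0.48, 0.030, 0.21),
    (0.040, 0.35, 0.410, 0.08),
    (0.020, 0.52, 0.220, 0.15),
    (0.001, 0.60, 0.004, 0.44),
]

ALPHA = 0.05
orig_sig = sum(op < ALPHA for op, _, _, _ in studies)
repl_sig = sum(rp < ALPHA for _, _, rp, _ in studies)
shrinkage = sum(rep / orig for _, orig, _, rep in studies) / len(studies)

print(f"original studies significant: {orig_sig}/{len(studies)}")
print(f"replications significant:     {repl_sig}/{len(studies)}")
print(f"replication effects average {shrinkage:.0%} of the originals")
```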
Summing up the Benefits of Big Science
Calvin concluded with a few quick takeaways about large-scale cooperation:
- it can help you solve bigger and bigger problems
- it can spur innovation
- it can broaden participation in science. Someone from a small university might not be able to run a large lab, but they could contribute a small part to a larger effort.