Designing The Numbers That Govern Wikipedia: Aaron Halfaker on Machine Learning in Large-Scale Open Production

How can we engineer open production at scale, and what can we learn from feminist critiques of technology that could help us achieve those goals? At the Berkman Center this Tuesday (video), Aaron Halfaker talked about the challenges of scaling cooperation, the values that motivate efforts to keep that cooperation going, and lessons from Feminist Science and Technology Studies for maintaining large-scale socio-technical endeavors like Wikipedia.

Aaron Halfaker is a computer scientist at the Wikimedia Foundation. Halfaker earned a Ph.D. in computer science from the GroupLens research lab at the University of Minnesota in 2013. He is known for his research on Wikipedia, including the decline of participation on the site, the role of automated accounts, and systems to eliminate vandalism while supporting new editors. Most recently, Halfaker built an artificial intelligence engine for Wikipedia to use to identify vandalism. Aaron’s work with Stuart Geiger and others has been a major inspiration in my own PhD research. I wrote about Aaron’s PhD project last June in The Atlantic and blogged Stuart Geiger’s Berkman talk in 2014.

Image: Revscore WP.jpg
Illustration by Mun May Tee, CC BY-SA 4.0

Aaron starts out by talking about his early experience as a Wikipedia contributor. Wikipedia is really big, with roughly 5 million articles in the English Wikipedia. To illustrate just how large it is, he shows us the list of lists of lists on the site, walking from sublist to sublist, to the point where you actually learn how to pronounce the name of an ancient Egyptian pharaoh. Wikipedia is also a wiki, a collection of documents that a wide range of people edit, and edit they do. Wikipedia has around 100,000 active volunteer editors, who contribute to a wide range of topics and communities.

Today, he promises to talk about three things: Wikipedia as a socio-technical system, critiques of algorithmic quality control, and infrastructures for socio-technical change, with a focus on the dangers of subjective algorithms.

These days, Aaron thinks about Wikipedia as a system that converts available human attention into output that looks like an encyclopedia. His work focuses on how this system manages inputs and outputs. As a researcher focused on computer supported cooperative work, Aaron looks at issues where social questions and technology questions are inseparable.

To illustrate the unique challenges of studying Wikipedia, Aaron talks about work by Robin Dunbar, who studied fishing villages and found that their social networks were limited to around 150 people. Wikipedia, on the other hand, has over 100,000 participants. As a researcher, Aaron looks at the ways in which the infrastructure of Wikipedia brings together large collections of people for the work they do together. You can’t just look at the people or just at the technology in order to understand activity at that scale. That’s why he calls them “socio-technical systems.”

The Five Main Subsystems of Cooperation on Wikipedia

Aaron next takes us on a tour of the specialized subsystems that facilitate cooperation on the site.

One set of systems is focused on work allocation. He cites Eric Raymond’s line that “given enough eyeballs, all bugs are shallow.” Visibility is critical to open collaboration, he says: we can get efficient contributions from people if enough people see things. His parallel is that “given that enough people see an incomplete article, all potential contributions to that article will be easy for someone.”

A second issue is regulation of behavior. This is one of the most well-studied questions on Wikipedia. As an introduction, he says that there are two kinds of norms: prescriptive norms are rules about how you should do things; descriptive norms come from across the community. If you want to propose a norm, you can write an essay, put it in front of the community, and the community might vote on making it a formal guideline or policy. If a norm gets introduced formally, it’s easier to enforce. One example of a formal guideline is the Wikipedia expectation of “verifiability” for contributions. He talks about his early work on the growth of informal regulations on the site, showing ways that people cite these norms.

Next, Aaron talks about the quality control systems on the site, which are focused on identifying and removing damage from the site. In addition to asking people to look at vandalism, the site has a fully-automated system for detecting vandalism. It’s fast but it can only catch a small proportion of vandalism. Next, contributions are reviewed by a semi-automated system that organizes people to review them in about 30 seconds. Finally, the organization has “admins” who have the capacity to ban vandals.

Wikipedia is built by a large number of people, so community management is also an important part of the system. Each day, around 6,000 people join Wikipedia in some way. The site has a system that tries to distinguish good-faith contributors from bots and vandals so that it can offer them meaningful support.

A final subsystem consists of the practices through which Wikipedia reflects and adapts; unlike most other platforms, Wikipedia is led by its users. And so Aaron is accountable to Wikipedia users and contributors in ways that the typical tech company employee won’t be.

The Rise and Decline of an Open Collaboration System
Aaron tells us about his early work on the growth of Wikipedia. In the early days of Wikipedia, everyone knew each other. But from 2004 to 2007, the site scaled dramatically, increasing to around 50,000 contributors. So the people running Wikipedia added quality control tools to maintain the site. Almost immediately afterward, participation on the site started to decline. What led to that?

To help us understand this, Aaron talks to us about research by Donna Haraway on scientists who were studying the same apes. The male-dominated science groups drew conclusions about reproductive competition and dominance, while feminist scientists drew conclusions about communication and social grooming. To make sense of this, Haraway developed a theory of “standpoints and objectivity” — depending on your standpoint, you might think that certain things are important to understand and come to different objectivities. By acknowledging this, Haraway argued, it’s possible to develop complementary objectivities that come from these different standpoints.

Standpoint Theory in Wikipedia

Standpoint theory can help us understand the problems experienced by the Wikipedia anti-vandalism system. Its designers came from the standpoint that Wikipedia is a firehose, that bad edits must be reverted, and that the effort spent dealing with those problems should be minimized. These systems led to a 90% reduction in effort and an increase in efficiency in reducing vandalism. But unfortunately, these new systems were forgetting to welcome the newcomers. In other words, the quality control efforts were overriding the community management goals. Building on the standpoint of newcomer welcoming, Wikipedia has been able to develop new initiatives like “The Coop” and “The Teahouse” dedicated to welcoming new contributors.

Aaron next tells us about a system called Huggle, which has been an essential line of semi-automated defense against vandalism on the site (I’ve written about it here). Huggle shows users possible vandalism and asks Wikipedians to rate a contribution as good or bad. When a rater clicks the “bad” button, the contribution gets deleted and its author gets a warning. Even three years later, the system doesn’t account for newcomers who have good faith but might have made a mistake or might be contributing in ways that other Wikipedians haven’t done in the past.

What is it about the technology that made it so stable that it didn’t adapt to new knowledge about its weaknesses? Aaron argues that the problem lies in the machine learning classifiers that sit underneath systems like Huggle. These systems look at all contributions and decide if the contribution is good, bad, or probably bad. All three quality control systems in Wikipedia have machine learning, and all of them have the basic standpoint that led to the decline of Wikipedia in the first place. Developing new machine learning classifiers is *hard*, and so people interested in developing new machine learning systems from different standpoints have struggled to complete them.
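The three-way decision described above can be sketched as a simple thresholding step on a classifier's output. This is a minimal illustration, not the code behind Huggle or any Wikipedia tool: the function name `triage` and both threshold values are hypothetical, chosen only to show how a single "is this damage?" score gets turned into good, probably bad, or bad.

```python
from typing import Literal

Verdict = Literal["good", "probably bad", "bad"]


def triage(p_bad: float, bad_at: float = 0.95, review_at: float = 0.5) -> Verdict:
    """Map a damage probability to one of three verdicts.

    `bad_at` and `review_at` are hypothetical thresholds for illustration;
    they are not values used by any real Wikipedia system.
    """
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must be a probability between 0 and 1")
    if p_bad >= bad_at:
        return "bad"  # confident enough to treat as vandalism
    if p_bad >= review_at:
        return "probably bad"  # uncertain: queue for a human reviewer
    return "good"  # leave the contribution alone


# Three hypothetical contributions with made-up scores.
for score in (0.99, 0.60, 0.10):
    print(score, "->", triage(score))
```

Note that the only question this sketch can ask of a contribution is how likely it is to be damage. Nothing in the score distinguishes a vandal from a newcomer who made an honest mistake, which is the single standpoint Aaron is describing.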

Aaron is working on what he calls “progress catalysts,” things that help people get started with alternative initiatives with less effort required. So he’s been working on centralizing “revision scoring” — the work of identifying exactly what is in a particular contribution. That’s the thinking behind the ORES system, which offers automated measures of particular edits.

Aaron is working to extend the ORES system so that people can evaluate the quality of an edit, moving beyond Good/Bad classification to include “reverted,” “damaging,” and “goodfaith” machine learning models, hopefully offering a “progress catalyst” for others to develop their own responses to problems on the site.
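To make the value of separate models concrete, here is a minimal sketch of how a tool builder might combine a “damaging” score with a “goodfaith” score. It does not call ORES or reproduce its API: the `EditScores` class, the `route` function, the 0.5 threshold, and the three outcomes are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditScores:
    """Hypothetical model outputs for one edit (not real ORES responses)."""

    damaging: float  # estimated probability that the edit harms the article
    goodfaith: float  # estimated probability that the editor meant well


def route(scores: EditScores, threshold: float = 0.5) -> str:
    """Choose a response using both scores instead of a single good/bad label."""
    is_damaging = scores.damaging >= threshold
    is_goodfaith = scores.goodfaith >= threshold
    if not is_damaging:
        return "accept"
    if is_goodfaith:
        # A well-meaning mistake: fix the article, but support the newcomer.
        return "revert and offer help"
    return "revert and warn"


# Made-up scores for three hypothetical edits.
print(route(EditScores(damaging=0.9, goodfaith=0.8)))
print(route(EditScores(damaging=0.9, goodfaith=0.1)))
print(route(EditScores(damaging=0.1, goodfaith=0.9)))
```

With only a Good/Bad label, the first two edits would be handled identically. Keeping the two judgments separate is what lets someone build a tool from the newcomer-welcoming standpoint without first training a classifier of their own.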

Subjective Algorithms and Feminist Critiques
Citing Zeynep Tufekci, Aaron talks about how algorithms are increasingly making subjective decisions in cases where there may not be a right or wrong answer. Responding to Zeynep, Aaron talks about debates on Wikipedia where machine learning systems might learn cultural biases from Wikipedians and then keep out contributions from people who have alternative views. The Wikipedia:Labels initiative helps people review and correct this problem of feedback loops. Aaron also talks about other feminist-inspired work, where he’s moved away from building specific tools for people. Now, as an employee of the Wikimedia Foundation, Aaron focuses on developing infrastructures that make it easier for other people to do their own tinkering and ask their own questions.

Erhardt asks how Aaron handles the process of people proposing things that could go into the machine learning systems. Answer: Aaron often gives this talk at conferences and hackathons. Rather than ask Wikipedians directly what features should go into the models, he asks them to describe their backlogs. He also reaches out to Wikimedians who are central to communities across a wide range of languages.

Another participant asks how the social networks of Wikipedia participation are changing over time. Aaron talks about all of the implicit interactions that people have, even when people aren’t talking directly to each other. He’s been developing efficient data structures that help people collect data and ask questions about interactions on the site.

Question: you showed us an undesirable feedback loop. How could positive feedback loops be identified? Aaron talks to us about the WikiCredit project. Often, subject matter experts choose not to participate in Wikipedia because they can’t necessarily be credited for their contributions. For example, if you edit articles that get many page views, it might be valuable, but the article on Breaking Bad gets far more views than the article on Chemistry. Halfaker is trying to work on a set of measurements that capture aspects of importance and that might help people evaluate their work in meaningful ways. He wants to create recommendation systems that will help people find articles where they could have impact. He also wants a way for academics to be able to point to the impact of their contributions to Wikipedia.

Erhardt asks about the biological metaphor that Aaron uses: we know that there are situations where immune systems go haywire. Is the standpoint problem part of the issue? Might the machine learning systems flash crash like algorithmic trading? Aaron agrees that these are risks as well, and for that reason, he’s curious to think about the immune systems within complex systems. He’s looking for immunologists who are

I asked Aaron Halfaker about the way that he’s taken into account critiques from feminist researchers when designing new quantitative systems. Is there a way to incorporate qualitative methodologies into the work that he’s doing? Aaron talks about the importance of finding collaborators who aren’t like him and listening to people who may not even be expert researchers.