Data Science for Gender Equality: Monitoring Women's Voices in the News

Can high-resolution data and innovative technology help us improve the representation of women in the news?

I believe so. Over the next year, my thesis project will explore designing for gender equity through a series of articles, artistic pieces, and technologies. I'm going beyond mudslinging and hand-wringing to apply technology in constructive ways that can make a difference. And I need your help.

Today in the Guardian Datablog, Lisa Evans shared some initial results from software I created to track gender in the news. This post explains my vision for the larger project, describes our research methods, and points to other resources on gender in the media.

Women's Voices in the UK and US

Women wrote less than a third of the articles in the Daily Mail, Guardian, and Telegraph from July 2011 through June 2012, according to data analysis I carried out with Lisa Evans and Lynn Cherny. In the US and UK alike, the overwhelming majority of public voices remain male, even though women are more likely than men to vote or sign petitions (UK pdf, US). Less than 25% of British MPs and only 16% of the US Congress are women.

Opinion sections, which shape a society's sphere of consensus and open opportunities for writers, are an important measure of women's voices in society. Women are more prominent in UK opinion pages than in American ones. According to Taryn Yaeger of The Op Ed Project, women write 20% of op-eds in America's newspapers. Across the UK papers we studied, the rate is 26%.

Opinion articles in three UK newspapers by gender, July 2011 - June 2012

The relative prominence of women's voices in our UK data is entirely due to the Guardian. Articles by women constitute around 20% of opinion articles in the Telegraph and Daily Mail, the same proportion as in US newspapers. In one year, the Guardian's Comment Is Free section included 11,247 articles, around 60% of our dataset. The Guardian publishes so many women (33%) that its opinion articles by women outnumber the Telegraph's male op-eds, as far as my algorithm can determine.

3.5% of the Guardian's opinion articles were written by a mixed gender team. I would love to see this category grow.

At this point, you probably have a lot of questions and concerns. How did we develop these numbers? Could there be errors in our approach? You might dispute our interpretation. Most damning of all, you might wonder what good this data really is. Are we trying to start a fight, or is there a real chance to use data to support constructive change?

Measuring Media Gender at High Resolution

Until now, anyone who wanted to measure gender in the news had to count articles by hand. This study by Kira Cochrane looks at 3,500 articles. The Global Media Monitoring Report covers around 16,000 news items across nearly 130 countries. Human "coding" of the media allows for detailed analysis of content, but projects like the GMMP can take years to complete. The software I have developed takes a few seconds to label an entire year of three UK newspapers. I'm also working on multi-decade datasets that include millions of articles, potentially across hundreds of media sources.

High-resolution, real-time data on gender in the media creates new opportunities to take practical action beyond wringing hands and pointing fingers. Here are some ideas:

  • News organisations can be empowered with gender insights at the rate of decision-making rather than just facing an occasional, depressing report
  • Integrating audience metrics with content demographics might highlight opportunities to find new markets
  • Advertisers and advocacy groups could pressure apathetic newsrooms to acknowledge the business disadvantages of paying limited attention to women

In case media justice doesn't lead to a pot of gold, what are the alternatives? In this area, The Op Ed Project is my muse. As Erika Fry explains in the Columbia Journalism Review, The Op Ed Project sets aside finger-pointing and uses data to direct constructive interventions. In the US, not enough women are submitting op-eds. The Op Ed Project finds and mentors awesome women in the topics where women are least represented. They offer support across the entire process of using a public voice in society, from opinion to opportunities and achievement.

With high resolution gender data, it's possible to:

  • Track the effectiveness of journalist mentorship programmes like the Op Ed Project
  • Match new writers with experienced journalists based on publication history (here, I'm drawing inspiration from NYTWrites by Irene Ros)

Here's an idea for men: Bill Thompson at the BBC has a lovely personal rule for conference panels. If you ask him to speak at an event with few women, he will decline until you find more. He will also helpfully suggest women who are awesome speakers. We could do the same for journalism. Imagine that a site like Journalisted shared gender data on the subject and sources of articles. When someone asks me for a quote, I could check the last time the journalist quoted a woman (see my related work on source diversity in the Arab Spring).

Numbers have limited power to convince. My research assistant Sophie Diehl has been developing emotive data art that conveys the human side of gender inequality in the media. While democratic ideals and the business case for gender equity are important, Sophie and I are trying to fill an important gap by using data to foster empathy.

How You Can Help

This initiative is my Master's thesis, the major capstone project for my time at the MIT Media Lab. I'm actively looking for partners on this journey: newsrooms, advocacy organisations, digital artists, communications researchers, and data scientists. For updates, sign up for the project mailing list. If you want to partner with me, email me at natematias@gmail.com.

How do we measure gender?

For our Datablog project, Lisa and I started by downloading every article published by the Guardian, Daily Mail, and Telegraph from July 2011 through June 2012. The Guardian OpenPlatform was easy; Lisa wrote a Python script which fetched and stored data straight from their API. I scraped the archives of the other two papers. In total, we fetched over 300,000 articles.
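To make the method concrete, here's a minimal sketch of the kind of fetching script this involves, using the Guardian Content API's public search endpoint (parameter names follow the Open Platform documentation; this is an illustration, not Lisa's actual script):

```python
import requests  # third-party HTTP library: pip install requests

API_URL = 'https://content.guardianapis.com/search'

def fetch_guardian_page(page, api_key):
    """Fetch one page of Comment is Free results for the study window."""
    params = {
        'section': 'commentisfree',
        'from-date': '2011-07-01',
        'to-date': '2012-06-30',
        'show-fields': 'byline,body',  # request author and full text
        'page-size': 50,
        'page': page,
        'api-key': api_key,
    }
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    return response.json()['response']['results']
```

Looping `page` upward from 1 until the API returns an empty list pulls down the whole archive for the period.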

The date and section for each article were easy to access, even for the Daily Mail and Telegraph. Sections in the Guardian came from the "sectionId" tag. The Telegraph includes section headings in its archive, and the URL for every Daily Mail article includes a section name. For those two papers, I also wrote a script to extract the bylines (author names) and article text.
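For the Daily Mail, pulling the section out of the URL is nearly a one-liner, since the section name sits in the first path segment. A sketch (the article number below is a placeholder):

```python
from urllib.parse import urlparse

def daily_mail_section(url):
    """Return the section name embedded in a Daily Mail article URL."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else None

# daily_mail_section('http://www.dailymail.co.uk/debate/article-0000000/Example.html')
# returns 'debate'
```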

Our first Datablog post focuses on byline gender over time across papers. To estimate an author's gender, I am using a database of baby names from Anna Powell-Smith, who assembled it from the UK Office of National Statistics. Using Anna's demographic data, I can guess if a name is likely to be male or female. For example, the name Amelia is obviously female, while Jackie is ambiguous (Facebook uses the same technique to estimate gender and ethnicity). To fill in the gaps, I added the gender of some of the most frequent contributors. For example, "grrlscientist" isn't a very common first name.
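In code, the gender guess is just a thresholded ratio over the birth statistics. Here's a sketch; NAME_COUNTS stands in for Anna's ONS-derived table, and the counts shown are illustrative, not the real figures:

```python
# Hypothetical stand-in for the ONS baby-name table:
# lowercased first name -> (female_count, male_count)
NAME_COUNTS = {
    'amelia': (5129, 0),    # illustrative counts only
    'jackie': (310, 290),
}

def guess_gender(first_name, threshold=0.9):
    """Return 'female', 'male', or 'unknown' from birth statistics."""
    counts = NAME_COUNTS.get(first_name.lower())
    if counts is None:
        return 'unknown'     # name absent from the statistics
    female, male = counts
    total = female + male
    if total and female / total >= threshold:
        return 'female'
    if total and male / total >= threshold:
        return 'male'
    return 'unknown'         # ambiguous names like 'Jackie'
```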

Using this simple technique, my software is able to tag articles by byline gender: male, female, mixed, or unknown. Most articles with unknown gender come from the newswires: the author field is filled with something like "the associated press" or "press association." A much smaller number include ambiguous names or names which aren't in the UK birth statistics. "Unknown" also includes a very small number of articles with empty bylines.
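Putting it together, the byline tagger looks roughly like this (a sketch that reuses the guess_gender helper above; the wire-service list is abbreviated):

```python
import re

WIRE_SERVICES = {'the associated press', 'press association', 'reuters'}

def classify_byline(byline):
    """Tag a byline as 'male', 'female', 'mixed', or 'unknown'."""
    if not byline or byline.strip().lower() in WIRE_SERVICES:
        return 'unknown'
    # Split multi-author bylines on commas and 'and'
    authors = [a.strip() for a in re.split(r',|\band\b', byline) if a.strip()]
    # guess_gender() is the first-name lookup sketched earlier
    genders = {guess_gender(a.split()[0]) for a in authors}
    genders.discard('unknown')
    if genders == {'male', 'female'}:
        return 'mixed'
    if len(genders) == 1:
        return genders.pop()   # 'male' or 'female'
    return 'unknown'
```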

Are we measuring the right things?

Experienced media analysts may well be unimpressed by our initial blog post. Byline counts are the crudest way to measure gender in the news. Furthermore, section-based comparisons are fraught with problems: can we really say that the Guardian's "Comment is Free" is comparable to the Daily Mail's "Debate" section? It's much better to look into the content itself, studying topics, quotations, or the language that's used to refer to women. Because we have the full text of every single article, we can conduct queries on that text, extracting who's speaking, who's being spoken about, and who's being quoted. I'm also excited to see what emerges from our upcoming thematic analysis.
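To give a flavour of those queries, here is a toy quote-extraction pass. Real quote attribution calls for proper natural language processing; this regular expression only catches the common '..., said Jane Doe' pattern:

```python
import re

# Matches a double-quoted passage followed by 'said'/'says' and a
# two-word capitalised name, e.g. "We must act," said Jane Doe.
QUOTE_PATTERN = re.compile(
    r'[“"][^”"]+[”"],?\s+(?:said|says)\s+([A-Z][a-z]+\s[A-Z][a-z]+)'
)

def quoted_speakers(text):
    """Return the names credited with direct quotes in an article."""
    return QUOTE_PATTERN.findall(text)
```

Each speaker's first name can then go through the same gender-inference step we use for bylines.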

Content analysis alone can't tell us the role of women's voices in society. Voices need to be heard. For this reason, I'm most excited about a dataset we haven't yet released in full detail: social sharing information for every single article we collected. Getting the data was straightforward with the open-source Amo software by Knight-Mozilla Fellow Cole Gillespie. Amo can fetch all of the Facebook, Twitter, and Google+ sharing information for any web address. Using this data, we can draw conclusions about the reach of women's voices and the nature of audience demand associated with each news organisation.
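Once every article carries a byline gender tag and its share counts, comparing reach is a short aggregation. A sketch with made-up numbers standing in for the Amo output:

```python
import pandas as pd  # pip install pandas

# Hypothetical rows: one per article, with the byline gender tag
# and per-platform share counts (placeholder values, not real data)
articles = pd.DataFrame({
    'gender':   ['female', 'male', 'female', 'male'],
    'facebook': [120, 80, 300, 40],
    'twitter':  [45, 60, 210, 15],
})

articles['total_shares'] = articles['facebook'] + articles['twitter']

# Median shares by byline gender: a first cut at comparing reach
print(articles.groupby('gender')['total_shares'].median())
```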

Resources & Acknowledgements

This post is already too long; here are links to people and projects (please suggest more in the comments):