Creating Technology for Social Change

Tracking Twitter Accounts in the News

 

I recently worked with Nathan Matias on a project that I think reflects the modern relationship between journalism and social media: tracking how often news stories quote Twitter accounts. For example, during the Arab Spring, journalist Andy Carvin used Twitter as a source to cover breaking developments in Tahrir Square. Global Voices also uses social media sources extensively, as Nathan Matias charted in this post for PBS IdeaLab.

Inspired by these articles, we wanted to see if the trend continued. Were journalists continuing to quote people using information gleaned from their Twitter accounts? And if so, what kind of articles were these journalists writing? To that end, we created a web app that tracks the number of Twitter quotations over time and gives a sample of the type of articles with these quotations.

Project Development

Data Mining

We wanted to look at the trend of quoting Twitter accounts across all news stories. However, narrowing down news stories is hard. We couldn’t possibly canvass every news story. In that case, how could we decide what was a worthy news source?  And once we made that decision, how could we efficiently collect news stories from all of those sources?

To answer those questions, we turned to Media Cloud, a project out of the Berkman Center for Internet & Society that scrapes selected media sources in real time. Rahul, another researcher at the Center for Civic Media, wrote an API client that made using Media Cloud a snap.

Using Rahul’s client, we extracted all references to Twitter accounts in each story and stored them in our CouchDB database.
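To make the extraction step concrete, here is a minimal sketch of how Twitter accounts might be pulled out of a story's text. It is an illustration, not the code we ran: the function name `extract_handles`, the two patterns, and the `RESERVED` list of non-account paths are all assumptions, and it leaves out the Media Cloud client and the CouchDB storage entirely.

```python
import re

# Assumed pattern: "@" plus 1-15 letters, digits, or underscores. The
# lookbehind skips e-mail addresses such as bob@example.com.
HANDLE_RE = re.compile(r"(?<![A-Za-z0-9_@.])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")

# Assumed pattern for links of the form twitter.com/username.
LINK_RE = re.compile(r"twitter\.com/(?:#!/)?([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
                     re.IGNORECASE)

# Hypothetical list of URL paths that are not accounts.
RESERVED = {"share", "intent", "search", "home"}


def extract_handles(text):
    """Return the lowercased, de-duplicated handles referenced in text."""
    found = []
    for pattern in (HANDLE_RE, LINK_RE):
        for match in pattern.finditer(text):
            handle = match.group(1).lower()
            if handle not in RESERVED and handle not in found:
                found.append(handle)
    return found


story = ("As @acarvin tweeted, see https://twitter.com/natematias "
         "or mail bob@example.com. @ACarvin again.")
print(extract_handles(story))
```

Lowercasing the handles keeps "@ACarvin" and "@acarvin" from being stored as two different people.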

Visualization

In order to give a sample of the articles that quote Twitter accounts, we decided to simply show the most recently quoted people as well as all the articles they were quoted in.

 

 

Clicking the links brings you to the person’s Twitter account or the text of the articles they were quoted in. 

As you can see from the above screenshot, a lot of the Twitter accounts shown are those of journalists or news sources.  For example, one article — “White House Takes Eligibility Age Off The Table” — is attributed to the accounts of a journalist and a news source.  I’ll address this more in the “Wrinkles” section below.

We also created a graph using d3.js that shows the number of Twitter accounts that were quoted over the last 7 days. This graph will become more useful as we keep scraping data (right now it’s a bit sparse): we could then see the trend over the last 30 days or the past year, and we could also see whether coverage of an event (like the Arab Spring) relied heavily on Twitter accounts.
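The numbers behind a graph like this come from a simple aggregation. The sketch below shows one way to compute them, assuming each stored mention is a (date, handle) pair and that the graph counts distinct accounts per day; both the record shape and the function name `daily_quoted_accounts` are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_quoted_accounts(mentions, end_day, days=7):
    """Count distinct quoted handles for each of the last `days` days.

    mentions: iterable of (date, handle) pairs (an assumed record shape).
    Returns a list of {"date", "count"} dicts, oldest day first.
    """
    window = [end_day - timedelta(days=offset)
              for offset in range(days - 1, -1, -1)]
    seen = defaultdict(set)
    for day, handle in mentions:
        seen[day].add(handle.lower())
    # Days with no mentions are reported with a count of zero.
    return [{"date": day.isoformat(), "count": len(seen.get(day, ()))}
            for day in window]


mentions = [
    (date(2013, 1, 7), "a"),
    (date(2013, 1, 7), "A"),   # same account, different capitalization
    (date(2013, 1, 7), "b"),
    (date(2013, 1, 5), "c"),
    (date(2012, 12, 1), "d"),  # outside the 7-day window
]
series = daily_quoted_accounts(mentions, end_day=date(2013, 1, 7))
```

A list of date and count records is an easy shape to hand to a charting library, and changing `days` to 30 or 365 gives the longer-term views.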

Wrinkles

As you may have noticed in the previous section’s screenshots, we captured a lot of Twitter accounts belonging to journalists and news sources, instead of just the accounts of people who were quoted in the articles. This is because articles sometimes link to the Twitter account of the journalist.

We have two ideas for improving our method. One very rough approach is to look at how close each Twitter account is to quotation marks in the text. The other is to exclude Twitter accounts that seem to be quoted too often. For example, the New York Times Twitter account will appear in a lot of articles, so we can exclude it from our database.
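Both ideas are easy to prototype. The sketch below is only a starting point under stated assumptions: the function names, the 100-character window, and the 5% cutoff are all hypothetical, and neither heuristic has been tested against our data.

```python
import re
from collections import Counter

# Straight and curly double quotation marks.
QUOTE_CHARS = '"\u201c\u201d'


def near_quotation(text, handle, window=100):
    """True if some mention of @handle has a quotation mark within
    `window` characters on either side (the window size is a guess)."""
    pattern = re.compile(r"@" + re.escape(handle) + r"(?![A-Za-z0-9_])",
                         re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        if any(ch in QUOTE_CHARS for ch in context):
            return True
    return False


def frequent_handles(handles_per_story, max_share=0.05):
    """Return handles that appear in more than `max_share` of stories.

    handles_per_story: one list of handles per story. The 5% default
    is an arbitrary placeholder, not a tuned threshold.
    """
    total = len(handles_per_story)
    counts = Counter()
    for handles in handles_per_story:
        counts.update(set(handles))  # count each handle once per story
    return {h for h, n in counts.items() if total and n / total > max_share}


quoted = "\u201cWe are staying in the square,\u201d tweeted @protester."
byline = "Follow the reporter at @reporter for updates."
stories = [["nytimes", "a"], ["nytimes"], ["nytimes", "b"], ["c"]]
```

Here `near_quotation(quoted, "protester")` is true and `near_quotation(byline, "reporter")` is false, while `frequent_handles(stories, max_share=0.5)` flags only "nytimes". Choosing the cutoff as a share of stories rather than a raw count keeps it meaningful as the database grows.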

If any of you have feedback on how to improve our method or have ideas on how to set the threshold for excluding Twitter accounts, please let us know!

Further Steps

In addition to ironing out the wrinkles laid out above and continuing to scrape data, we also hope our project is extensible to other types of data. For example, we suspect that Facebook accounts aren’t getting quoted in news articles currently, but we can start keeping track of that in case the trend changes. 

Also, the underlying framework of our project (scrape and process data from Media Cloud, store it in a CouchDB instance, and serve and visualize the data in a web app) should work for many search terms. We searched for Twitter accounts, but looking for the words “gun control” or “Secretary of Defense” should work the same. If we had a method that could guess at gender using a name, we could track whether more men or more women are being quoted in the news.
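To illustrate the search-term idea, the processing step could swap the Twitter pattern for arbitrary phrases. This is a rough sketch, assuming stories arrive as plain strings; `count_stories_mentioning` is a hypothetical name, and the whole-phrase, case-insensitive matching is just one possible choice.

```python
import re


def count_stories_mentioning(stories, terms):
    """Count how many stories mention each term at least once.

    Matching is case-insensitive and on whole phrases only, so
    "gun controls" does not count as a mention of "gun control".
    """
    patterns = {term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
                for term in terms}
    counts = {term: 0 for term in terms}
    for text in stories:
        for term, pattern in patterns.items():
            if pattern.search(text):
                counts[term] += 1
    return counts


stories = [
    "The Secretary of Defense spoke about gun control.",
    "Gun control debate continues.",
    "Nothing relevant here.",
    "gun controls",
]
counts = count_stories_mentioning(stories, ["gun control", "Secretary of Defense"])
```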

Let us know how you’d use this, and share your thoughts!