What stories do data tell?
Senior Fellow, Schuster Institute for Investigative Journalism
I am a journalist, deeply interested in social justice issues. I am currently working on trafficking and debt bondage issues. I have started work as a researcher with MIT’s Opendoclab, which has been a great way to explore new and interactive forms of story-telling tools and reflect on what needs to change in the practice of traditional journalism.
I worked for The Times of India for about 18 years. In 2010, I came to Harvard Kennedy School for graduate studies. I stayed on at Kennedy School and assisted with the teaching of a course on mobilization of social movements. I am writing a case for teaching at Kennedy School on debt bondage. In 2008, I was selected as a Nieman Fellow at Harvard and in 2010 as a Mason Fellow at Harvard Kennedy School.
As media makers and members of the public, we increasingly have access to rich data sets that contain information about our communities, our politics and our government. Our panelists explore their strategies for finding and sharing stories embedded within sets of data.
Moderating the panel is Emily Bell of the Tow Center for Digital Journalism at Columbia University.
- Dan O’Neil is the Executive Director of the Smart Chicago Collaborative, which funds and fosters civic and municipal innovation.
- Kara Oehler is editor-in-chief of Zeega, which does innovation projects on the research and design of storytelling.
- Jonathan Stray, project lead of the Associated Press's Overview Project, an open-source tool to help journalists find stories in large amounts of data.
- Laura Kurgan is the director of the Spatial Information Design Lab at Columbia University. Her project Million Dollar Blocks maps the geography of incarceration in the US.
Emily flew back from London last night to join this conference, finishing 24 hours of travel and a 5-hour train ride to be here. Each panelist will talk and show examples of what they are doing.
Laura starts off by showing us some beautiful visualisations of migration around the world, developed with data about remittances: small amounts of money sent home by migrants. The World Bank reports that in 2010 the amount of money sent in remittances was three times the amount given in development aid. Laura did not simply use the World Bank data, but also looked at how many migrants moved from country to country. “This is qualitative data, storytelling data,” explains Laura. She begins to play around with the visualisation, showing migration from Mexico to the US. So much money, she says, does not travel through visible data sources. Laura begins to cycle through a list of countries to portray remittances by country of origin.
Using Zeega, one of Laura's students filmed the stories and street vendors who are sending money across borders. Laura asks the audience if they are familiar with the city of Touba in Senegal. Only a few hands go up. A city of 2 million, Touba is a religious center for Mouride Islam. Thanks to a transnational network of disciples, Touba has experienced significant growth. This activity is most prominent in New York, where a large Mouride community of stores and street vendors transfers $15-20 million in remittances annually.
Now Laura starts talking about her Million Dollar Blocks project. Her Spatial Information Design Lab is a think-tank focused on three areas: data literacy, new data visualisation and collection tools, and making data public. Putting data on a map can open space for action. Laura says that data literacy is an area that few people focus on. Collaborating with a criminal justice activist, Laura has been exploring the relationship between architecture and justice, working with high school students in New Orleans and in other areas. The premise behind architectural justice is that data is always designed, and that the design of data implies a policy which has an effect on the built environment and the people living in it.
Laura tells us a story about crime and geography. The prison population in the US keeps going up; where do those people come from? Using home addresses of the incarcerated, Laura has been contextualizing incarceration data by mapping it out. Brooklyn has the highest incarceration rates, but crime appears to be geospatially dispersed. Laura explains that crime geographies lead to crime prevention tactics, while prison geographies can lead to community investment strategies. She walks us through the data: in Brooklyn alone, incarceration cost 315 million dollars, and certain single blocks cost millions, hence the name of the project: Million Dollar Blocks. In particular, Laura homes in on Brooklyn's Community District 16, which recorded incarceration costs of 11 million dollars. 40 percent of those incarcerated returned to prison within three years.
Synthesizing the available information, Laura has projected a full scenario of the future. Comparing data sets of neighbourhoods and prisons, she says it was shocking to see what is happening in New Orleans. Prisons are being fortified and getting more beds, while public housing is being destroyed. Her team started to work with nonprofit organisations. Within one neighbourhood there were a lot of nonprofit organisations that had not talked to each other although they were working in the same area.
They also worked with the department of probation. Laura shows us a slide of what probation looks like. People wait for hours to meet with their probation officers. The team did a project with probation officers that used a service design approach to improve the client experience.
Next up is Kara Oehler, who tells us about her work with Zeega, a non-profit based in Cambridge dedicated to interactive storytelling, named after filmmaker Dziga Vertov. We tend to think of databases and narrative as sitting at opposite ends of the spectrum, she says. Zeega is an open source interactive editor which brings them together. Kara shows us a rapid series of examples from the Zeega interface, showing us animated gifs from archives and audio from SoundCloud, and combining them on a shared canvas.
Zeega has also been used to approach the web as an archive, says Kara. The engine can explore the web in different ways: handling tweets, adding tags to images, viewing images on a map and drilling down into different topics, on both a macro and a micro level. Kara demonstrates the system’s archival potential by searching for information on Fukushima that can be turned into a collection. She shows a slide of radiation levels at Fukushima.
One of the key questions at Zeega is how data can be put into statistical frames. Kara has been working with 8 different producers to bring participatory projects to life. For example, she has been collaborating with Todd Melby, producer of Black Gold Boom, to tell personal stories on big topics, such as that of a person who had not slept for 111 hours.
At an upcoming hack day, the Zeega team is planning to take these personal stories and contextualise them with data. For example, did energy drink sales increase? They plan to make these personal stories more dynamic, with constantly updated data.
Dan O’Neil is next. Up first are strategies for finding data, and he emphasizes using search. Search is your friend, he says. “I prefer it to asking.” He was responsible for data acquisition for Everyblock. Dan reemphasizes the importance of searching for data on your own. “Asking people for data never got me anywhere,” he says. He gives the example of the Dallas crime reports: the city has several years of data on its website, but you need to look for it. If you look hard enough, it has the best crime data. And it has narrative. After all, cops record everything that happens. Data has more structure than you would imagine, he says.
Dan created data from text to see why sources were granted anonymity. This is an idea he has been working on since 2005, after the Jayson Blair episode at the New York Times. Dan shows a slide with a list of reasons for allowing anonymity. He also points us to the Data Journalism Handbook; a lot of people here helped create it. On a lighter note, he says it is highly focused on asking for data.
Dan makes the point that context is important in data. Publishing data without context is not super-useful. Why is most data boring? Dan says it’s because data is made by people, and most people are boring most of the time. Hence the need for a new model of presenting data: “I use data to tell stories.” He tells us the story of his incredibly detailed post on a Walgreens in Milwaukee, including information from 10 different data sources.
He gives details of how he developed the story and how he got the data. You have to get as much data as you can, he says. For instance, in this case, he even got building permits from the city of Chicago. Building permits are boring data, but there can be exciting details embedded in them: he found a building that was closed for 20 years and open for three days. He talks about other resources. Sanborn maps are amazing resources on land use and building use. Original photography is data. He looked in the New York Times archive and found material there. He ran an advanced search for the word “jimmied” and found it useful: police had used it several times, and it could be used for further searches.
Dan argues that we need to start embedding data in stories. We also need to take personal responsibility for our own data. In the crime records, people call the police and lie. Crime data is full of amazing lies. That is an abstraction of data that is not useful. He gives an example of the relationship between human beings and data. Suppose you are looking for how many planes are struck by birds. The data available is terrible. Reporters wrote about this, and when data on these details was then released, San Francisco International Airport looked terrible because it had good data. There is a relationship between human beings and data.
Jonathan Stray is next. He shows a visualization of one month of the Iraq War logs. He talks about the files released by Wikileaks. His work came out of dissatisfaction with the reporting process, as all media went to their reporters covering Iraq to interpret this data. They then drew up a list of terms, put them into search engines and wrote stories. “I was unsatisfied with this.” He talks about how immense this data was. The initial release itself was 90,000 reports. You can’t read them all. He quotes a paper by Stephen Ramsay that emphasizes this point, "The Hermeneutics of Screwing Around": what do you do with a million books? You can’t possibly have read more than a tiny fraction.
The question then was how to interpret this huge body of information, especially when everyone brings their preconceptions to it. He shows a slide of his work with the Wikileaks information, which shows thousands of multicolored dots. Every dot on the slide is one report, he says. Each color-coded dot means something. For instance, he says, blue is for a civilian crime, dark green for friendly actions, etc. Reports with similar language are clustered together.
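To make "reports with similar language are clustered together" concrete, here is a minimal sketch of one common way to do it: weight words by TF-IDF, compare reports by cosine similarity, and group greedily. This is an assumption for illustration, not a description of how Jonathan's visualization was actually built. The function names, the threshold of 0.2 and the four one-line "reports" are all made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into a sparse {word: weight} TF-IDF vector."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many reports does each word appear?
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(docs, threshold=0.2):
    """Greedy single pass: a report joins the first cluster whose seed is similar enough."""
    vectors = tfidf_vectors(docs)
    clusters = []  # lists of document indices; the first index is the cluster's seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical one-line "reports", invented for this sketch (not from the war logs).
reports = [
    "tanker truck explosion on highway",
    "explosion destroyed a fuel tanker truck",
    "checkpoint patrol detained two suspects",
    "patrol detained suspects near checkpoint",
]
print(cluster(reports))  # [[0, 1], [2, 3]]
```

The two tanker reports end up in one group and the two checkpoint reports in another, without anyone choosing search terms in advance. With 90,000 real reports a greedy pass like this would be too crude, but the idea is the same: similarity in wording, not a reporter's preconception, decides what sits together.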
There were multiple stories emerging from the data. Most importantly, the visualization showed that in December 2006 Iraqi civilians were fighting each other. There were more stories there. Clusters labelled “female” portrayed another story. Clusters labelled “truck” or “tanker” that involved explosions were about people blowing up tanker trucks, which is yet another story.
There is more happening in the world than we can pay attention to. And it is available online. How do we navigate through it? Choosing what is news and what stories to tell has become critical. The Overview project is a tool for investigative journalists to make sense of large amounts of information. Jonathan believes that we have a language which is good at describing what is happening in the world: human language. He wants to derive meaning from that. This is the Overview project. He looked at what private security contractors did during the Iraq war: over 400 pages. He was trying to find the answer to the question of what the typical day was like. Did the policies change? Did the actions change?
He tells the audience that when you deal with data, you have to ask about sampling: what the data means, given that it is a large collection of similar items, and what is included and what is not. The other theme is how you draw a conclusion from a data set. He shows a slide on “Graphical Inference for Infovis.” He talks about the null hypothesis, illustrating it with the fact that some people will get better on their own whether or not they are taking a drug. He applies the same idea to data sets: how do we bring the “there is nothing there” hypothesis into visualization?
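One way to make the “there is nothing there” hypothesis concrete is to shuffle one variable so that any real relationship is destroyed, then see whether the real data stands out from the shuffled versions. The sketch below is an assumption for illustration, not something shown on Jonathan's slides: `make_lineup` hides the real pairing among shuffled decoys for a human to pick out, and `permutation_p_value` does the same comparison numerically. All names and numbers are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_lineup(xs, ys, n_panels=6, seed=0):
    """Hide the real (x, y) pairing among shuffled decoys.

    Returns the list of panels and the index of the real one. If viewers
    cannot pick the real panel out, the pattern may be just noise.
    """
    rng = random.Random(seed)
    real_index = rng.randrange(n_panels)
    panels = []
    for i in range(n_panels):
        if i == real_index:
            panels.append(list(zip(xs, ys)))
        else:
            decoy = list(ys)
            rng.shuffle(decoy)  # breaks any real link between x and y
            panels.append(list(zip(xs, decoy)))
    return panels, real_index


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffled datasets whose correlation is at least as strong as the real one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles


# Made-up numbers purely for illustration: eight periods and a steadily rising count.
periods = [1, 2, 3, 4, 5, 6, 7, 8]
counts = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

panels, real_index = make_lineup(periods, counts)
print(len(panels), "panels; the real one is at index", real_index)
print("permutation p-value:", permutation_p_value(periods, counts))
```

A small p-value only says the pattern is unlikely to come from shuffling alone. It says nothing about whether the data was collected or counted correctly, which is a separate question.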
He shows a slide with six sets of data. If humans cannot decipher it, then there is nothing there; it is just noise. A lot of the stories we are telling from data are false. To illustrate his point further, he shows a line chart with two lines from a data journalism workshop at Oregon. Students chose data sets, found a story and plotted it. One group looked at data on bicycle thefts and bicycle commuters. They plotted it and saw that bicycle commuters were going up while bicycle thefts were going down. The students said they weren’t sure if they were right, and this was discussed in class. The issue that came up was whether the data was correct.
He asks people to yell out how to check it. It took 10 minutes for the journalism students to come up with the real point: talk to the person who created the data. In this regard, crime data is notorious. Police are intentionally or unintentionally producing bad statistics. Different people count crimes in different ways. The issue is that data is not truth; it is made by humans. Even if it is accurate, there is no such thing as pure data.
He also asks the audience: does anyone have an idea of the editorial choices made for the data? There are lots of responses. Some are about color coding. Color is important too for an international audience, he says. The stories don’t come from nothing. There are human choices about what matters and where we look. When we condense things in the world into sketches, the meaning comes from the selection and editorial process. He says one should watch for sampling, for where the data came from, and for the choices that were made.
Then it is back to the moderator, who asks: where do we need to start with this? Data has a context, we know.
Jonathan has an idea for a book which covers the strategies we use to find a story in a dataset. What can we learn about best practices from other fields? Laura argues that you have to know what you are looking at to make use of data. She points to the paper “Graphical Inference for Infovis” and wonders if it would be possible to find the patterns without a familiarity with Texas. Jonathan responds that the paper found that after surveying 100 people, participants were able to find the patterns.
Emily asks how we can approach data in the practice of journalism. Dan thinks it needs to be baked into journalism education; journalists need to become adept at finding and analyzing data. Kara points out that people who write stories need friends capable of building algorithms that can do the things they can't.
Laura says data means different things to different areas of inquiry.
Emily opens the floor to questions from the audience.
Someone asks if anyone is crowdsourcing the annotation of data. Jonathan responds that government data is often well annotated. For example, the Canadian stats agency will give you the numbers but also 30 pages on how they got them. If your question is about GDP, the problem is solved, but if it is about bicycle theft, it may not be. Emily points us to Matt McAllister's very long list of collaborative journalism projects.
Someone from the floor asks: “I often do FOIAs, or have interns do FOIAs. But my question is: Does the data come first or the narrative come first?” Dan says, “I don’t know.” You have to start by figuring out what the problem is, what the non-boring thing is that you want to write a story about, then figure out what the data says and drill down into it.
Laura Kurgan: It is impossible to know everything. It’s very hard to separate correlation from causation.
Kara Oehler: We learn through narrative, stories, examples of things. Even looking at a set of statistics, you’re looking at it because you have an interest in a particular story coming out of it, for example with the prison data sets.
Jonathan says you have to have an interest. You want to learn something and not just project what you already believe. “If you’re looking for data to support an idea you have, you will find it. There is a lot of data, there is noise, there is error, there is lies... You have to ask, can I find data that proves otherwise, and then look at it. That’s the scientific method. Believe me, you can prove anything you want.”