Creating Technology for Social Change

Algorithms & Data-driven Storytelling Panel at the Computation + Journalism Symposium at Columbia

I’m at the 2014 Computation + Journalism symposium at Columbia University. Here’s a quick intro to what we are talking about:

“Data and computation drive our world, often without any kind of critical assessment or accountability. Journalism is adapting responsibly—finding and creating new kinds of stories that respond directly to our new societal condition. We are excited that you can join us for a two-day conference exploring the interface between journalism and computing.” (Symposium program)

Here is a live-blog account of the first panel, which consisted of three papers on algorithms and data-driven storytelling.

The Story Discovery Engine: Artificial Intelligence for Public Affairs Reporting

Meredith Broussard, Assistant Professor, Temple University

She shows a slide of Data from Star Trek. There is no such thing as Commander Data; most of what we know about AI comes from the movies. What’s real about AI is what we can make with code, algorithms, and data.

It’s hard as a reporter to come up with original story ideas. She had reported in a number of areas and wanted to move into education reporting. She wanted to write a story about textbooks but didn’t have the sources; she wanted to talk with people at the best and worst schools but didn’t know where to start. So she built software to solve her reporting problem.

She also got a tip from a source who said that you could send your kid to a reputable school as long as you did fundraising for books. She wanted to see whether there was a relationship between book shortages and students failing standardized tests in Philadelphia public schools.

She shows a slide of the education system as a funnel. Standards like the Common Core lead to a “scope and sequence” at the local level: a document, meant to correspond to the standards, that tells teachers what students should know at different points in the school year. Branded curricula are then built to go along with these. The top three companies have a 35% market share for educational materials.

She still wanted to write a story about the education system, so she built something based on a concept from AI called an “expert system”: the idea that you can ask a box for advice and the box will give you answers as good as a human’s. The idea fell out of favor because human intelligence so far surpasses black boxes built from code. She revived it, but instead of having advice come out of the box, she made data visualizations come out of the box. It’s a story discovery engine that a reporter can use to discover public-affairs stories, and it could be applied to education or any other topic area where there is a set of well-defined rules and ways of operating.

A reporter is good at looking at what should be versus what is. In this case, the rules are articulated in laws and policies, which is why the approach works well for public affairs. Her prototype can be found at stackedup.org and the code is free on GitHub. The site has a couple of different components. She wrote seven stories that were published in various places, most recently on the Atlantic’s site. The “Reporting Tool” column helps you discover the stories. You can see how much textbooks cost for all of the schools or for a single school; it shows, for example, that for a single textbook title a school district would need to spend almost $9,000 to purchase the needed books. You can also see whether a particular school that is missing required textbooks is underperforming on standardized tests relative to the rest of the district: both that there are not enough books and that the school is not performing well. She is suggesting that these things are related. The tool covers 241 schools and every grade in that data set.
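
To make the “rules in, anomalies out” idea concrete, here is a minimal sketch of the kind of check such an engine might encode. The field names, the one-copy-per-student threshold, and all the numbers are my assumptions, not Stacked Up’s actual schema:

```python
# Hypothetical sketch of a story-discovery rule: compare "what should be"
# (every student has the required books) with "what is" (the inventory),
# and surface the anomalies a reporter might investigate. Field names,
# the one-copy-per-student rule, and the numbers are all invented.

def flag_book_shortages(schools):
    """Yield (school, cost_to_fix) pairs for under-resourced, low-scoring schools."""
    for s in schools:
        shortfall = max(0, s["enrollment"] - s["copies_on_hand"])
        cost_to_fix = shortfall * s["unit_price"]
        underperforming = s["pct_proficient"] < s["district_pct_proficient"]
        if shortfall > 0 and underperforming:
            yield s["name"], cost_to_fix

schools = [
    {"name": "Example High", "enrollment": 1200, "copies_on_hand": 400,
     "unit_price": 75.00, "pct_proficient": 31.0, "district_pct_proficient": 45.0},
]

for name, cost in flag_book_shortages(schools):
    print(f"{name}: ${cost:,.0f} to close the textbook gap")
```

A reporter reads the flagged rows not as the story itself but as leads to verify, which matches how she describes the tool: it helps you discover the stories.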

She discusses the impact of her stories and her tool. There was a lot of media attention, and one of the administrators responsible for the book shortages was fired; it turned out money for textbooks had gone missing. Another senior administrator also left. In the process, the school district discovered a secondary problem with financing textbooks and has taken steps to remedy it.

She thinks of this project as using Big Data for social justice. She worked with three developers over six months to build this. They built it in a way that can be applied to any school district in the country.

Questions from the audience:

Q. You stated in your article that the schools’ inventory data turned out to be missing. What is the impact when the data is wrong?

A. One of the things that is fascinating about data-driven work is when you explore the places where the data is missing.

Q. A lot of the time, legislation is written in a way that doesn’t easily lend itself to code. How did you deal with that?

A. There is not a strict translation of legislation into code. I had to take apart the legislation itself and rewrite the rules. That’s a human intervention; it’s not scalable, but it’s necessary.
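
For instance, that rewrite step might look like the sketch below. Both the paraphrased statute and the rule are invented for illustration; this is not her actual code or any real legislative text:

```python
# Legal text can't be executed directly; a human has to restate it as a
# testable rule. The paraphrase and the rule below are both invented.

STATUTE_PARAPHRASE = (
    "Each school shall provide instructional materials sufficient "
    "to implement the planned curriculum."
)

def meets_materials_requirement(enrollment: int, copies_on_hand: int) -> bool:
    # The human intervention: "sufficient" is interpreted here as one copy
    # per student. The statute never says that; a reporter has to choose an
    # interpretation and be able to defend it.
    return copies_on_hand >= enrollment
```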

Transparency and Interactivity in Data-Driven Rankings

Nick Diakopoulos, University of Maryland

Stephen Cass, IEEE Spectrum

Joshua Romero, IEEE Spectrum

Nick Diakopoulos wants to share his research on rankings and on ways to present rankings more interactively and transparently to the public. A new ranking came out earlier this month: LinkedIn analyzed millions of users on its platform, ranked schools for media professionals, and found that New York University, Hofstra, and Duke were the top three. US News & World Report usually ranks these schools much lower, which is surprising. Rankings are a mystery: we don’t really know what is driving them.

US News & World Report does publish its criteria and their weighting, but LinkedIn published only a blog post saying it analyzed which people end up at “desirable companies”. That is difficult to interpret, let alone compare to US News & World Report. Against this backdrop, Nick started working with IEEE (the largest professional organization for electronics and electrical engineers) to rank the top programming languages.

IEEE and Nick ranked the top 10 programming languages. Java is at the top, and the list includes Python, JavaScript, PHP, Ruby, R, and C. They used a multitude of criteria, including what is most widely used, what is trending, and what platforms a language runs on. Their task was to open up these rankings and provide insight into how they were produced. Users could drill down into a ranking and see that Google, GitHub, and other sources were driving a particular language’s ranking, and they could create their own rankings and compare those to how the IEEE-defined system ranked the languages, as in the sketch below.
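
A minimal sketch of how a re-weightable ranking like this can work. The metrics, weights, and numbers are placeholders I made up, not IEEE Spectrum’s actual data or scoring formula:

```python
# Toy re-weightable ranking in the spirit of the IEEE Spectrum app: each
# language carries per-source signals, the score is a weighted sum of
# normalized signals, and changing the weights changes the order.
# All numbers are invented.

METRICS = {  # language -> {source: raw signal}
    "Java":       {"google": 100, "github": 100, "trending": 40},
    "Python":     {"google": 85,  "github": 95,  "trending": 70},
    "JavaScript": {"google": 90,  "github": 85,  "trending": 60},
}

def rank(weights):
    maxima = {src: max(m[src] for m in METRICS.values()) for src in weights}
    scores = {
        lang: sum(w * m[src] / maxima[src] for src, w in weights.items())
        for lang, m in METRICS.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank({"google": 0.5, "github": 0.3, "trending": 0.2}))  # editors' weights
print(rank({"google": 0.0, "github": 0.0, "trending": 1.0}))  # a reader's remix
```

With the first weighting Java comes out on top; putting all the weight on “trending” promotes Python. That kind of order flip is exactly what the interactive app let readers discover for themselves.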

They wanted users to be able to build understanding by creating their own rankings.

What was the response from the public to this report?

A couple thousand tweets and likes, and a couple hundred comments. 1,285 tweets translated to 884 original tweets, many of which linked to the app in its editing interface. 127 of those came from people who had gone in and tweaked the rankings in various ways, and 6 had tweet text that referred to how they had edited the rankings. The team did not have in-app metrics to assess engagement.

Comments on the app were interesting: people critiqued the editorial decisions the team made. Why is a language missing? Why does SQL count as a programming language? People pointed out missing items and questioned definitions, classifications, and methodology, among other things.

Takeaways

Only 16% of tweets engaged deeply with the editing/filtering capability. This raises the question of how much transparency in ranking systems is really necessary: when do people care enough to step in and try to learn what is going on behind the scenes?

He takes the other points as design criticism and is reflecting on the editorial decisions they made.

The edge cases provide an opportunity for co-constructing meaning with the users of the visualization.

Regulatory Breakdowns in Oversight of U.S. Stockbrokers

Rob Barry, Wall Street Journal, with Jean Eaglesham

Rob leads by talking about the real-world process of gathering data and turning it into stories, including some of the hurdles and analytical problems. For him, the most difficult part was obtaining the information.

There are 634,000 licensed traders. FINRA is one of the financial industry’s regulators. Many of us might assume a trader’s job would be done by E*Trade today, so it’s interesting that there are still hundreds of thousands of people who call people up and try to get them to buy securities. Those traders fall under an opaque regulatory regime. FINRA used to be the NASD, which then merged with the New York Stock Exchange’s regulatory arm. It is made up of the brokers and it regulates the brokers. It’s a non-governmental entity, so there is no way to request information from it: if you want a list of brokers with criminal pasts, you can’t get one.

But for every stockbroker, FINRA puts up a dossier of their history. It forbids you from “scraping” the data, and you can’t get the data by asking. So the question was: how could they get it? The data is actually stored in a repository called the Central Registration Depository, and every state has jurisdiction over the individual brokers registered in that state.

However, every state told them it didn’t have access to that data. Rob’s team had to work with each state individually, walking officials through the software interface: which buttons to click and where the data lived. He shows a flatly incorrect letter from the New York Attorney General’s office stating that the state doesn’t have the data they were requesting; just prior to that letter, New York had argued exactly the opposite in a court case.

They got CDs from all the states, and he shows a photo of the whole stack of disks. The data came in all different formats: XML, CSV, Access, Excel. They received 110 million rows of data in total, which they had to consolidate into a single database they could query, eliminating duplicates along the way. They ended up identifying around 550,000 individuals out of the overall field of 630,000 they knew existed.
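
A rough sketch of what that consolidation and deduplication step can look like. The file names, column names, and use of pandas are my assumptions; the Journal’s actual pipeline is unpublished:

```python
# Hypothetical consolidation pass: load per-state extracts in whatever format
# they arrived in, normalize to common columns, then deduplicate on the CRD
# number (the identifier the Central Registration Depository assigns each
# broker). File and column names are invented.
import pandas as pd

frames = []
for path in ["ny_brokers.csv", "tx_brokers.xlsx"]:  # one extract per state
    df = pd.read_csv(path) if path.endswith(".csv") else pd.read_excel(path)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    frames.append(df[["crd_number", "full_name", "firm", "disclosure_type"]])

combined = pd.concat(frames, ignore_index=True).drop_duplicates()
print(combined["crd_number"].nunique(), "unique individuals identified")
```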

They were interested in the “disclosures”: criminal records, bankruptcies, customer complaints, and liens. For the first time, they were able to see who these people are. 13% of the individuals had at least one red flag; 2%, about 10,000 people, had three or more; and a couple hundred even had more than 10 red flags in their histories.
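
Once the disclosures sit in one table, the tally behind those percentages is a few lines. Again a sketch with invented column names and data:

```python
# Count "red flags" (disclosures) per broker and bucket the results.
# Column names and rows are invented for illustration.
import pandas as pd

disclosures = pd.DataFrame({
    "crd_number": [1001, 1001, 1001, 1002, 1003],
    "disclosure_type": ["customer complaint", "lien", "bankruptcy",
                        "criminal", "customer complaint"],
})

flags = disclosures.groupby("crd_number").size()
print((flags >= 1).sum(), "brokers with at least one red flag")
print((flags >= 3).sum(), "brokers with three or more")
```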

One of the things they did was follow brokers as they traveled from firm to firm. A lot of them were located on Long Island, and a number of individuals from one firm all moved to another firm at the same time. They also cross-referenced all 550,000 individuals with public-records databases. They found a pattern of “cockroaching”, meaning brokers scuttling from firm to firm, and they showed how individuals dodged arbitration claims by closing up shop. They saw these patterns in the data.
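
One way to surface that kind of cohort movement in employment-history data, sketched with assumed column names and a quarterly time bucket (not the Journal’s actual method):

```python
# Hypothetical "cockroaching" detection: bucket employment moves by quarter,
# then count distinct brokers making the same firm-to-firm jump in the same
# window. Column names, the quarterly window, and the rows are assumptions.
import pandas as pd

moves = pd.DataFrame({
    "crd_number": [1, 2, 3, 4],
    "from_firm":  ["Firm A", "Firm A", "Firm A", "Firm B"],
    "to_firm":    ["Firm C", "Firm C", "Firm C", "Firm D"],
    "move_date":  pd.to_datetime(["2013-01-10", "2013-01-20",
                                  "2013-02-01", "2013-05-01"]),
})

moves["window"] = moves["move_date"].dt.to_period("Q")
cohorts = moves.groupby(["from_firm", "to_firm", "window"])["crd_number"].nunique()
print(cohorts[cohorts >= 3])   # groups of brokers who moved together
```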

Rob is interested in this because of the lack of disclosure. He sees data journalism as a way of working around that lack of disclosure to get at information that should be public.

Questions from the Audience

Q: What lessons came out of this work? Repeatability and scalability are mantras of computation. Is there anything you could share about how to repeat and scale this kind of reporting?

A: I’d be embarrassed to share the code with you. We are looking to make more information available in the coming weeks. As for the process of asking for the information: there are tools for filing FOIA requests, but that tooling needs a lot of work. Dealing with government in bulk is hard.

Q: Paul Resnick asks: if you had had all the computer scientists in this room, what would you have done differently?

A: Rob would like to have more sophisticated data-analysis techniques at his disposal; he doesn’t know acronyms like “LDA”. He wishes more computer scientists could work in newsrooms, but the pay won’t work.

Q: Are you making the data available? How could people re-use this?

A: The question of reuse is really hard. Internally, it’s hard even to share with your colleagues. As for sharing externally, we want to be very careful with public records. We are going to make more data available, but that’s not a decision I have the jurisdiction to make.