Dispatches from #NICAR14: Jennifer LaFleur and David Donald on the data-driven story | MIT Center for Civic Media


I'm currently in Baltimore for the 2014 conference of NICAR (the National Institute for Computer-Assisted Reporting). In this series, I'll be liveblogging the various talks and workshops I attend — keep in mind, this is by no means exhaustive coverage of all the cool stuff going on at the conference. For more, check out Chrys Wu's index of slides, tutorials, links, and tools or follow #nicar14 on Twitter. Read on for my summary of Jennifer LaFleur and David Donald's presentation on conceiving and launching a data-driven story.

Jennifer LaFleur is senior editor for data journalism at The Center for Investigative Reporting. Previously, she was the director of computer-assisted reporting at ProPublica and has held similar roles at The Dallas Morning News, the San Jose Mercury News and the St. Louis Post-Dispatch.

David Donald is data editor at The Center for Public Integrity, where he oversees data analysis and computer-assisted reporting at the Washington-based investigative journalism nonprofit. Prior to the Center, he served as training director for Investigative Reporters and Editors and the National Institute for Computer-Assisted Reporting. His work has been honored with the Philip Meyer Award, the IRE Renner Award and the Robert F. Kennedy Journalism Award.

Brant Houston (moderator) is the Knight Chair of Investigative and Enterprise Reporting at the University of Illinois. He was executive director of IRE for more than a decade. He is author of "Computer-Assisted Reporting" and co-author of the 5th edition of "The Investigative Reporter's Handbook." He worked as an investigative reporter in news rooms for 17 years and currently is involved in developing online nonprofit investigative reporting centers and networks in the U.S. and globally.

Any good data-driven story starts with research. David kicks off the talk by suggesting some launching points for these stories. One place to start is the IRE Resource Center, which lets you find tip sheets from previous stories. Depending on your newsroom, you may have access to experts and/or special databases. Academic and government papers are also a ripe source of potential stories.

Jennifer takes a closer look at the national bridge inventory data, which has been used for stories all around the country. “It’s good to have the data on hand,” she says, in case a bridge collapses — you’ll already be one step ahead. The NICAR data library has “incredibly localized data” that can be very helpful, adds David.

“When I start a new topic, I go into the IRE research center, get every tip sheet on that topic, and read it,” says Jennifer. “I’ve done my due diligence before I do any analysis.”

In terms of getting the data, you can always start with the NICAR data library, but often you have to check the agency website, and — even more often — you have to request it directly from the agency or even create it on your own. David notes, “Data is often data that hasn’t been released before.”

Jennifer adds a caveat to data collection — before you request data from an agency, know the law, particularly concerning electronic records. “Do your homework.” You have to stay aware of the appropriate costs and make your request clear — and follow up on it. “Get to know Leon — the guy who works in the basement of the agency, and nobody’s talked to him in years.”

When there is no data, use your sources! You can employ various techniques to do this, such as sampling, building data from documents (e.g., scraping), or creating surveys and questionnaires.

“There’s no such thing as immaculate data,” warns David. “Every piece of data has been touched by human hands. That means ripe for error.” Always assume that there are problems in the data — interview it. Jennifer backs this with a story about flawed data from the Department of Education, an agency you would assume would bulletproof their data. “Don’t run with scissors. Paranoia is your best friend.”

How does one go about interviewing the data? Think through some of the basic things: how many records should you have? Check other counts. You can (and should) double-check these against summary reports. Check for duplicates and consistency-check all fields. For example, keep an eye out for records with slightly different naming (e.g. “road” vs. “rd”).
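These basic checks can be sketched in a few lines of pandas. This is a minimal illustration, not the panelists' actual workflow, and the column names (`structure_id`, `road_name`) and sample rows are hypothetical:

```python
# A minimal sketch of "interviewing the data": count records, check for
# duplicates, and normalize inconsistent naming ("road" vs. "rd").
# Column names and rows here are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "structure_id": ["TX001", "TX002", "TX002", "TX003"],
    "road_name": ["Main Road", "Oak Rd", "Oak Road", "Elm Rd"],
})

# How many records should you have? Compare this against the agency's
# summary reports.
record_count = len(df)

# Flag every row that shares a structure ID with another row.
dupes = df[df.duplicated(subset="structure_id", keep=False)]

# Normalize naming conventions before grouping or counting, so that
# "Oak Rd" and "Oak Road" collapse to the same value.
df["road_name_clean"] = (
    df["road_name"].str.lower().str.replace(r"\broad\b", "rd", regex=True)
)
```

Once names are normalized, counts by road (or by any other field) become trustworthy; without that step, "Oak Rd" and "Oak Road" would be tallied as two different roads.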

Additionally, make sure you understand which fields to look at — don't assume you know. Oftentimes, there are relationships between the columns; they don't necessarily operate in isolation. The name at the top of a column doesn't necessarily mean what it says. "Assumptions about things, even something as simple as amount, can be wrong." Different people handle this data with different conventions and definitions.

Check the ranges of fields — for example, with dates, ask yourself: is this possible? Was a bridge built in the future? You should always check for missing data or blank fields. Are they real values, or the result of some kind of error?
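A range-and-blanks pass like the one described can be sketched as follows. Again, the column names (`year_built`, `sufficiency_rating`) and values are hypothetical stand-ins for the bridge data:

```python
# A minimal sketch of range and missing-value checks on hypothetical
# bridge records. Column names and values are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "year_built": [1932, 1968, 2047, None],  # 2047: a bridge built in the future?
    "sufficiency_rating": [72.0, None, 88.5, 40.1],
})

# Is this possible? Flag years later than the current year (2014 at the
# time of this talk).
future = df[df["year_built"] > 2014]

# Blank fields: are they real values, or the result of some kind of error?
missing = df.isna().sum()
```

The `future` frame and the `missing` counts are the starting points for questions to put back to the agency, not answers in themselves.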

Now, we take a closer look at the columns in the bridge data. The data has an overall assessment of the bridge condition, but also contains analyses of the substructures. "It's good to look at all the individual pieces," says Jennifer. Words and phrases don't necessarily mean what they do in common English — there may be specific engineering definitions for different status ratings.

You should also keep track of changes over time that may create problems in the data set. Did the agency in question change their standards or record layout? This can cause major issues down the line. One way to do this is to look at changes from year to year. Jennifer recalls doing this with a data set on truck accidents, where Houston’s count significantly decreased one year. This is a red flag for improper reporting. For something like bridges, you can go out in person and verify the data.
“Essentially, it’s a reality check. Every database is a reduction from reality,” says David. “Reality changes over time, and the data might not reflect that.”

In terms of methodology, don’t recreate the wheel. Use the resources and experts you have available, and keep notes as you interview and analyze your data. You should use a standard naming convention for files and tables. “If you ever put the word final in one of your file names, be ready to make corrections.” Think about certain records that might need to be excluded because of problems.

You can borrow methodologies from social scientists and other analysts, so that "it's not something that's just cooked up." Review what an agency has already done, and interview them. Come prepared to these interviews; you shouldn't start from step one. You need to have read all the documentation, so they're on your side. When Jennifer was at ProPublica, they would write white papers and send them around to experts for feedback. You should develop a hypothesis to test.

Now we go back to the bridge data. "If you think something is weird, it probably is." When bridges didn't have year data, 1900 was used as a placeholder. You can map or graph the data to take a quick look at it and explore its different facets. David introduces us to analysis of variance, which can be very helpful when analyzing data at the county level. This technique tests whether the differences between group means are statistically significant, letting you relate a categorical variable (such as county) to a continuous one (such as a bridge's sufficiency rating).
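Screening out a placeholder value like 1900 before graphing is a one-liner in pandas. A minimal sketch, with a hypothetical `year_built` column and made-up rows:

```python
# A minimal sketch of spotting and excluding a placeholder value (1900)
# before exploring the real distribution. Rows are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "county": ["Bexar", "Bexar", "Dallas", "Harris"],
    "year_built": [1900, 1955, 1900, 1978],
})

# A spike at 1900 in the year counts is a red flag (missing data),
# not evidence of a turn-of-the-century construction boom.
year_counts = df["year_built"].value_counts()

# Exclude the placeholder before mapping or graphing.
real_years = df[df["year_built"] != 1900]
```

Graphing `year_counts` directly would show an artificial spike at 1900; filtering it out first keeps the exploratory charts honest.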

David shows us some records he's pulled from the bridge data set and dumped into SPSS. He compares bridge data from four counties: Bexar, Dallas, Harris, and Travis. Having a complete data set opens up a range of possibilities for analysis. "It's not a sample, we have a population." Data is really good with who, what, where, and when questions, but terrible with why questions.

Using analysis of variance, we can compare the mean sufficiency ratings for bridges in the four counties, along with a test for homogeneity of variances. With a significance value under .05 (the 95% confidence level, a standard in the social sciences and engineering), we can see that the differences are indeed statistically significant. David walks us through the rest of a basic analysis, but suggests taking a statistics course or boot camp for a deeper dive.
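The same kind of comparison David ran in SPSS can be sketched with a one-way ANOVA in scipy. The sufficiency ratings below are invented for illustration; the real analysis would use the full bridge inventory for each county:

```python
# A minimal sketch of a one-way ANOVA comparing mean sufficiency ratings
# across four counties, along the lines of the SPSS analysis described
# above. The ratings are made-up numbers, not real bridge data.
from scipy.stats import f_oneway

bexar  = [55.0, 60.2, 58.1, 62.3, 57.5]
dallas = [70.1, 68.4, 72.2, 69.9, 71.0]
harris = [65.3, 63.8, 66.1, 64.9, 65.5]
travis = [80.2, 78.9, 81.4, 79.6, 80.8]

# f_oneway returns the F statistic and the p-value for the null
# hypothesis that all four county means are equal.
f_stat, p_value = f_oneway(bexar, dallas, harris, travis)

# p < .05 (the 95% confidence level) suggests the differences between
# county means are more than chance variation.
significant = p_value < 0.05
```

As the panelists note, significance tells you the counties differ, not why; the "why" still has to come from documents, sources, and shoe leather.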

That being said, statistics is not the only approach to a story. Sometimes, you should go to the documents. Other times, you need to go to the people or go out into the field. Even without substantial statistical training, you can still do a data-driven story. For example, with sufficiency ratings, you could ask experts, "What is the rating at which people should start worrying?" You could also look into repair cycles and procedures — you can get stories out of data with minimal analysis.