Creating Technology for Social Change

Dispatches from #NICAR14: Hacks or Hackers?

I’m currently in Baltimore for the 2014 NICAR (National Institute for Computer-Assisted Reporting) conference. In this series, I’ll be liveblogging the various talks and workshops I attend. Keep in mind, this is by no means exhaustive coverage of all the cool stuff going on at the conference. For more, check out Chrys Wu’s index of slides, tutorials, links, and tools or follow #nicar14 on Twitter. Read on for my summary of a panel discussion of the legal issues around scraping for journalism. Thanks to Nathaniel Lash for his help liveblogging.

Jeremy Singer-Vine is a reporter and computer programmer at the Wall Street Journal, where he gathers, analyzes, and visualizes data.

Isaac Wolf is a DC-based national reporter for Scripps News. Wolf has employed extensive data analysis and open records requests to uncover stories ranging from fraud in government programs to radioactive ladies’ handbags to the conditions of confinement for children held for years in adult jails. Wolf’s work has triggered federal and state investigations, policy improvements and a U.S. House Committee on Government Oversight Hearing.

Scott Klein is the Senior Editor, News Applications at ProPublica. He directs a team of journalist/programmers building large interactive software projects that tell journalistic stories, and that help readers find the relevance of complex national statistics to themselves and their communities. Scott is also co-founder of DocumentCloud, a service that helps news organizations search, manage, and present their source documents.

Tor Ekeland represents defendants in federal prosecutions under the Computer Fraud & Abuse Act. His clients include notorious internet troll “Weev”, currently serving a 41 month sentence for hacking AT&T’s iPad servers; former Reuters editor Matthew Keys who is under indictment for an alleged hack of the L.A. Times website; and Deric Lostutter, currently under investigation by the DOJ for his role in exposing the Steubenville rapists. He is also general counsel to YourAnonNews.


Isaac starts off the talk with an anecdote: through a Google search, he was able to find a directory of files containing unencrypted applications to join a subsidized phone program. When his team scraped this data and told the company about it, the company took legal action via the CFAA.

Isaac says he’s not going to tell people what to do. “When you see data out there online that looks really interesting, but it’s not clear that it’s intended to be made available for your use, what should you do? Should you proceed? It’s a very personal decision that each newsroom has to make.”

Next up is Scott, who says ProPublica has a policy against creating “straw people” (fake identities) to gather data, because that is not going through the front door. “The ethics are evolving.”

Finally, we hear from Tor, the only lawyer on the panel. “One of [CFAA’s] central provisions essentially prohibits unauthorized access to a protected computer.” Unauthorized access isn’t precisely defined, which means that “courts are all over the place on that.” It’s a very dangerous statute because it’s so poorly written. “Everyone who’s scraping data should be aware of it, because you may be committing a felony.”

“Most broadly, assume you’re going to be challenged,” says Isaac. “You’re going to be prodded by the company, by the entity, whatever it is. You need to have a narrative of how you found this information… you want to demonstrate that you’re using this information for journalistic purposes.”

He says making misrepresentations about yourself or having to break a password are “immediate bright red lines” that shouldn’t be crossed. In terms of scraping, you want to have a crystal-clear argument that the information was already in the public domain. Isaac has created a tip sheet with a list of questions you should consider before scraping.

“The theory is if the website owner decides retroactively that what you did was unauthorized, you have a problem.” Journalists get an extra leg up simply by virtue of their work. There have been instances where the government has gone after reporters, but a lot of it has to do with your organization and its credibility.

Isaac differentiates his Scripps colleagues’ scraping from Weev’s, as it was based on mass-downloading from publicly accessible URLs. Tor completely disagrees: the Department of Justice is not concerned with the method of access, but with “the fact that you accessed it without permission.” In his view, liability ends up “turning on the whim of a website owner.”

When you take this data, the definition of “personally identifiable information” is incredibly broad. “There are no clear lines here… you’re in a brave new world,” Tor continues. “There are no clear answers here.”

“If you are going to be collecting sensitive information, you have to think about how to guard it or dispose of it,” says Isaac. “In terms of reporting, we spent a lot of time around these documents. We were going to do all our interviews in person… we wanted to make sure we were doing everything possible to [minimize the risk of identity theft].”

Scott says that sometimes the onus will fall on you to clean and anonymize the data — sometimes you just have to agree not to show personal identifying information or information that could be used to reverse engineer it. One ProPublica project merely mirrored PDFs from the FEC’s website. Somehow, parts of checks were included in these scans — thankfully, the documents were all hosted in DocumentCloud and OCR’d, allowing for a quick search-and-purge.
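The panel didn’t go into how that search-and-purge actually worked, so the following is only a minimal sketch of the general idea, assuming the OCR text of each document is already available as a plain string. The pattern names, the `flag_documents` and `redact` helpers, and the regular expressions themselves are all hypothetical; real check scans would need patterns tuned to the documents at hand, and a nine-digit match will produce false positives.

```python
import re

# Hypothetical patterns for illustration only. A nine-digit run is a rough
# stand-in for a bank routing number and will also match other numbers.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "routing_number": re.compile(r"\b\d{9}\b"),
}


def flag_documents(docs):
    """Return {doc_id: [pattern names]} for documents whose text matches."""
    flagged = {}
    for doc_id, text in docs.items():
        hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
        if hits:
            flagged[doc_id] = hits
    return flagged


def redact(text, placeholder="[REDACTED]"):
    """Replace every match of every pattern with a placeholder string."""
    for pattern in PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    # Made-up sample text, not taken from any real filing.
    docs = {
        "filing-a": "Pay to the order of the committee 021000021 1234",
        "filing-b": "Contribution of $500 received on 2012-01-05",
    }
    for doc_id, hits in flag_documents(docs).items():
        print(doc_id, hits)
```

Flagging documents for a human to review, rather than redacting automatically, is probably the safer default, since a pattern match alone can’t tell a routing number from any other nine-digit figure.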

NPR aired a story a while ago about a reporter who made up a fake name and applied for a loan. That is a misrepresentation and may not be defensible. “I’m not the ethics police, I don’t think there should be hard and fast rules, but I think you have to be very deliberate if you make misrepresentations about yourself.” You would have to be able to convince a jury that the misrepresentation was necessary and valuable to the story.

Isaac says, “In general, you should tell the truth, especially when you’re trying to collect information, potentially in ambiguous formats.”

With the Message Machine project, ProPublica looked at email targeting for the presidential campaigns. It would be easy for them to make up fake people and register for these mailing lists, but they decided early on that this was over the line. The reason why Newt Gingrich’s campaign is absent from Message Machine? To get on the mailing list, you were required to check a box saying, “I’m ready to take down the Democrats.”

Instead, ProPublica opted to “go through the front door,” getting information from real people, who became much better sources down the line. Scott says he doesn’t feel it would be morally wrong to create straw people, but having these sources definitely improves the overall story.

Tor says people tend to anthropomorphize computers, but essentially they just convert from input to output. Even so, Scott says these systems check for real people and how they voted, and would have likely thrown out any straw people because they don’t actually exist.

From the audience, Jennifer Valentino-DeVries describes the process of the Wall Street Journal story on how websites would vary prices based on user information. In retrospect, she believes it would probably be easier to collect this data through crowdsourcing. Scott clarifies that he’s not criticizing the story, merely making a point about evolving ethics: “There’s no journalistic ethics book that says anything about changing a variable in a cookie.”

Isaac says it helps to come in with an idea of the story you’re pursuing, rather than “blindly scraping.” In data collection, it’s best to have a track record of work to which you can point.

In the hacker community, the sentiment is that presenting bugs to the companies often leads to the problem being buried (the “black hat” argument). White hat hackers, on the other hand, believe it’s best to take it to the company.

Tor believes that had Weev brought the information to AT&T, he probably would not have been arrested. In terms of legal opinion, the circuits are all over the place with scraping, and with the internet, you’re often tied up in multiple jurisdictions. “It’s a total freaking mess.”