Since my last post, I have also completed an interview with Juzne vesti and a case study on Associated Whistleblowing Press. I am currently waiting on answers to some follow-up questions from Public Intelligence and on responses to my questions from someone at WikiLeaks. Additionally, I have been working on my paper, which is based in large part on my blog posts, interviews, and case studies. So, which strategies for document processing are most effective in light of the differing goals of leaking websites? And how have leaking organizations been improving these methods?

Document processing can be roughly divided into three steps: verification, redaction, and analysis/formatting. Verification is the step where documents are validated and cross-checked to be sure they are not forged. Redaction involves removal of names, metadata, and other identifying parts of the document. Analysis and formatting cover anything done to make the information more accessible or understandable. All of these steps contribute to the effectiveness of a release, but the approach to each step and its importance vary based on the goals of the leaking website involved.

There seem to be five different approaches to verification among leaking websites that authenticate documents. First, the staff of a leaking website could file a FOIA request for a document or otherwise request the document directly from the organization that created it. In many cases this is unlikely to work, but when it does it provides high certainty that the document is valid (although the organization involved could have created a fake document in the first place). This can also be a powerful advocacy tactic for promoting free information, protecting whistleblowers, and increasing visibility of corruption. Second, the leaking website staff may have obtained the document themselves or gotten it from a source they trust. This gives the staff high certainty of validity, but that certainty is difficult to prove to the site’s readers. Third, the document could be verified indirectly by the organization that created it. This can result from the first approach when the document is not released but its existence is confirmed. Leaking organizations can also call people from these organizations and ask them to comment on the document. If they comment, the existence of the document is verified. This method is particularly useful if a leaking website has information about the document but not access to the document itself. While not as accurate or precise as the first two methods, indirect verification can be attempted in most cases.

The remaining two methods rely less on response from or access to the organizations involved. Fourth, the leaking organization can cross-check the content of the document for evidence of forgery. This could include searching for inconsistencies or inaccuracies in the historical or political facts mentioned. Conceptual cross-checking may also include analysis of the motives for forgery and how difficult a forgery would be. A fake document can certainly get past conceptual cross-checking, but it has to be a very good fake. Conceptual cross-checking also requires people who are very knowledgeable about the topics relevant to the leak. Fifth, the leaking organization can look for electronic evidence of forgery. Electronic cross-checking can eliminate obvious fakes, but Associated Whistleblowing Press noted that electronic forgeries can be “almost perfect”. The exact methods used in electronic cross-checking were not specified.

Verification seems to be the area with the least room for improvement. From my interviews, it sounds like some level of verification generally can be achieved if the leaking website involved wants to be sure any given document is not forged. Still, there are some experimental tools for fact-checking and verification. One example is factchecking.it, a platform for crowd-sourcing fact verification. This tool is still in the early stages and may not entirely replace the document verification methods above but it could be a good companion to crowdsourced analysis on leaking websites.

While only some sites verify the authenticity of their releases, some level of redaction is near universal. Even the leaking websites that tend the most towards radical transparency, like Cryptome, generally still remove some names for protection of sources and others. In large part, removal of names and other identifying information seems to be done by hand. This can be horribly inefficient and there is a high risk of missing some information. The second part of the redaction step is metadata removal. Metadata in a document could contain information that would lead back to the source like their name, the time they accessed the document, etc. Some tools are used for this but much of the metadata removal in leaked documents seems to be done by hand. Unless a document is in a rarely used format, there is no good reason for this.
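To make the metadata point concrete, here is a minimal sketch of automated metadata removal for one common format. It assumes the document is a .docx file, which is a zip archive whose author and timestamp fields normally live in docProps/core.xml. The function name strip_core_properties is my own, and this is not the tool any of the websites I interviewed use. It also only covers one metadata location: docProps/app.xml, comments, tracked changes, and embedded images can still identify a source.

```python
import io
import zipfile

CORE_PROPS = "docProps/core.xml"
# An empty core-properties part: no author, no editor, no timestamps.
EMPTY_CORE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<cp:coreProperties xmlns:cp="http://schemas.openxmlformats.org/'
    'package/2006/metadata/core-properties"/>'
).encode("utf-8")
# Fixed timestamp so the rewritten archive does not record when it was made.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)


def strip_core_properties(docx_bytes: bytes) -> bytes:
    """Return a copy of a .docx with docProps/core.xml blanked out.

    Hypothetical helper for illustration; assumes the input is a valid
    zip archive laid out like a .docx file.
    """
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out_buf, "w") as dst:
        for item in src.infolist():
            data = EMPTY_CORE if item.filename == CORE_PROPS else src.read(item)
            # Fresh ZipInfo drops the original per-file timestamps as well.
            info = zipfile.ZipInfo(item.filename, date_time=FIXED_TIME)
            dst.writestr(info, data, compress_type=zipfile.ZIP_DEFLATED)
    return out_buf.getvalue()
```

A cleaned copy should always be opened and checked by hand before release, since a sketch like this cannot know about every place a given format hides identifying details.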

There are tools available for removing metadata from files of many different formats. Additionally, GlobaLeaks, a document submission platform used by some leaking websites, has plans to integrate an existing tool for metadata removal into their platform. While I do not know of any tools currently available for redaction of names and identifying information, one could easily be made and tested.
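As a rough illustration of what a first version of such a redaction tool might look like, the sketch below replaces a supplied list of names, plus anything that looks like an email address, with a placeholder. Everything here is an assumption of mine: the function name redact, the placeholder text, and the simple email pattern. It only catches exact names that a human has already listed, so it would assist hand redaction rather than replace it.

```python
import re

# Deliberately simple email pattern; it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text, names, placeholder="[REDACTED]"):
    """Replace emails and the given names; return (new_text, replacements).

    Hypothetical helper for illustration only.
    """
    # Emails go first so a name inside an address is not half-replaced.
    text, total = EMAIL_RE.subn(lambda m: placeholder, text)
    # Longest names first, so "Jane Doe" is handled before "Jane".
    for name in sorted(names, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        text, count = pattern.subn(lambda m: placeholder, text)
        total += count
    return text, total


sample = "Contact Jane Doe at jane.doe@example.org. Jane approved it."
cleaned, count = redact(sample, ["Jane", "Jane Doe"])
print(cleaned)  # Contact [REDACTED] at [REDACTED]. [REDACTED] approved it.
print(count)    # 3
```

The replacement count is returned so an editor can compare it against what they expected to find, which is one cheap way to test whether the tool is missing information.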

Analysis of documents is best discussed separately for websites focused on fixing wrongdoings and those primarily interested in promoting transparency. In leaking websites focused on promoting transparency, few tools are necessary. Most of the work in this category is formatting for publication or sometimes constructing an analysis of a document based on quotes and points within the document (for which no tools seem to be available). The goal for sites focused on transparency here generally seems to be readability and accessibility.

More analysis is sometimes needed for leaking websites that seek to raise awareness of or correct wrongdoings. Some have used DocumentCloud and similar systems to track documents they are analyzing. Much of the analysis process in these sites is also done by hand, in the form of articles by media partners that place the documents into a narrative. This strategy seems quite effective if the articles are read. Juzne vesti in particular has reported wrongdoings being corrected as soon as a few hours after they post an article.

Current analysis approaches seem to work well for all but the largest leaks. For those, leaking websites have experimented with crowd sourcing, but its usefulness is debatable. Some leaking websites tried crowd sourcing and failed to get meaningful contributions. Many crowd sourcing platforms are filled with spam. A few leaking websites still hope to find a way to use crowd sourcing. There is one notable new tool in the area of crowd sourcing leak analysis, cablegate.awp.is. This tool was set up by Associated Whistleblowing Press to test the viability of crowd sourcing in leak analysis using the WikiLeaks cablegate documents. cablegate.awp.is allows searching, tagging, and writing articles and wiki pages about cables. It also has a timeline to show when cables fitting certain search criteria were created and a graphing tool for visualizing connections between cables. cablegate.awp.is launched less than a week ago, so it is difficult to tell how useful it will be, but it is an exciting tool to watch and test.
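To give a sense of what a search-driven timeline involves, here is a small sketch that counts documents mentioning a search term, grouped by the month they were created. This is not how cablegate.awp.is is implemented, which I have not seen. The record layout (a dict with "created" and "text" keys) and the function name timeline are assumptions made for the example.

```python
from collections import Counter
from datetime import date


def timeline(cables, term):
    """Count cables whose text mentions `term`, grouped by (year, month).

    Hypothetical helper: each cable is assumed to be a dict with a
    `created` date and a `text` string.
    """
    term = term.lower()
    counts = Counter(
        (cable["created"].year, cable["created"].month)
        for cable in cables
        if term in cable["text"].lower()
    )
    # Sorted by (year, month) so the result reads in chronological order.
    return sorted(counts.items())


cables = [
    {"created": date(2009, 3, 2), "text": "Meeting on energy policy"},
    {"created": date(2009, 3, 19), "text": "Energy minister visit"},
    {"created": date(2009, 5, 7), "text": "Trade talks"},
    {"created": date(2010, 1, 11), "text": "ENERGY sector update"},
]
print(timeline(cables, "energy"))
# [((2009, 3), 2), ((2010, 1), 1)]
```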

The current approaches to leaking work for most releases but leaking websites still struggle with very large document sets. Automation and crowd sourced analysis could be crucial to making these large releases safer and more useful. However, these releases still seem to be in the minority. The best application for new tools and leaking techniques may be to make whistleblowing-focused leaking websites easier to create and run with little time or technical expertise. Platforms like GlobaLeaks are starting to make this possible but there are many potential additions and improvements in this area.

Simple, efficient, and useful leaking platforms could not only be used in leaking websites but also improve freedom of information release processes in governments, companies, and other organizations. There, they could both speed up the release process and facilitate proactive instead of reactive disclosure. This could reduce the need for leaking websites and greatly increase both the amount of information available and the usefulness of that information.