Improving Document Processing in Leaking Websites

I am examining document processing in leaking websites for my Intro to Civic Media project. Document processing is a particularly important step because it helps determine both the safety and the effectiveness of a leak. Unfortunately, it is also the most difficult step, as it requires going through thousands or millions of documents to read, redact, validate, or analyze them. It also varies the most, since different organizations might follow different sub-steps or have different goals. Some release the documents without any processing, while others redact and validate the documents and then write articles to frame them.

What do different leaking organizations do in the document processing step and what are their goals? What tools do they use to make processing documents easier? What tools are needed to make leak processing easier? Who is involved in document processing and could or should it be more participatory? How well do existing practices work? Is there anything organizations would like to do or could do while processing documents that might make the leak safer and more effective? To answer these questions, I am interviewing people involved in leaking organizations and writing case studies of leaks and the tools used in them. I hope to synthesize these findings into my final paper.

I have created a list of questions to ask people involved in leaking organizations. Not all of them will apply to every situation, and some might be answered as part of earlier questions. I also plan to customize the questions for each organization or document set. Additionally, I have a list of organizations and people to contact for an interview. I have started contacting people and hope to interview as many as possible over the next few weeks. Interviewees can be anonymous, pseudonymous, or completely identifiable. Ideally, I will post transcripts or recordings of interviews online. If you are involved in a leaking organization and are interested in being interviewed, please email me at

To track my research, I have made a wiki called LeakWiki. LeakWiki is a resource for making public as much information as possible about leaking websites, their successes, their flaws, and related topics. On LeakWiki, I will post case studies of organizations and document sets, as well as profiles of tools used by leaking organizations. These case studies and tool profiles will be based on interviews and other research I conduct. I might interview some of the creators of commonly used tools as well. For this project, I will focus mostly on the document processing step and the tools used for leak processing. Later, I will expand my case studies, add more tool profiles, and discuss other topics, such as policies and legal cases, on LeakWiki. Anyone can contribute to LeakWiki and should feel free to do so.

From my interviews and LeakWiki, I hope to better understand how document processing works, who is involved, what tools are used, and how it can be improved. At the end of the semester, I will publicly post a paper on my findings and my ideas for improving document processing in the future. Afterward, I will expand LeakWiki and do the same for other parts of the leaking process.

Please let me know through comments or email if you have any suggestions or ideas.