Data Mining with Criminal Intent: Using Zotero and TAPoR on the Old Bailey Proceedings


This project illustrates how the tools of digital humanities can be used to wrest new knowledge from one of the largest humanities data sets currently available: the Old Bailey Online.




Digging into Data Challenge, Jisc, NEH, SSHRC


University of Sheffield
George Mason University
McMaster University
University of Alberta
University of Hertfordshire
University of Western Ontario


British Isles, historical records, linked data, software development, text and image analysis, text data mining


API, Natural Language Processing

Project Website

Project Description

Over the past few decades scholars have increasingly used court records to illuminate historical themes in novel ways. The published Proceedings of the Old Bailey have been a fertile source for scholars working in these varied traditions, allowing them to apply both qualitative and quantitative approaches to the evolution of the criminal justice system, and of interpersonal relationships and human behaviour more generally. But although the 120 million words of court transcripts published in the Proceedings are now available online in a structured and searchable form, historians and humanist scholars continue to use these legal records in an essentially iterative and traditional manner, and largely fail to take full advantage of the variety of forms of analysis that the Proceedings' online format allows. At the same time, Zotero has provided a popular environment for managing online scholarship that allows humanists to collect, index and manipulate large amounts of text, while TAPoR Tools has piloted and tested a range of facilities for the quantitative analysis of text. By bringing together in one seamless online environment the text of the Proceedings, the functionality of Zotero and the tools created by TAPoR, this project will allow scholars to take new approaches to this old source.

This project will create an intellectual exemplar for the role of data mining in an important historical discipline, the history of crime, and illustrate how the fundamental conundrums of historical research on large bodies of text that have dogged humanist research over the last forty years might be addressed. By allowing the analysis and statistical representation of the types of language used in court and how they changed over time, and by comparing these 'data mined' patterns to those found in tagged data, “With Criminal Intent” will achieve three things. First, it will create a whole new way of charting changes in crime reporting and prosecution; second, it will benchmark a new methodology for the consistent discovery of related descriptions; and finally, it will demonstrate a working model of how large corpora can be handled online and in a distributed fashion. The significance of this project therefore runs beyond the history of crime to historical scholarship more broadly, and to scholarly engagement with large corpora.
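As a rough illustration of what charting language change over time can involve, the following minimal Python sketch computes the relative frequency of a single word per decade. It is not the project's method. The sample trials, the `tokenize` helper and the function name `relative_frequency_by_decade` are all hypothetical, and the sample texts are invented rather than quoted from the Proceedings.

```python
import re
from collections import Counter

# Hypothetical sample: (year, transcript text) pairs invented for illustration.
# They are not quotations from the Old Bailey Proceedings.
SAMPLE_TRIALS = [
    (1745, "the prisoner took the watch and ran"),
    (1748, "he stole a silver watch from the shop"),
    (1832, "the prisoner was seen near the shop"),
    (1836, "the constable found the stolen handkerchief"),
]


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency_by_decade(trials, term):
    """Return {decade: occurrences of term / total tokens in that decade}."""
    term_counts = Counter()
    total_counts = Counter()
    for year, text in trials:
        decade = (year // 10) * 10
        tokens = tokenize(text)
        total_counts[decade] += len(tokens)
        term_counts[decade] += tokens.count(term)
    # Skip decades with no tokens to avoid dividing by zero.
    return {
        decade: term_counts[decade] / total_counts[decade]
        for decade in sorted(total_counts)
        if total_counts[decade]
    }


freq = relative_frequency_by_decade(SAMPLE_TRIALS, "watch")
for decade, value in freq.items():
    print(decade, round(value, 3))
```

Normalising by the total number of tokens in each decade matters here because the volume of published text varies over time, so raw counts alone could be misleading.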

Aims and Objectives

This project aims to demonstrate that greater historical rigour can be achieved, and new insights gained, through the application of data mining and statistical analysis to large bodies of primary sources such as the Proceedings of the Old Bailey. Given the availability and power of modern text mining techniques, and the fact that the Proceedings have already been optimized for use with these techniques, we believe that by building on the success of previous work this project will change the research paradigm. In the process, it will allow end users, both scholars and students, to experience the three separate components of this project (the Proceedings, Zotero and TAPoR tools) as a single seamless resource. To achieve this aim, we need to reach three specific objectives:

  • The creation of Newgate Commons: a new form of interface for the Old Bailey Proceedings that supplements the current search interfaces. The Newgate Commons will allow scholars to use mining and clustering techniques to identify, collect and work with sets of relevant trials and related texts, and to extract them for further study with other tools. The interface will also make it easy for users to train machine learning 'agents' to help identify patterns in the text (and in the underlying accounts of prosecutions and punishments) that are of interest to the researcher.
  • Zotero Virtual Collections: the modification of the Zotero bibliographic reference management tool so that it can be used to manage the collections of documents created within the Newgate Commons and call upon full texts only when needed.
  • Voyeur Analytics: the project will connect Zotero to analytical tools designed by the TAPoR project to work on large collections, including the Voyeur toolset for analysis and visualization. The emphasis throughout will be on extending existing tools as needed to allow researchers to navigate between them seamlessly and to use Zotero as a hub from which to manage large study collections. In the process we will create the potential to analyze and visualize change over time in a way that goes beyond current historical methodologies, illuminating the relationship between text and event in new ways.
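The idea of a collection that holds references to trials and calls upon full texts only when needed can be sketched as lazy loading. The Python example below is a minimal illustration under stated assumptions, not Zotero's actual data model or API. The class `VirtualCollection`, the `fetch_text` callback, the trial identifiers and the stand-in fetcher are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical sketch: none of these names come from Zotero or the Old Bailey API.


@dataclass
class VirtualCollection:
    """Holds trial identifiers and metadata; fetches full text lazily."""

    fetch_text: Callable[[str], str]  # assumed callback that retrieves a full text
    items: Dict[str, dict] = field(default_factory=dict)
    _cache: Dict[str, str] = field(default_factory=dict)

    def add(self, trial_id: str, **metadata) -> None:
        """Record a trial by identifier without retrieving its text."""
        self.items[trial_id] = metadata

    def full_text(self, trial_id: str) -> str:
        """Retrieve the text on first request and reuse it afterwards."""
        if trial_id not in self.items:
            raise KeyError(f"{trial_id} is not in this collection")
        if trial_id not in self._cache:
            self._cache[trial_id] = self.fetch_text(trial_id)
        return self._cache[trial_id]


# Stand-in fetcher that records each call; a real one might query a remote archive.
calls = []


def fake_fetch(trial_id: str) -> str:
    calls.append(trial_id)
    return f"full text of {trial_id}"


collection = VirtualCollection(fetch_text=fake_fetch)
collection.add("trial-001", year=1745, offence="theft")
collection.add("trial-002", year=1832, offence="theft")

text = collection.full_text("trial-001")
collection.full_text("trial-001")  # second request is served from the cache
print(text, calls)
```

Keeping only identifiers and metadata in the collection means a researcher could assemble a large study set cheaply, with the cost of retrieving text paid only for the trials actually opened or passed to an analysis tool.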

Project Team

  • Tim Hitchcock (PI – University of Hertfordshire)
  • Daniel Cohen (PI – George Mason University)
  • Geoffrey Rockwell (PI – University of Alberta)
  • William Turkel (University of Western Ontario)
  • Stéfan Sinclair (McMaster University)
  • Robert Shoemaker (University of Sheffield)
  • Jamie McLaughlin (Digital Humanities Developer – University of Sheffield)