Session 14

Friday 14:00 - 15:30

High Tor 3

Chair: Jamie McLaughlin

Unlocking Historical Digital Text Collections through Advanced AI methods

  • Rosa Filgueira

University of St Andrews

Keywords: Digital Text Collections, Knowledge Graph, NLP-transformers

The National Library of Scotland (NLS) Data Foundry [1] offers a wide range of historical digital collections of textual resources, one of which is the Encyclopaedia Britannica (EB) [2]. In this work, we present the application of advanced AI methods to unlock the full value of this collection.

The first part of our work involved the automated extraction of all EB terms (along with their metadata) across editions. To this end, we employed defoe [3], a parallel processing library that enables the analysis of digital historical datasets [4]. We then created the EB Ontology [5] to represent the relations and properties between different editions, volumes, pages and terms, and used this ontology, along with the extracted EB information, to create the first version of the EB Knowledge Graph (EB-KG).
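The step from extracted terms to a graph can be illustrated with a minimal sketch. It assumes a hypothetical record layout for the extraction output and hypothetical predicate names (hasVolume, hasPage, hasTerm, name); these are illustrative only and are not taken from the EB Ontology or from defoe.

```python
# Hypothetical records standing in for the output of the extraction step.
records = [
    {"edition": 1, "volume": 1, "page": 12, "term": "ABACUS"},
    {"edition": 1, "volume": 1, "page": 13, "term": "ABBEY"},
    {"edition": 2, "volume": 1, "page": 15, "term": "ABACUS"},
]


def build_triples(records):
    """Return a set of (subject, predicate, object) triples.

    Predicate names are illustrative assumptions, not EB Ontology terms.
    """
    triples = set()
    for r in records:
        edition = f"edition/{r['edition']}"
        volume = f"{edition}/volume/{r['volume']}"
        page = f"{volume}/page/{r['page']}"
        term = f"{page}/term/{r['term']}"
        triples.add((edition, "hasVolume", volume))
        triples.add((volume, "hasPage", page))
        triples.add((page, "hasTerm", term))
        triples.add((term, "name", r["term"]))
    return triples


def editions_containing(triples, name):
    """Return the sorted edition identifiers in which a term name appears."""
    found = set()
    for subject, predicate, obj in triples:
        if predicate == "name" and obj == name:
            # The edition identifier is the prefix of the term identifier.
            found.add(subject.split("/volume/")[0])
    return sorted(found)


triples = build_triples(records)
print(editions_containing(triples, "ABACUS"))
```

Storing the hierarchy as triples means that a question such as "in which editions does this term appear?" becomes a simple traversal rather than a search over page text.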

The second part of this work consisted of augmenting the information stored in the EB-KG by applying a different set of advanced AI techniques. More specifically, we used Deep Transfer Learning to perform several Natural Language Processing (NLP) analysis tasks using state-of-the-art NLP-transformer models [7]. These models have enabled a number of advanced analyses that go significantly beyond simple search and retrieval. Examples include searching for terms, finding similar terms, visualizing clusters of terms, visualizing the metadata associated with all the terms, editions and volumes, and fixing OCR errors.
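One of the analyses listed, finding similar terms, is commonly approached by comparing embedding vectors. The sketch below assumes that each term's text has already been encoded as a vector by a transformer model; the vectors are made-up toy values, and cosine similarity is an assumed choice rather than a confirmed detail of this work.

```python
import math

# Toy vectors standing in for transformer embeddings of term descriptions.
embeddings = {
    "ABACUS": [0.9, 0.1, 0.0],
    "ARITHMETIC": [0.8, 0.2, 0.1],
    "ABBEY": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term, embeddings, top_k=1):
    """Return the top_k (term, score) pairs most similar to the given term."""
    query = embeddings[term]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != term
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


print(most_similar("ABACUS", embeddings))
```

With real embeddings the same ranking function would apply unchanged; only the source of the vectors differs.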

In the final part of this work, we created frances [8], a new web application for interacting with the EB-KG and displaying the desired information. frances also allows us to perform further text-mining analyses on the EB-KG, returning the results to users and visualizing them.

This work enables users to accelerate the process of discovering insights from the Encyclopaedia Britannica, making it searchable and analyzable without users being distracted by the technology and middleware details. It also enables us to run complex text analysis queries automatically and at scale, and to visualize the evolution of terms across editions. Historians (and other communities) can therefore conduct their research easily and obtain results faster, while all the large-scale NLP and text-mining complexity is handled automatically in the background by frances. Furthermore, we have demonstrated the benefits of employing advanced AI techniques (such as NLP-transformers and Knowledge Graphs) to understand and extract knowledge from digital text collections.

Although we have used the Encyclopaedia Britannica for this work, the approach could be extended to handle, mine and analyse other large digital collections effectively, with minimal changes to incorporate the necessary information into the Knowledge Graph. We could also re-use information about collections from other existing Knowledge Graphs.

[1] https://data.nls.uk/ 

[2] https://data.nls.uk/data/digitised-collections/encyclopaedia-britannica/ 

[3] https://github.com/francesNLP/defoe 

[4] https://github.com/francesNLP/defoe/tree/master/defoe/nlsArticles/queries 

[5] http://w3id.org/eb/ 

[7] https://huggingface.co/docs/transformers/index 

[8] https://github.com/francesNLP/frances 


1 Frances Wright (September 6, 1795 – December 13, 1852) was a Scottish-born lecturer, writer, freethinker, feminist, utopian socialist, abolitionist, social reformer, and Epicurean philosopher.

Understanding Uncertainty in Crowdsourced Digital History Projects: The Operation War Diary

  • Andrea Kocsis

The National Archives, UK

Key words: crowdsourcing, uncertainty, history

Abstract:

The paper aims to understand the different types of uncertainty in crowdsourced digital history projects and how to address them at multiple stages of crowdsourcing. To this end, the paper examines the Operation War Diary (OWD) to differentiate between the occurrences of uncertainty across the life span of the project, from the creation of the documents, through their annotation by volunteers, to their visualisation and user interaction. History as a discipline acknowledges its limits in interpreting sources. Its approaches tend to agree that the interpretation provided by historians, despite every effort to stay true to the sources and their context, is one narrative chosen from many. When digital methods come into the picture, there is a need for a framework for managing and analysing digital data that enables the historian to handle the uncertainty embedded in the sources, rather than dismissing it and working only with the certain data, which leaves false negatives containing useful information out of the database.


The aim is to find a balance between what MacEachren (1992) called precision and accuracy, or what Earl Babbie (1975) named reliability and validity. Both taxonomies distinguish between two qualities of research, which determine whether the research is methodologically sound and/or reflects reality. Digital methods and automation tend to create an imbalance between precision (reliability) and accuracy (validity): they increase the former at the expense of the latter. The question becomes more complicated when the digital history project involves crowdsourcing, as this adds a further step that carries the possibility of human or technical error and misinterpretation.
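The distinction between reliability and validity can be made concrete with a small sketch. It shows one possible way to operationalise the two qualities for crowdsourced tags and is not the model proposed in the paper: reliability is approximated by volunteer agreement, and validity by whether the majority answer matches a reference value. All page identifiers, tags and reference values are hypothetical.

```python
from collections import Counter

# Hypothetical volunteer tags for a place name on three diary pages.
annotations = {
    "page-1": ["Ypres", "Ypres", "Ypres"],
    "page-2": ["Arras", "Arras", "Arras"],
    "page-3": ["Albert", "Amiens", "Albert"],
}

# Hypothetical reference values, e.g. established by an expert check.
reference = {"page-1": "Ypres", "page-2": "Amiens", "page-3": "Albert"}


def agreement(labels):
    """Share of volunteers who gave the most common label (reliability proxy)."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def majority(labels):
    """Most common label among the volunteers."""
    return Counter(labels).most_common(1)[0][0]


for page, labels in annotations.items():
    valid = majority(labels) == reference[page]
    print(page, round(agreement(labels), 2), valid)
```

In this toy data, page-2 has full agreement yet disagrees with the reference value, while page-3 has lower agreement but a matching majority, so high reliability does not by itself guarantee validity.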


The paper first reviews the taxonomies and uncertainty models in the literature that are relevant to the OWD, then aims to provide its own model, and finally offers recommendations for producing crowdsourced historical projects that are both reliable and valid.