Unlocking Historical Digital Text Collections through Advanced AI Methods

Keywords:  Digital Text Collections, Knowledge Graph, NLP-transformers

The National Library of Scotland (NLS) Data Foundry [1] offers a wide range of historical digital collections of textual resources, one of which is the Encyclopaedia Britannica (EB) [2]. In this work, we present the application of advanced AI methods to unlock the full value of this collection.

The first part of our work involved the automated extraction of all EB terms (along with their metadata) across editions. To this end we employed defoe [3], a parallel processing library that enables the analysis of digital historical datasets [4]. We then created the EB Ontology [5] to represent the relations and properties between editions, volumes, pages and terms, and used this ontology along with the extracted EB information to create the first version of the EB Knowledge Graph (EB-KG).
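As an illustration of how extracted terms can be linked to pages, volumes and editions, the following is a minimal sketch that stores the graph as plain subject-predicate-object triples. The namespace, the property names (`hasPart`, `name`, `definition`), the helper functions and all sample values are hypothetical placeholders; they are not taken from the EB Ontology or from the EB itself.

```python
from collections import namedtuple

# Hypothetical namespace and property names, for illustration only;
# the actual EB Ontology vocabulary may differ.
EB = "https://example.org/eb#"

Triple = namedtuple("Triple", ["subject", "predicate", "obj"])


def term_triples(edition, volume, page, term, definition):
    """Return triples linking a term to its page, volume and edition."""
    ed = f"{EB}edition/{edition}"
    vol = f"{ed}/volume/{volume}"
    pg = f"{vol}/page/{page}"
    tm = f"{pg}/term/{term}"
    return [
        Triple(ed, EB + "hasPart", vol),
        Triple(vol, EB + "hasPart", pg),
        Triple(pg, EB + "hasPart", tm),
        Triple(tm, EB + "name", term),
        Triple(tm, EB + "definition", definition),
    ]


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Placeholder records: edition, volume and page numbers are made up.
graph = []
graph += term_triples(1, 1, 15, "ABACUS", "placeholder definition, first record")
graph += term_triples(2, 1, 12, "ABACUS", "placeholder definition, second record")

# Collect every term node named ABACUS, regardless of edition.
abacus_nodes = [t.subject for t in match(graph, predicate=EB + "name", obj="ABACUS")]
```

A production graph would more likely be held in an RDF store and queried with SPARQL; the in-memory list is used here only to keep the sketch self-contained.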

The second part of this work consisted of augmenting the information stored in the EB-KG by applying a further set of advanced AI techniques. More specifically, we used deep transfer learning to perform several Natural Language Processing (NLP) tasks using state-of-the-art NLP-transformer models [7]. These models enable a number of analyses that go significantly beyond simple search and retrieval, such as searching for terms, finding similar terms, visualizing clusters of terms, visualizing the metadata associated with terms, editions and volumes, and fixing OCR errors.
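One common way to find similar terms is to rank them by the cosine similarity of their embedding vectors. The sketch below assumes each term already has such a vector; the three-dimensional vectors are toy values standing in for the output of a transformer model, and the function names are hypothetical. It is not a description of the models or pipeline actually used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, embeddings, top_k=3):
    """Rank all other terms by similarity to the query term."""
    q = embeddings[query]
    scored = [
        (term, cosine(q, vec))
        for term, vec in embeddings.items()
        if term != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would come from a
# transformer model and have hundreds of dimensions.
toy_embeddings = {
    "ABACUS": [0.9, 0.1, 0.0],
    "ARITHMETIC": [0.8, 0.2, 0.1],
    "ANATOMY": [0.0, 0.1, 0.9],
}

ranked = most_similar("ABACUS", toy_embeddings, top_k=2)
```

With these toy values, ARITHMETIC ranks above ANATOMY for the query ABACUS, since its vector points in nearly the same direction.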

During the final part of this work we created frances [8], a new web application for interacting with the EB-KG and displaying the desired information. frances also allows us to perform further text-mining analyses on the EB-KG, return the results to users, and visualize them.

This work enables users to discover insights from the Encyclopaedia Britannica more quickly by making it searchable and analyzable, without being distracted by technology and middleware details. It also enables us to run complex text analysis queries automatically and at scale, and to visualize the evolution of terms across editions. Historians (and other communities) can therefore conduct their research easily and obtain results faster, while all the large-scale NLP and text-mining complexity is handled automatically in the background by frances. Furthermore, we have demonstrated the benefits of employing advanced AI techniques (such as NLP-transformers and Knowledge Graphs) to understand and extract knowledge from digital text collections.

Although this work uses the Encyclopaedia Britannica, the approach could be extended to handle, mine and analyse other large digital collections effectively, with minimal changes to incorporate the necessary information into the Knowledge Graph. We could also reuse information about collections from other existing Knowledge Graphs.

[1] https://data.nls.uk/ 

[2] https://data.nls.uk/data/digitised-collections/encyclopaedia-britannica/ 

[3] https://github.com/francesNLP/defoe 

[4] https://github.com/francesNLP/defoe/tree/master/defoe/nlsArticles/queries 

[5] http://w3id.org/eb/ 

[7] https://huggingface.co/docs/transformers/index 

[8] https://github.com/francesNLP/frances 


1 Frances Wright (September 6, 1795 – December 13, 1852) was a Scottish-born lecturer, writer, freethinker, feminist, utopian socialist, abolitionist, social reformer, and Epicurean philosopher.