We develop tools and online services which use data mining and visualisation techniques to facilitate research into large text- and image-based datasets.

Text and image data mining is the process of using computer algorithms to reveal new patterns and trends across large datasets. Data visualisation is the process of displaying the algorithm’s results in such a way that the patterns and trends are self-evident and can be explored intuitively. A large dataset could be 15 billion words of newspaper texts or a collection of extremely high resolution manuscript images. Some of the data mining and visualisation techniques which we use are:

  • Automated entity recognition, whereby a combination of dictionaries, grammars and pattern recognition techniques is used to automatically tag entities such as people’s names, place names, dates, terminology, concepts and different languages. Its advantage lies in being able to tag entities within very large datasets much more quickly than a human can (though not necessarily with the same level of accuracy) and in enabling semantic search methods such as “show me all the women who are mentioned in the nineteenth century”.
  • Network analysis, in which relationships between objects such as people, places, events and transactions are visualised graphically. Its key value is in showing relationships and other phenomena which are not made explicit or self-evident within the primary sources themselves. For example, the economy of toxicants and intoxication in the eighteenth century, from importer to retailer to consumer to legislator, can be tracked through unrelated documents.
  • Variant analysis in which multiple versions of the same text are compared in order to identify differences in vocabulary, grammar and spelling. Its key value is in enabling scholars to understand how texts which have been transmitted via a scribal/written or oral tradition relate to each other or to understand the evolution of a text which has been revised multiple times. Variant analysis techniques have a wider application in fields such as plagiarism detection.
  • Cluster analysis in which a document or collection is represented as a set of groups, with each group reflecting a particular set of shared characteristics. For example, the use of clustering to represent a linguistic analysis of Chaucer’s manuscripts of The Canterbury Tales might show that the manuscript Hg has more spellings occurring in group A than group B and that it therefore might have a historical relationship with the other manuscripts which fall within group A.
  • GIS (Geographic Information System), in which data is plotted onto a map as points or choropleths (shaded or patterned regions). Different types of GIS include: Historical GIS, whereby a collection of documents or multiple datasets is plotted onto a historical map (often using an underlying map engine such as Google Maps) in order to reveal spatial patterns and trends; and Perceptual Dialectology, whereby end-users listen to audio recordings of dialectal speech and then identify the geographical region where they believe the dialect is spoken.
  • Image analysis, in which computational algorithms are applied to one or more images to identify and compare attributes such as colour, shape and size against others in a collection or an exemplar. The Digging into Image Data project used such techniques to explore the question of authorship across diverse collections of images: by comparing the surface areas of the Great Lakes across maps and the colour schemes of quilts, and by building digital footprints of individual scribes.
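To make these techniques concrete, the dictionary- and pattern-based tagging behind automated entity recognition can be sketched in a few lines of Python. Everything here (the gazetteer entries, the sample sentence) is invented for illustration; real systems use far larger dictionaries, grammars and statistical models.

```python
import re

# Toy gazetteers: tiny dictionaries standing in for real name authorities.
PEOPLE = {"Ada Lovelace", "Charles Dickens"}
PLACES = {"London", "Manchester"}

# A simple pattern standing in for a date grammar: four-digit years.
YEAR = re.compile(r"\b(1[0-9]{3})\b")

def tag_entities(text):
    """Return (entity, type) pairs found by dictionary and pattern lookup."""
    tags = []
    for name in PEOPLE:
        if name in text:
            tags.append((name, "PERSON"))
    for place in PLACES:
        if place in text:
            tags.append((place, "PLACE"))
    for match in YEAR.finditer(text):
        tags.append((match.group(1), "DATE"))
    return tags

sentence = "Charles Dickens gave a reading in Manchester in 1852."
print(tag_entities(sentence))
# → [('Charles Dickens', 'PERSON'), ('Manchester', 'PLACE'), ('1852', 'DATE')]
```

Once tagged this way, a corpus can answer semantic queries ("all the people mentioned alongside a given place") rather than plain keyword searches.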
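The network analysis described above rests on a simple data structure: an edge list built from co-occurrences across documents. A minimal sketch, with invented actor names standing in for people extracted from primary sources:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records: each document mentions a set of actors in the trade.
documents = [
    {"importer Smith", "retailer Jones"},
    {"retailer Jones", "consumer Brown"},
    {"consumer Brown", "legislator Grey", "retailer Jones"},
]

# Undirected co-occurrence network: an edge links two actors
# mentioned in the same document; the weight counts shared documents.
edges = defaultdict(int)
for mentions in documents:
    for a, b in combinations(sorted(mentions), 2):
        edges[(a, b)] += 1

# Degree (number of distinct neighbours) highlights hubs in the network —
# relationships never stated explicitly in any single document.
degree = defaultdict(int)
for (a, b) in edges:
    degree[a] += 1
    degree[b] += 1

print(max(degree, key=degree.get))  # → retailer Jones
```

Feeding such an edge list into a graph layout tool is what turns scattered documents into a visible web of relationships.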
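Variant analysis is, at heart, sequence comparison. Python's standard-library difflib can align two witnesses of the same line word by word and report where they diverge; the two readings below are illustrative, not taken from any particular edition:

```python
import difflib

# Two hypothetical witnesses of the same line of verse.
witness_a = "whan that aprill with his shoures soote".split()
witness_b = "whan that aprille with hise schoures sote".split()

# SequenceMatcher aligns the two word sequences and reports edit operations.
matcher = difflib.SequenceMatcher(None, witness_a, witness_b)
variants = []
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "replace":
        variants.append((witness_a[i1:i2], witness_b[j1:j2]))

print(variants)
```

Tabulating such variant pairs across many witnesses is what lets scholars reconstruct how copies relate to one another; the same alignment machinery underlies plagiarism detection.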
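The cluster analysis of spellings can be sketched as profile comparison: represent each manuscript by its counts of variant spellings, then group manuscripts whose profiles overlap most. The sigla below echo the Canterbury Tales example, but the counts are invented:

```python
# Hypothetical spelling profiles: counts of variant spellings per manuscript.
manuscripts = {
    "Hg": {"whan": 90, "whanne": 10},
    "El": {"whan": 85, "whanne": 15},
    "Cp": {"whan": 20, "whanne": 80},
}

def similarity(p, q):
    """Shared-spelling overlap between two frequency profiles."""
    return sum(min(p.get(w, 0), q.get(w, 0)) for w in set(p) | set(q))

# Group Hg with whichever other manuscript it most resembles — the seed
# of an agglomerative clustering over the whole tradition.
others = {name: prof for name, prof in manuscripts.items() if name != "Hg"}
closest = max(others, key=lambda name: similarity(manuscripts["Hg"], others[name]))
print(closest)  # → El
```

Repeating the merge step until only a few groups remain yields the clusters (group A, group B, and so on) from which historical relationships are inferred.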
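Behind a choropleth is a simple aggregation: count the records falling in each region, then normalise the counts into shading intensities for the map engine to render. A minimal sketch with invented georeferenced records:

```python
from collections import Counter

# Hypothetical georeferenced records: (document id, county).
records = [
    ("doc1", "Yorkshire"), ("doc2", "Yorkshire"),
    ("doc3", "Lancashire"), ("doc4", "Kent"), ("doc5", "Yorkshire"),
]

# A choropleth shades each region by a value; here, documents per county.
counts = Counter(county for _, county in records)

# Normalise to 0–1 shading intensities for a map renderer to apply.
peak = max(counts.values())
shading = {county: n / peak for county, n in counts.items()}
print(shading)
```

The map engine (Google Maps or a historical base map) then fills each region's polygon with a colour proportional to its intensity, making spatial patterns visible at a glance.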
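Image analysis of the kind described above often reduces an image to summary attributes and compares those. As an illustrative sketch (the tiny "quilts" below are hand-made pixel grids, not real data), comparing mean colour shows how two red quilts are measurably closer to each other than to a blue one:

```python
# Hypothetical 2x2 "images" as rows of (R, G, B) pixels.
quilt_a = [[(200, 30, 30), (200, 40, 30)], [(190, 30, 35), (210, 25, 30)]]
quilt_b = [[(195, 35, 30), (205, 30, 25)], [(200, 28, 32), (198, 33, 31)]]
quilt_c = [[(30, 30, 200), (25, 40, 210)], [(35, 30, 195), (28, 25, 205)]]

def mean_colour(image):
    """Average (R, G, B) over all pixels — one crude summary attribute."""
    pixels = [px for row in image for px in row]
    n = len(pixels)
    return tuple(sum(px[c] for px in pixels) / n for c in range(3))

def colour_distance(a, b):
    """Euclidean distance between mean colours of two images."""
    ma, mb = mean_colour(a), mean_colour(b)
    return sum((x - y) ** 2 for x, y in zip(ma, mb)) ** 0.5

# The two red quilts sit far closer together than either does to the blue one.
print(colour_distance(quilt_a, quilt_b) < colour_distance(quilt_a, quilt_c))  # → True
```

Real authorship studies combine many such attributes (shape, texture, stroke features) rather than colour alone, but the compare-summaries-across-a-collection pattern is the same.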