Session 4

Thursday 16:00 - 17:30

High Tor 2

Chair: Michael Pidd

Modelling Eighteenth-Century Epistolarity: Unsupervised Classification of the Voltaire Correspondence

Glenn Roe, Clovis Gladstone

Letters, by their very nature, should be almost ideal candidates for unsupervised classification using the Latent Dirichlet Allocation (LDA) topic modelling algorithm. Specifically, given the size of the documents and the wide array of topics discussed therein, LDA should be able to provide a general overview of the various topics discussed in the 22,000-letter correspondence of Voltaire, a collection that has yet to be fully indexed. However, due perhaps to the formal nature of eighteenth-century letter-writing, which in French includes a preponderance of formules de politesse (formulas of politeness) and other formulaic expressions, the topics detected are often too general (i.e., concerned with the act of letter-writing itself) or, conversely, too limited in terms of overall topic distribution. Indeed, the trade-off between large general topics (e.g., ‘health and wellbeing’, ‘politics’, ‘religion’) and those that are more content-specific yet sparse (e.g., ‘the Calas affair’ or the ‘Seven Years’ War’) remains a challenge for LDA.

To address these issues, we aim to evaluate LDA’s fitness-for-task by comparing the model output of several lesser-known (at least in terms of digital humanities coverage) unsupervised algorithms in order to gauge whether these approaches might be better suited to capturing the complexity of eighteenth-century epistolary collections. These algorithms include Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), Probabilistic LSA (pLSA), and an implementation of LDA and the word2vec algorithm. By modelling Voltaire’s ‘epistolarity’ we aim, on the one hand, to gain a better understanding of the discursive makeup of his massive correspondence, its most important topics, and their distribution and evolution over a span of more than 70 years. On the other hand, by bringing different algorithms into productive conversation with our texts and methods, we offer a necessary critique of LDA’s prominence as the de facto unsupervised learning method in the digital humanities today.
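As a rough illustration of one of the alternatives named above, the following is a minimal sketch of NMF with multiplicative updates, written in plain Python. It is not the authors' implementation: the letter-by-term count matrix `V`, the vocabulary `VOCAB`, and all function names are hypothetical toy values chosen only to show how a small number of non-negative "topics" can be factored out of term counts.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]

def squared_error(v: Matrix, w: Matrix, h: Matrix) -> float:
    """Squared Frobenius distance between v and the product w @ h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v: Matrix, k: int, iterations: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Factor v (documents x terms) into w (documents x topics) and
    h (topics x terms) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def top_terms(h: Matrix, vocab: List[str], n: int = 2) -> List[List[str]]:
    """For each topic (row of h), list the n highest-weighted terms."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]

# Hypothetical toy data: four "letters" counted over four terms.
VOCAB = ["health", "fever", "king", "war"]
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 3, 2],
     [0, 0, 2, 3]]

if __name__ == "__main__":
    w, h = nmf(V, k=2)
    print(top_terms(h, VOCAB))
    print(squared_error(V, w, h))
```

On a real corpus one would normally start from a weighted (for example tf-idf) matrix and use an established library rather than this hand-rolled loop; the sketch only makes the factorization idea concrete.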

Visualizing Literary Style: The Case of Milton, Bunyan, and the Bible

Harvey Quamen

University of Alberta

This paper tracks literary influence and shared literary styles using data visualizations. The case study here is a comparison of the surprising literary similarities between two seventeenth-century texts, John Milton's Paradise Lost (1667) and John Bunyan's The Pilgrim's Progress (1678), texts composed by two authors at opposite ends of the educational and literary spectrum. I conclude that it is possible to represent aspects of literary style visually, and that the resulting graphs can point us to new understandings of both style and literary influence.

Both texts in my case study are religious: while Milton's details the Old Testament story of Genesis, Bunyan's is a New Testament allegory of Christian faith. Despite their topical similarity, there is no substantive evidence that Bunyan read Milton's poem (indeed, Bunyan was in prison when Milton's poem was published). My hypothesis is that both writers are responding independently to available English editions of the Bible. But to what degree? And how might we detect stylistic signals that lie underneath the obvious similarities of theme and diction?

My method here is to understand literary influence not through the word choices made by these writers but instead through the rhythms of grammatical sentence structures. Part-of-speech tagging allows us subsequently to compare grammatical structures via sequence alignment techniques drawn from biological genome studies. As with DNA, we can find common sequences shared by different sentences and we can map those relative similarities on a scatter plot where each dot represents a sentence and where greater grammatical similarity is signified by dots that are closer together. A second technique detects the placement of verbs within a sentence (a signature feature of what has been called “periodic” or “grand” style), and we can easily demonstrate that, while both Bunyan and Milton have borrowed from the Bible, Bunyan's style is more faithful to the original whereas Milton's style is more periodic.
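The two techniques described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which alignment algorithm or scoring scheme is used, so the sketch assumes Needleman-Wunsch global alignment with simple match, mismatch, and gap scores, and the tag sequences `s1`, `s2`, and `s3` are invented examples in a Universal-POS-like tag set, not sentences from Milton, Bunyan, or the Bible.

```python
from typing import Optional, Sequence

def alignment_score(a: Sequence[str], b: Sequence[str],
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score of two POS-tag sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Alignment score divided by the longer length: 1.0 for identical
    sequences, lower (possibly negative) for dissimilar ones."""
    longest = max(len(a), len(b))
    return alignment_score(a, b) / longest if longest else 1.0

def mean_verb_position(tags: Sequence[str], verb_tag: str = "VERB") -> Optional[float]:
    """Average relative position of verbs (0.0 = first token, 1.0 = last).
    Returns None when the sentence has no verb or fewer than two tokens."""
    if len(tags) < 2:
        return None
    positions = [i / (len(tags) - 1) for i, t in enumerate(tags) if t == verb_tag]
    return sum(positions) / len(positions) if positions else None

# Invented tag sequences, for illustration only.
s1 = ["DET", "NOUN", "VERB", "DET", "NOUN"]
s2 = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]
s3 = ["ADP", "DET", "NOUN", "PRON", "VERB"]

if __name__ == "__main__":
    print(similarity(s1, s2), similarity(s1, s3))
    print(mean_verb_position(s1), mean_verb_position(s3))
```

A matrix of pairwise `similarity` values could then be handed to a dimensionality-reduction step to place sentences on a scatter plot, and a `mean_verb_position` close to 1.0 would flag sentences that hold their verbs until the end, as periodic sentences tend to do.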

Developing the Oxford English Dictionary as a toolkit for DH research

James McCracken

Oxford University Press

We present some approaches to developing the OED as a resource to support the large-scale study of historical documents in English. The OED has had a fitful relationship with Digital Humanities research: although many recognize that in principle the OED contains information useful to the analysis of historical text, in practice this information has been difficult to parse and extract as complete and consistent data.

Nevertheless, in recent years the OED has worked with a number of DH projects to build bespoke data sets to meet particular needs. By reviewing these projects, we can develop a more general model of OED services for DH. These include variant-spelling data for normalizing and lemmatizing historical and non-standard text, sense inventories for disambiguation, and query-term expansion for concept-based corpus search. We look at the practicalities of turning dictionary content into simple API functions that can plug into the wider infrastructure of textual research.
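To make the idea of "simple API functions" concrete, here is a minimal sketch of what two such services might look like from a caller's point of view. Everything in it is an assumption for illustration: the function names `normalize` and `expand_query` are hypothetical, the tiny `VARIANTS` table is hand-made rather than drawn from OED data, and the expansion covers spelling variants only, whereas concept-based expansion would need a thesaurus-like mapping as well.

```python
from typing import Dict, List

# Hypothetical toy table mapping variant spellings to a modern headword.
# Hand-made for illustration; NOT taken from OED data.
VARIANTS: Dict[str, str] = {
    "vertue": "virtue",
    "vertu": "virtue",
    "loue": "love",
    "haue": "have",
}

def normalize(token: str, variants: Dict[str, str] = VARIANTS) -> str:
    """Map a historical or non-standard spelling to its headword.
    Unknown tokens are returned lower-cased and otherwise unchanged."""
    key = token.lower()
    return variants.get(key, key)

def expand_query(headword: str, variants: Dict[str, str] = VARIANTS) -> List[str]:
    """Return the headword plus every known variant spelling, sorted,
    so that a corpus search can match all of them."""
    forms = {headword} | {v for v, h in variants.items() if h == headword}
    return sorted(forms)

if __name__ == "__main__":
    print([normalize(t) for t in "I haue no Vertue".split()])
    print(expand_query("virtue"))
```

Keeping each service as a small pure function over a lookup table is one way to let the same data back both text normalization and search expansion.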

We also identify the limitations of this approach: areas where historical lexicography, as traditionally conceived, is not well aligned with DH needs, and where we therefore need to build out data in new directions. Partnerships within the DH community are helping us to tackle these challenges. As just one resource among many for DH scholars, the OED can make itself useful as a suite of services focussed on ease of use, problem-solving, reliability, and interoperability.

DHC 2018