How not to read texts: giving context to big data

Computational techniques enable humans to seek out patterns in collections of texts that exceed what one human can read. This permits the identification of textual and linguistic phenomena that may otherwise defy human recognition. Doing so first requires texts in a suitable digital format, texts that are “machine readable”. However, the use of the verb “read” to describe the discrete activities of human and machine can mask considerable differences between the two audiences’ needs.

Whether output as keywords-in-context or lists of associated terms, or mediated by visualisations, many of the mechanisms that make it possible for humans to explore large language datasets simultaneously limit engagement with the words’ contexts. If words are known by the company they keep, computational resources (and perhaps the user’s patience) set unseen boundaries around the company that can be interrogated.
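
By way of illustration only (this is a minimal sketch, not the project's own pipeline), a keyword-in-context routine makes the point concrete: the window parameter, a choice made before any reading happens, decides how much of each word's "company" survives into the output. The function, the window size, and the toy sentence below are assumptions introduced for this example.

```python
import re

def kwic(text, keyword, window=5):
    """Return keyword-in-context hits: `window` tokens either side of each match.

    Anything beyond the window is silently discarded; the chosen window size,
    not the source text, determines how much context the reader ever sees.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Invented toy sentence (not drawn from EEBO or ECCO):
sample = "liberty and learning are said to flourish together where the presse is free"
for left, kw, right in kwic(sample, "learning", window=3):
    print(f"{left:>30} | {kw} | {right}")
```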

Emerging from a collaborative project that seeks to identify and trace the movement of paradigmatic terms in early modern English, this paper will consider different ways of moving from the products of machine reading to the work of human reading (and back again), weighing up their strengths and weaknesses in the context of this work. The paper will respond to questions such as:

• What may be gained, lost, or simply hidden when historical texts are prepared for computational analysis?

• What do the different audiences (computers, humanities scholars) not read?

• How can humanities scholars test claims about collections of texts that are too big to read?

• How does one systematise close reading? What checks and balances may be employed when investigating large datasets?


Reflections will be grounded in recent work with early modern English text collections, including Early English Books Online (EEBO) and Eighteenth Century Collections Online (ECCO), making reference to corpus linguistics, distributional semantics, and literary and historical studies.