Big Language Data Comes with Big Opportunities and Big Challenges: A Learner-Corpus Case Study

Big language datasets allow researchers in various digital-humanities fields, such as corpus linguistics, to analyze language use in novel ways, using uniquely large and diverse samples. However, the scale of these datasets also creates new challenges for developing and using them.

Here, I present a case study on my development and use of one such dataset, the EFCAMDAT Cleaned Subcorpus, which contains over 700,000 texts written by learners at an international online English school.

Specifically, I outline the issues that I encountered during the development process, which are common among such datasets, and show how I dealt with them using methods from the digital humanities. For example, I show how I identified duplicate content by measuring the overlap between texts using Hamming distance, how I identified non-English content using the cld2 language classifier, and how I categorized texts by topic using latent Dirichlet allocation (LDA) modelling.
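
To make the duplicate-detection idea concrete, the following is a minimal sketch of one common way to compare texts by Hamming distance: each text is reduced to a fixed-length bit fingerprint (here a SimHash-style fingerprint over whitespace tokens), and the number of differing bits is counted. The fingerprinting scheme, the function names simhash and hamming_distance, and the 64-bit width are assumptions made for illustration; this abstract does not specify how the texts were represented, so this should not be read as the pipeline used for the EFCAMDAT Cleaned Subcorpus.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash-style fingerprint of `text` (illustrative assumption).

    `bits` must be a multiple of 8 and at most 128, because each token
    hash is taken from the first bits // 8 bytes of an MD5 digest.
    """
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(counts):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical example texts, not taken from the corpus.
text_a = "My name is Anna and I live in a small town."
text_b = "My name is Anna and I live in a small city."
distance = hamming_distance(simhash(text_a), simhash(text_b))
```

Under this sketch, a distance of 0 means that two texts share the same fingerprint, and small distances suggest near-duplicates. Any cut-off for flagging a pair as duplicate would need to be chosen empirically for the dataset at hand.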

In addition, I show how the scale of the new dataset enabled me to analyze language use with a sample that is much larger and more diverse than in past studies, in terms of factors such as the number of texts, learners, and tasks involved. Furthermore, I show how these factors enabled me to analyze language use in novel ways, for example by quantifying task effects using mixed-effects statistical models.

This case study provides important lessons that apply broadly across many types of big language datasets. Most notably, it demonstrates the need to thoroughly organize and clean such datasets, using quantitative and computational methods that can be implemented in a scalable manner. These lessons will help researchers in the digital humanities become aware of the specific opportunities that big language datasets offer, while also informing them of the associated challenges, and providing guidance on how to overcome those challenges.

Keywords: corpus linguistics; quantitative and computational methods; data curation