Addressing Inherent Biases in Information Retrieval Systems of Digital Archives: A Multidisciplinary Study in Digital Archives of Holocaust Victims and Perpetrators

Diakopoulos (2018)¹ stated that because algorithms rely on a quantified version of reality that incorporates only what is measurable as data, they can overlook much of the social context that would otherwise be essential to rendering an accurate decision. When we study the historiography of a particular event, it is equally important to acknowledge that the data in digital archives are not evenly distributed across the categories that actually exist. Because most datasets in scholarly digital archives are missing, hidden, underexposed, or blurred while some are highlighted, identifying prevailing algorithmic biases, or social factors such as media tendencies and research trends, can be challenging. Consequently, the information retrieved by machine-learning algorithms incorporates these biases. The inherent bias in scholarly digital archives can directly and indirectly influence future trends in scholarly communication by shaping the results that users obtain from their queries. Thus, class-imbalanced datasets that are skewed to varying degrees toward majority groups, as well as datasets missing from digital archives, can reduce the accuracy and performance of information retrieval. I therefore propose a practical guideline and a digital archive design for digital librarians and digital humanities scholars to improve the accuracy and performance of information retrieval in digital archives. The treatment of imbalanced classes in digital scholarly libraries and multimedia archives can be optimized with background information about the nature of the datasets and the characteristics of the corresponding archive. Digital archival studies that combine more sophisticated machine-learning algorithms with detailed explanations of the social context of a given event can improve the trustworthiness of the resulting knowledge in digital libraries and media archives.

Keywords: Research trends analysis, Digital archival research, Class-imbalance learning

¹ Diakopoulos, N. (2018). The Algorithms Beat. In L. Bounegru & J. Gray (Eds.), Data Journalism Handbook.