Humanities scholars, and especially historians, work with large amounts of data that come from various sources and are often unstructured. Ingesting these data into an easy-to-use database that permits complex queries or visualisations is often out of reach. Drawing on experiences in the DigiKAR geohumanities project, this blog post addresses the challenge and presents workflows in which relational or graph databases are optional end products rather than the starting point of the research process. These experiences may inspire other humanities projects to find low-maintenance alternatives to expensive multi-user databases with graphical user interfaces.
Structured data collection is a challenge that humanities scholars from different disciplines face in their research, and one worthy of a baroque Latin title. Lacking custom-made research infrastructures, most humanities scholars start collecting data in spreadsheets but would prefer a more advanced database with convenient input and query options. Ideally, the data should be machine-readable as well as human-readable, easy to update, and transparent about different levels of reliability. Among researchers who are not tech-savvy, databases are sometimes viewed as the Holy Grail: systems with miraculous powers that will solve all their problems of data collection and data analysis at once.
Building and maintaining databases, however, is complicated and often cost-intensive, and providing a user-friendly interface is not always possible. In this blog post, I argue that chronically under-funded humanities project teams need a research data management system that does not entirely depend on an elaborate multi-user database but provides structured data – which may at later stages be imported into different types of databases for data analysis.
In the DigiKAR geohumanities project, which I have co-coordinated since summer 2021, we are trying to develop exemplary workflows for working with historical spatial data that include a de-centralised and low-maintenance collection of data by different team members. The DigiKAR project (Digitale Kartenwerkstatt Altes Reich) is led by the Institute of European History in Mainz, Germany, and questions traditional cartographic visualisations of the Holy Roman Empire. Countering perceptions of the Empire as a “patchwork quilt” (a Flickenteppich) of multiple small territories, we highlight overlapping structures and trans-regional mobilities. For this purpose, we have chosen two regions of the early modern period as case studies: the Electorate of Mainz (which is covered by historians at JGU Mainz and EHESS Paris) and the Electorate of Saxony (covered by Falk Bretschneider, EHESS Paris).
The historians in the DigiKAR team are early modernists who often work with incomplete, uncertain or fragmented data from sources in many different formats. Apart from original archival material, little of which has been digitised, we gather semi-structured data assembled by researchers in the first half of the twentieth century, and XML-encoded data available via API. XML is a mark-up language commonly used for digital editions, and archives across the world have begun to provide transcripts of original sources in this format. An API (“application programming interface”) permits us to download a large number of archival resources via script. For script-based data collection, the collaboration between “traditional historians” and digital historians is essential. Most historians in DigiKAR are regular users of digital tools with graphical user interfaces (GUI) but have limited coding and database experience.
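To illustrate what happens after such a download, the following minimal sketch flattens XML records into spreadsheet-style rows. It is not the DigiKAR harvesting script: the element names (`record`, `person`, `place`, `date`), the `certainty` attribute and the sample values are all invented for illustration, and real archival XML will follow a different schema.

```python
import xml.etree.ElementTree as ET

# Invented sample of what an API response body might look like.
# Element and attribute names are assumptions, not a real archive's schema.
SAMPLE_XML = """<records>
  <record id="r1">
    <person>Person A</person>
    <place>Speyer</place>
    <date certainty="low">1700</date>
  </record>
  <record id="r2">
    <person>Person A</person>
    <place>Rome</place>
    <date>1705</date>
  </record>
</records>"""


def records_to_rows(xml_text):
    """Flatten each <record> element into a dict that fits one spreadsheet row."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("record"):
        date = record.find("date")
        rows.append({
            "record_id": record.get("id"),
            "person": record.findtext("person", default=""),
            "place": record.findtext("place", default=""),
            "date": date.text if date is not None else "",
            # Keep the source's own statement about reliability instead of dropping it.
            "certainty": date.get("certainty", "unspecified") if date is not None else "unspecified",
        })
    return rows


rows = records_to_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Carrying the certainty attribute over into its own column is one way to keep the ambiguity of the source visible once the data sit in a table.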
Regarding content, the historians in DigiKAR are interested in the constructions of space through legal and political interactions as well as in biographic mobility. On the level of data queries, the historians’ requirements range from Boolean queries to complex graph-related queries.
Both types of query require a high level of data cleaning and data normalisation or careful data mapping, which often collides with the historians’ wish to preserve the complexity and ambiguity of their original sources. One of the aims of the DigiKAR project is to balance the historiographical and technical requirements by providing iterative and comparative data analysis. The (geo)visualisations contribute to this critical analysis as well as to the communication of project results.
As a starting point, we have developed several spreadsheet templates for our two thematic work packages, respecting their different foci. In the Mainz work package, which traces clerical and secular officials’ biographies, we are collecting event-oriented data according to the factoid approach developed at King’s College London. This means that we gather agency-related events (e.g. “grand tour”) and general life events (e.g. “birth” and “death”) as stated in different sources. Uncertainty or vagueness of the information is captured in a comments column, where we also add important source quotations.
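A factoid spreadsheet of this kind can be pictured roughly as follows. The sketch is an assumption-laden illustration rather than the actual DigiKAR template: the column names, the sample rows and the keyword heuristic for spotting uncertainty in the comments column are all invented.

```python
import csv
import io

# Invented sample rows: one row per statement ("factoid") made by one source.
# Column names are assumptions, not the project's real template.
FACTOID_CSV = """factoid_id,person,event_type,event_value,place,source,comment
f1,Person A,birth,1650,Mainz,Source 1,
f2,Person A,birth,1651,Mainz,Source 2,"year uncertain; source gives 'circa'"
f3,Person A,grand tour,1670,Rome,Source 1,
"""


def load_factoids(text):
    """Read factoid rows from CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def flag_uncertain(factoids, keywords=("uncertain", "circa", "?")):
    """Return factoids whose comment hints at uncertainty (a simple keyword heuristic)."""
    flagged = []
    for factoid in factoids:
        comment = (factoid.get("comment") or "").lower()
        if any(keyword in comment for keyword in keywords):
            flagged.append(factoid)
    return flagged


factoids = load_factoids(FACTOID_CSV)
uncertain = flag_uncertain(factoids)
print([f["factoid_id"] for f in uncertain])
```

Because each row records what one source says, two rows may contradict each other without either being "wrong" at the point of data entry.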
In order to abstract from the redundant, sometimes contradictory factoid entries and construct more solid chronologies of events, we use Python scripts to compare and align the spreadsheet data (cf. our GitHub deployment under construction). Python is a general-purpose programming language that has become very popular in digital humanities as it is easy to learn and provides a large number of code packages that researchers can re-use. Python packages such as pandas are well suited to the analysis of data stored in tables. In the absence of a database, Python scripts can thus select and compare data. Our ultimate goal is to write a higher-level knowledge representation in RDF format for data queries that will eventually lead to more reliable visualisations of people’s mobility and of Mainz’s administrative structures. RDF is a model for data interchange on the Web and is used to integrate data from multiple sources (also cf. concept of Linked Data). As raw data, RDF files are easy to transfer and store, and they can be flexibly integrated into graph databases. Even if project funding ends, RDF files can thus be made available for re-use and archived long-term.
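The comparison step and the later RDF export can be sketched as follows. This is a simplified illustration, not the project's code: it uses only the Python standard library (pandas' `groupby` would serve the same purpose), the sample factoids are invented, the namespace `https://example.org/digikar-sketch/` is a placeholder, and the one-triple-per-event modelling is an assumption chosen for brevity.

```python
from collections import defaultdict
from urllib.parse import quote

# Invented factoids: each row is one source's statement about one event.
factoids = [
    {"person": "Person A", "event_type": "birth", "value": "1650", "source": "Source 1"},
    {"person": "Person A", "event_type": "birth", "value": "1651", "source": "Source 2"},
    {"person": "Person A", "event_type": "death", "value": "1710", "source": "Source 1"},
    {"person": "Person A", "event_type": "death", "value": "1710", "source": "Source 2"},
]


def align(rows):
    """Group factoids by person and event, recording which sources back which value."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        grouped[(row["person"], row["event_type"])][row["value"]].append(row["source"])
    summary = []
    for (person, event_type), values in grouped.items():
        summary.append({
            "person": person,
            "event_type": event_type,
            "values": {value: sorted(sources) for value, sources in values.items()},
            # One distinct value means the sources agree; otherwise a human must decide.
            "status": "agreed" if len(values) == 1 else "conflicting",
        })
    return summary


BASE = "https://example.org/digikar-sketch/"  # placeholder namespace (assumption)


def escape_literal(text):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_ntriples(summary):
    """Serialise only the agreed events as N-Triples, one triple per line."""
    lines = []
    for item in summary:
        if item["status"] != "agreed":
            continue
        subject = BASE + "person/" + quote(item["person"])
        predicate = BASE + "event/" + quote(item["event_type"])
        value = next(iter(item["values"]))
        lines.append(f'<{subject}> <{predicate}> "{escape_literal(value)}" .')
    return "\n".join(lines)


summary = align(factoids)
ntriples = to_ntriples(summary)
print(ntriples)
```

Conflicting events are deliberately left out of the export here, so that the decision between competing sources stays with the historian rather than the script.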
In the Saxony work package, we have decided to approach spaces through the legal capacities associated with them, which can include religious rights, peace keeping, taxation, or political representation. In this approach, a “mill” is a “mill” because certain legally defined functions within a local community are linked with the place and the people who live and work there. As in the Mainz work package, these data will be collected in spreadsheets, but as the focus of the Saxony case study is on place attributes rather than people’s networks, we intend to import those data into a relational GIS database for exploration and data enrichment early on. Our two work packages therefore experiment with slightly different workflows, but both are marked by a flexible combination of (versioned) spreadsheets, script-supported data cleaning, and more solid data infrastructures.
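The idea of describing a place through the legal capacities attached to it can be sketched as a small relational model. The example below uses Python's built-in `sqlite3` module purely for illustration. The table and column names, the sample places and the holders are all invented, and a real GIS database would add a geometry column (for instance via a spatial extension), which is omitted here.

```python
import sqlite3

# In-memory database for illustration; schema and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE legal_capacity (
    id       INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES place(id),
    capacity TEXT NOT NULL,   -- e.g. taxation, religious rights
    holder   TEXT,            -- who exercises the right
    source   TEXT             -- where the claim is documented
);
""")

conn.executemany(
    "INSERT INTO place (id, name) VALUES (?, ?)",
    [(1, "Mill X"), (2, "Village Y")],
)
conn.executemany(
    "INSERT INTO legal_capacity (place_id, capacity, holder, source) VALUES (?, ?, ?, ?)",
    [
        (1, "taxation", "Holder 1", "Source 1"),
        (1, "religious rights", "Holder 2", "Source 1"),
        (2, "taxation", "Holder 2", "Source 2"),
    ],
)


def places_with(capacity):
    """List (place, holder) pairs for which a given legal capacity is recorded."""
    query = """
        SELECT p.name, c.holder
        FROM place AS p
        JOIN legal_capacity AS c ON c.place_id = p.id
        WHERE c.capacity = ?
        ORDER BY p.name
    """
    return conn.execute(query, (capacity,)).fetchall()


taxation = places_with("taxation")
print(taxation)
```

Keeping capacities in their own table means one place can carry several overlapping rights held by different parties, which a single "territory" attribute could not express.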
Two examples of Python scripts that I have written for data review in the Mainz work package are an Entity Counter (used to count unique items across spreadsheets) and a Relationship Tracer (used to reconstruct missing family relationships based on the collected data). Both scripts are available in the DigiKAR GitHub repository and help us refine our data independently of any specific database.
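The general idea behind both scripts can be sketched in a few lines. The code below is not taken from the DigiKAR repository. It is a simplified reconstruction under stated assumptions: the function names, the column names and the sample data are invented, and the sibling rule (two people who share at least one recorded parent) is only one possible way of inferring missing family relationships.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented sample sheets, as lists of row dicts (e.g. read with csv.DictReader).
sheet_a = [{"person": "Person A", "place": "Mainz"},
           {"person": "Person B", "place": "Speyer"}]
sheet_b = [{"person": "Person A", "place": "Cologne"},
           {"person": "Person C", "place": "Mainz "}]  # note the trailing space


def count_entities(sheets, column):
    """Count how often each value of one column occurs across several sheets."""
    counter = Counter()
    for sheet in sheets:
        for row in sheet:
            value = (row.get(column) or "").strip()  # ignore stray whitespace
            if value:
                counter[value] += 1
    return counter


# Invented (parent, child) statements collected from the sources.
parent_child = [("Person D", "Person A"),
                ("Person D", "Person B"),
                ("Person E", "Person C")]


def infer_siblings(pairs):
    """Infer sibling pairs: two people sharing at least one recorded parent."""
    children_by_parent = defaultdict(set)
    for parent, child in pairs:
        children_by_parent[parent].add(child)
    siblings = set()
    for children in children_by_parent.values():
        for first, second in combinations(sorted(children), 2):
            siblings.add((first, second))
    return sorted(siblings)


person_counts = count_entities([sheet_a, sheet_b], "person")
place_counts = count_entities([sheet_a, sheet_b], "place")
siblings = infer_siblings(parent_child)
print(person_counts.most_common(), siblings)
```

Inferred relationships of this kind are hypotheses (half-siblings, for example, are not distinguished) and would need to be marked as such before they are merged back into the collected data.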
While the import of large amounts of data into the graph database and the relational GIS database is therefore a medium-term goal which requires a lot of preparation, we are already experimenting with smaller sub-sets of our data, especially in collaboration with students. In the summer term of 2022, Bettina Braun (JGU Mainz) taught an MA seminar in history which focused on fourteen early modern cathedral provosts from Mainz. Their professional careers and mobility patterns already gave us important insights into the religious and political interconnections of Mainz in the 17th and 18th centuries. Several clerics in the data set studied abroad, went on the grand tour or took on diplomatic missions. Moreover, many clerics serving in the Mainz cathedral chapters also served in Speyer or Cologne at some point in their careers. One example of a mobile cleric in the analysed data set is Dietrich Kaspar von Fürstenberg who spent time in Florence and Rome as well as in Speyer, Cologne, Bohemia and several places in present-day Belgium and France. Results from the seminar (static maps and a zoomable map of individual biographies) will successively be published on our DigiKAR Projektseminar website.
Instead of delaying data exploration until the uncertain point in time when a fully functional database is available, humanities projects should structure and analyse their data while data collection is still under way. Data exploration, including in the form of data visualisations, is essential for finding gaps and inconsistencies in the data that might otherwise be overlooked for too long. Data exploration is also a way of testing the applied data model, which can still be adjusted in an early phase but is difficult to change once a complex database has been built. Spreadsheets are, of course, not the most comfortable tool to work with, but if data are collected in several clearly structured spreadsheets and only combined or migrated via script, a project can involve an arbitrary number of people in the data collection. Versioning the spreadsheets and creating automated back-ups is key.
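Combining separately maintained spreadsheets via script, with a back-up taken first, might look like the sketch below. It is an illustration under assumptions, not the project's workflow: the file names (`collector_1.csv` and so on), the columns and the timestamped back-up naming scheme are invented, and a version control system would be a natural complement to the copies made here.

```python
import csv
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup(path, backup_dir):
    """Copy a spreadsheet into backup_dir under a timestamped file name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = backup_dir / f"{path.stem}_{stamp}{path.suffix}"
    shutil.copy2(path, target)
    return target


def merge_sheets(paths):
    """Read several CSV sheets and record which file each row came from."""
    merged = []
    for path in paths:
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # provenance of the row
                merged.append(row)
    return merged


# Demonstration in a temporary folder with two invented collector sheets.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "collector_1.csv").write_text("person,place\nPerson A,Mainz\n", encoding="utf-8")
    (folder / "collector_2.csv").write_text("person,place\nPerson B,Speyer\n", encoding="utf-8")

    sheets = sorted(folder.glob("collector_*.csv"))
    backup_names = [backup(sheet, folder / "backups").name for sheet in sheets]
    merged = merge_sheets(sheets)

print(len(merged), "rows merged; backups:", backup_names)
```

Because every merged row keeps the name of the file it came from, inconsistencies found during exploration can be traced back to the person who collected the data.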
Researchers interested in similar questions are welcome to contact us. We are aiming to organise community and networking events to exchange ideas and discuss potential collaborations.
Monika Barget Publishes blog on the DigiKAR geohumanities project | FASoS Weekly
[…] Monika Barget recently published an article for Mosa Historia, on her experiences working on the DigiKAR geohumanities project. […]
De Collectione Datorum – the challenges of developing data models and databases in humanities projects – INSULAE
[…] following blog post was initially written for Mosa Historia, the blog run by the Department of History at the University of […]