Earlier this month, I made the 6-hour rail journey to Plymouth to participate in the Digital Research in the Humanities and Arts conference, dataAche. I was there to participate in a panel organised by Gabriel Egan, around the theme of “the author’s unseeing eye”.

The well-trained eye will spot a strong alignment between my dataAche abstract (appended) and work presented to SHARP earlier in the summer. Continuing my work with EEBO-TCP’s regrettably messy metadata, I finally feel like I may have a handle on what we (Linguistic DNA) can do to model context-at-scale.

I’ve also made a new visual to quickly map the difference between EEBO, TCP, and EEBO-TCP:

Venn diagram showing the overlap between EEBO and TCP.
The relationship between EEBO and TCP. Diagram copyright (c) I. C. Hine / Linguistic DNA.

And another to show the task I’ve been working on, i.e. that of breaking apart EEBO-TCP in some meaningful ways.

Breaking up EEBO-TCP (mauve circle being broken up into pieces).

It was a particular pleasure to be speaking alongside Alan Hogarth, the researcher responsible for putting together VEP’s Super-Science collection—a subset of EEBO-TCP developed to enable scrutiny of science writing. More on that soon over on the LDNA blog.

In the meantime, some Twitter feedback:

Tweet from English at Plymouth Uni with photo from presentation and "taking spirited presentation to a new high".

Linguistic DNA, dirty data and messy metadata: an exercise in catalogue genetics

Since 2015, the Linguistic DNA team has been developing methods for mapping meaning and change-in-meaning in Early Modern English. Our work begins with the hypothesis that meanings are not equivalent to words, and can be invoked in many different ways. For example, when Early Modern writers discuss processes of democracy, there is no guarantee that they will also employ a keyword such as democracy. We adopt a data-driven approach, using measures of frequency and proximity to track associations between words in texts over time. Strong patterns of co-occurrence between words allow us to build groups of words that collectively represent meanings-in-context (textual and historical). We term these groups “discursive concepts”.

The task of modelling discursive concepts in textual data has been absorbing and challenging, both theoretically and practically. Our main dataset, transcriptions of texts from Early English Books Online (EEBO-TCP), contains more than 50 000 texts. These include 9000 single-page broadsheets and 162 volumes that span more than 1000 pages. There are 127 items printed pre-1500, and nearly 7000 from the 1690s. The process of analysis therefore requires us to think carefully about how best to control and report on this variation in data distribution.

One particular question that has arisen affects all who attempt to use EEBO: what is in it? To what extent is its material from pre-1500 similar in kind (genre, immediacy, etc.) to that of the messy 1550s (as the English throne shifted speedily between Edward VI and his siblings), the 1610s (era of Shakespeare and the King James Version), or the 1640s (when Civil War raged)? This paper is a sustained reflection on attempts to find out “What’s in EEBO?”, incorporating a study of its accompanying catalogue metadata that regrettably reveals more about the catalogue authors than about the data it describes.
