One of my last obligations with the Linguistic DNA project (though who knows what doors may open) was a short presentation on the “Public Sermons” collection as part of a workshop on Early modern preaching. This one-day conference was organised by a pair of postgraduate researchers, and brought together 30 or so scholars with a keen enthusiasm for the topic. It was a natural venue to share some of what we achieved modelling change with EEBO-TCP, and I was delighted that Tilly and Catherine (the organisers) found a space for this within a busy and collegiate programme.
In the run-up to the event, I was a little anxious. Some of the tools associated with Linguistic DNA’s public interface are still being refined, and I had sent off such a promising abstract that I couldn’t quite see how I might live up to expectation.
The data-driven approach taken by Linguistic DNA means that the results can be somewhat overwhelming. Indeed, there were points where the results generated threatened to overcome the available computer infrastructure. We wound up taking the very difficult decision to screen out information relating to 10 super-frequent nouns: those that appeared more than 1.5 million times in EEBO-TCP. Top of that list was God. Imagine a collection of godless sermons! Early modern preachers would have a heart attack!
Even with this filtering (and we also omitted low-frequency nouns, to screen out cases where there was insufficient data and achieve a reasonable return rate on queries), what is left can overwhelm. In an attempt to make sense of some of what is there, and to do that at scale (rather than narrowing in on whatever first grabs the interest), I decided to sample and compare the first 1000 pair results from different slices of data and according to different LDNA measures.
The “Public Sermons” collection is actually defined by either of two criteria: a document qualifies if its title includes a form of the word sermon or homily, or if the subject headings in the TCP metadata include the word sermon (most typically “Sermons: English”). The ‘Public’ label simply signifies that the material available is in the public domain; some of EEBO-TCP won’t enter the public domain until 2021.
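For readers who think in code, the membership rule can be sketched roughly as follows. This is my own illustration, not the project's implementation: the function name, the patterns, and the idea of passing subject headings as a list are all assumptions, and a simple prefix match will miss some early modern spellings (“homelie”, for instance).

```python
import re

# Hypothetical patterns: "a form of the word sermon or homily" is
# approximated by a case-insensitive prefix match. Real matching against
# early modern spelling variants would need to be more generous.
TITLE_PATTERN = re.compile(r"\b(sermon|homil)", re.IGNORECASE)
SUBJECT_PATTERN = re.compile(r"\bsermon", re.IGNORECASE)


def in_sermons_collection(title, subject_headings):
    """Return True if the title OR any subject heading matches."""
    if TITLE_PATTERN.search(title):
        return True
    return any(SUBJECT_PATTERN.search(heading) for heading in subject_headings)
```

Either test is sufficient on its own, so a work with an uninformative title can still qualify through its subject headings.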
Like other parts of the Linguistic DNA data, one can approach the Sermons collection as a whole or in twenty-year sections. I sampled both, and then sampled some other collections for good measure — the thirty-odd publications linked to Thomas Becon, the thousand or so ‘public’ items from Visualizing English Print’s SuperScience collection, and the data as a whole.
In each case, I took the first thousand results (a) when ordered by “Power”, the score that determines how strongly linked a word is to its partner (based on statistical frequencies); (b) when ordered by “Documents”, the quantity of different texts each containing at least one instance; and (c) when ordered by “Windows”, i.e. regions of text (+/-50 words) around the initial noun, indicating how central the pair seems to be to the documents’ content.
I imported my results into Tableau, a visualization and analysis tool that publishes to the web. (The legwork was plentiful copying and pasting, and some indexing in a spreadsheet.) Working with Tableau enabled me to explore the samples, comparing how prominent pairs in one dataset appeared (or did not appear) in others.
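I did all of this by hand, but the sampling and comparison could be sketched in a few lines of Python. Everything here is illustrative: the field names (`noun`, `partner`, `power`, `documents`, `windows`) are my assumptions about how exported rows might look, not the actual LDNA export format, and the figures in the toy rows are invented.

```python
def top_pairs(rows, measure, n=1000):
    """Return the first n rows when ordered by a measure, highest first."""
    return sorted(rows, key=lambda row: row[measure], reverse=True)[:n]


def pair_set(rows):
    """Reduce rows to ordered (noun, partner) pairs; order matters here."""
    return {(row["noun"], row["partner"]) for row in rows}


def compare_samples(sample_a, sample_b):
    """Show which pairs are shared and which appear in only one sample."""
    a, b = pair_set(sample_a), pair_set(sample_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy rows with invented figures, purely to show the shape of the data.
sermons = [
    {"noun": "sermon", "partner": "london", "power": 5.0, "documents": 120, "windows": 300},
    {"noun": "body", "partner": "soul", "power": 7.5, "documents": 80, "windows": 200},
]
science = [
    {"noun": "body", "partner": "soul", "power": 7.0, "documents": 40, "windows": 150},
]

result = compare_samples(top_pairs(sermons, "power"), top_pairs(science, "power"))
```

Pairs are kept as ordered tuples because the interface treats each direction of a pair as a separate result.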
As an example, the pair “sermon london” is prominent at a document level in the seventeenth century–an epiphenomenon, I suspect, of more consistent title page formatting, when compared with early printed works (which may either have lost or never had a complete title-page, or may have noted the place of printing–or indeed preaching–only at the end, within a colophon). 1
(You can explore this phenomenon by decade in some data I snapped up earlier in my Tableau experiments.)
The pair “body soul” appears in Science texts with a similar power of association as in sermons. When one appreciates that philosophy is counted among the sciences, this is less surprising.
My results–or perhaps better ‘experiments’–are available to browse on Tableau Public. I hope the Linguistic DNA documentation (explaining what’s what, and giving detail about decisions such as the omission of super-frequent words) will soon be available through the main interface as well. I may be able to supply advance copies of some material on request.
1. Beware the occasional LDNA gremlin. Visit Tableau Public and you’ll discover that london + sermon has no entries in 1680s sermons, though the reverse (sermon + london) does. There seems to be an interface delivery issue where the foremost result (in this case sermon + london) doesn’t display in the results table, creating some gaps that become high profile via my copy & paste routine.