Tools for EEBO-TCP & the challenge of reproducibility

After a gap in posts, this is a somewhat epic effort, following up on a “Language and Society” seminar with University of Sheffield History students this morning. Alongside an overarching interest in “reproducibility”, it contains:

A description and tips for EEBO-TCP tools including the main Chadwyck interface, CQPweb, EarlyPrint and SketchEngine.
Some other interesting tools.
A brief “how to” for node + collocate searches (enabling exploration of bigger windows).

Reproducibility in the Humanities

When medical trials are designed, administered and reported, it is apparent that the exact circumstances need to be recorded because these can have a critical impact on the efficacy of whatever is being trialled. If something is effective, knowledge about how to use it will be important. In Humanities research, the reproducibility of our results may not be a matter of life and death. But it is still good practice, and it requires clarity.

If, given the same dataset, I cannot replicate your findings, this may be because there was something wrong with how you carried out the research—or because there’s something wrong with how I did. But unless the methodology is carefully communicated it may be difficult to know who has made a mistake, or whether we just asked different questions.

Work with EEBO-TCP is especially perilous when it comes to reproducibility, because scholars can be poor at recording where, how and when the research was done¹—and these details can have a surprisingly important impact on what you find. Two scholars can conduct research based on EEBO-TCP but get different results because they are de facto using different datasets.

This post was edited on 11 April 2017 in recognition that SketchEngine have confirmed the composition of their TCP dataset and updated its accompanying information.

Chadwyck

In a British University, it is normal to encounter EEBO-TCP via the Chadwyck interface. Here the fullness of Early English Books Online is united with the subset of “fully keyed” texts produced by the Text Creation Partnership (EEBO-TCP). Because Chadwyck is a commercial product, it is updated with new batches of TCP work as these are released and those in UK Universities will normally find they have access to these new releases.² When I last checked (late October), there were over 60,000 “fully keyed” texts in Chadwyck’s EEBO, following a September 2016 update.

Tip: The current number of “keyed full texts” appears in the drop-down box where you can “limit” your search to just these documents. If you need to know how many such documents there are in a specific time-span, combine the date restrictions with a full text only restriction.

Although the standard Chadwyck interface doesn’t facilitate the granular quest for particular parts-of-speech, its (relatively) new “variant forms” check-box means that a search for “addict” will return texts with “addicted”, “addicting”, etc. without the need for a wild-card search.³

CQPweb

Andrew Hardie, a corpus linguist based at Lancaster University, designed the CQPweb interface to process standard linguistics queries for different text collections (corpora). Again, access to the EEBO-TCP data is dependent on your institution’s subscriptions. In principle, Hardie’s “Early English Books Online (V3)” corpus includes about 44,000 documents: the subset of TCP that was available when Hardie last prepared it for corpus queries. The preparation includes spelling regularisation with VARD (to compensate for some irregularities of early modern spelling) and automatically determining each word’s part-of-speech (is it a noun? a verb?). Inevitably this is an imperfect procedure, but it makes complex granular queries possible.

Tip: By default, search results are delivered in a KWIC concordance view. That results page also connects with further queries (via the dropdown box) including “distribution” over time (viewed as numbers or transformed to a bar chart) and “collocates” (words that appear near the search term repeatedly).

N.B. Growth in frequency may just be a symptom of overall growth in texts that were printed, have survived, and been transcribed. CQPweb displays an IPMW figure (instances per million words) so you can see relative frequencies, but for the early sixteenth century any word that occurs at all will have a high IPMW because there are so few words in that part of the dataset. (Overall, CQPweb’s EEBO V3 has 1.2 billion words.)
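To see why a sparse early decade can look deceptively busy, here is a minimal Python sketch of the IPMW arithmetic. The function name and all the counts are invented for illustration; they are not figures from CQPweb.

```python
def ipmw(hits: int, corpus_words: int) -> float:
    """Instances per million words: raw hits scaled to a million-word corpus."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return hits * 1_000_000 / corpus_words

# Hypothetical counts, invented for illustration only.
early = ipmw(hits=3, corpus_words=500_000)        # a sparse early decade
late = ipmw(hits=300, corpus_words=100_000_000)   # a well-stocked later decade

print(f"early: {early} IPMW, late: {late} IPMW")
```

With these made-up numbers, three hits in the small early subcorpus score a higher IPMW than three hundred hits in the large later one, which is exactly the distortion to watch for.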

EarlyPrint

The EarlyPrint interface provided by Washington University in St. Louis includes a compelling visual illustrating the extent to which the volume of texts transcribed (EEBO-TCP) is proportionate to the number of publications documented in the Short Title Catalogue across the period to 1700. Its N-Gram browser offers a quick method of comparing the distribution of terms and phrases across time, facilitating a rapid survey of trends. This can be used to show how spelling changes (hence the default example: love, loue) or examine how word and phrase use rises and falls.

Tip: Pay attention to what you are searching for. A search of the original will be affected by changes in spelling. A lemma search will cover more uses (but can conflate nouns and verbs unless you are specific). Start with unigrams and pick a set of terms (e.g. civil, honest, valour, virtue) to experiment with the different query parameters.

The “Key Words in Context” search yields a concordance view, ordered chronologically. Although the context window shown is always small, you can use the metadata to find the full text via one of the other interfaces. It’s easy to see how heavy the use of a term is in different texts, and across authors.

But which version of EEBO-TCP is this? The documentation is silent, though a blank search suggests there are 48,000 documents here. When comparing searches carried out here with those on CQPweb, you should also note that EarlyPrint’s part-of-speech tags are derived from Northwestern University’s “NUPOS” system, and its spelling regularisation is also conducted with Northwestern’s MorphAdorner pipeline (not VARD).

SketchEngine

Since Phase I of EEBO-TCP (ca. 25,000 texts) became public domain in January 2015, it has been incorporated into more tools. The 32,844 documents contained in SketchEngine’s Historical collection reflect Phase I, supplemented by ECCO-TCP and EVANS-TCP which are also public domain.⁴ (In this case it is the tool rather than the data that requires a subscription.)

Among the facilities SketchEngine offers is a Thesaurus function that identifies similar words by comparing the contexts in which they occur. (Among words similar to “foreigner” are “stranger”, “citizen” and “European” because these words share collocates.) To get a good measure of similarity, the search term will need to occur frequently; the results are thus impaired by the absence of any spelling regulariser, because variant spellings are counted separately and there are fewer occurrences of each form with which to build the lists of collocates.

[Figure: Thesaurus function in SketchEngine, showing results for “foreigner” in their (so-called) Early English Books Online corpus.]
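The idea of “similar because they share collocates” can be sketched in a few lines of Python. This is not SketchEngine’s actual scoring: a plain Jaccard overlap of collocate sets is swapped in here as a simpler stand-in, and the function names, window size and toy text are all assumptions made for illustration.

```python
from collections import defaultdict

def collocate_sets(tokens, window=2):
    """Map each word to the set of words seen within `window` tokens of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Shared collocates divided by all collocates of either word."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy text, invented for illustration only.
tokens = ("the foreigner came to the city and the stranger came to the city "
          "but the merchant sold cloth").split()
ctx = collocate_sets(tokens)
sim_stranger = jaccard(ctx["foreigner"], ctx["stranger"])
sim_merchant = jaccard(ctx["foreigner"], ctx["merchant"])

print(sim_stranger, sim_merchant)
```

In the toy text “foreigner” and “stranger” keep the same company, so they score higher than “foreigner” and “merchant”. If half the occurrences were spelled “forreiner”, each spelling would get its own, thinner collocate set, which is the problem an unregularised corpus creates.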

The WordSketch tool is useful for getting a quick impression of how a word is used, though be aware that it is reliant on successful part-of-speech analysis and that can be hit-and-miss on some fifteenth- and sixteenth-century texts because the syntax is less typically ‘modern’. (Most POS analysis tools are first trained with modern English.)⁵

Tip: Using the “sort” options in the lower left-hand menu you can order your concordance results alphabetically based on the words immediately to the right or left. This is a typical corpus linguistics way of arranging texts to see and examine patterns around your search term (aka the “node”).
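If you want to see what that sorting does under the bonnet, here is a minimal Python sketch of a concordance sorted on the right-hand or left-hand context. The function names, context width and toy sentence are assumptions for illustration; SketchEngine’s own implementation will differ.

```python
def kwic(tokens, node, width=3):
    """Return (left, node, right) context triples for every hit on `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

def sort_right(lines):
    """Order concordance lines alphabetically by the words to the right."""
    return sorted(lines, key=lambda line: [w.lower() for w in line[2]])

def sort_left(lines):
    """Order by the words to the left, nearest word first."""
    return sorted(lines, key=lambda line: [w.lower() for w in reversed(line[0])])

# Toy text, invented for illustration only.
tokens = "true virtue is rare but civil virtue and honest virtue are praised".split()
by_right = sort_right(kwic(tokens, "virtue"))
by_left = sort_left(kwic(tokens, "virtue"))

for left, node, right in by_left:
    print(" ".join(left).rjust(20), f"[{node}]", " ".join(right))
```

Sorting on the left brings “civil virtue”, “honest virtue” and “true virtue” into alphabetical order, so repeated modifiers line up under one another and patterns become easier to spot.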

Too many choices, too little time?

As discussed in the seminar, you will find different tools useful for different things—and each of the tools we’ve introduced to you is good for some research tasks. It would be perfectly reasonable to feel bewildered by opportunity or terrified by technicalities after today’s first encounter. (Making good scholars sometimes involves introducing you to the worst of a good thing.) But if you’re brave enough to carry on in the world of so-called “distant reading”, there’s good news too:

Most metadata will be good; the earliest “addict” is an unfortunate example (and a flaw that only affects its Chadwyck metadata), so you don’t need to be too sceptical. But double-check if and when a text becomes a critical point in your argument.
The virtue of the tools is that you can get a broader view than is possible to gain from close reading; approaching with care, you can also expect to identify phenomena you would not have noticed when reading close up.
The tools work best with high frequency items because there’s enough information to form good conclusions. (This is why you may spot fewer flaws with a word like “virtue” than “addict”.)
And remember: the data changes, and just as EEBO-TCP is not “representative” (something covered in an ongoing series of LDNA blog-posts), the subsets of it that happen to be available in different tools are accidents rather than designed representatives of the whole.

The important takeaway is: always record what you queried, how you queried it, when and where (on what site). That way you can avoid confusing yourself and others, and make your own work more credible.⁶
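One low-effort way to keep such a record is a simple log with one row per search. The Python sketch below is only a suggestion: the field names are my own, and the example entry (including its date and hit count) is invented for illustration.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date

@dataclass
class QueryRecord:
    """One row of a research log: what, how, when and where."""
    query: str      # what: exactly what was typed
    tool: str       # where: the site or interface
    corpus: str     # which dataset or version the tool reports
    settings: str   # how: e.g. variant forms, date limits, window size
    run_on: str     # when: ISO date of the search
    hits: int       # what came back

def to_csv(records):
    """Serialise records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(QueryRecord)])
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))
    return buf.getvalue()

# Hypothetical entry; the date and hit count are invented for illustration.
log = [QueryRecord("addict", "CQPweb", "Early English Books Online (V3)",
                   "simple query, case-insensitive",
                   date(2016, 11, 1).isoformat(), 1234)]
csv_text = to_csv(log)
print(csv_text)
```

A spreadsheet or a notebook page with the same columns does the job just as well; what matters is that every column gets filled in at the time of the search.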

And remember, computers can’t read. That’s your job!

More tools?!

Well done for making it this far! Here are some other things you might find interesting:

The Historical Thesaurus of English. Based on OED definitions and supplemented by other data (especially for Old English), the HT offers a way into CONCEPTS that can help you supplement a starting list of query words.
Visualizing English Print. A joint project between Strathclyde and Wisconsin with a special interest in literature and drama, it offers machine-friendly plain text subsets of EEBO-TCP covering particular themes (drama, science) and a random sample of 40 texts per decade (closer to a designed linguistic corpus). You can upload such a corpus to SketchEngine or try out AntConc.

And of course:

Linguistic DNA. Posts of particular interest include the early reviews of VARD and MorphAdorner (tools used to tackle spelling variation), and the ECCO OCR v TCP analysis (a good reminder of why the Artemis Ngram graphs may prove misleading).⁷
In future, LDNA should also provide you with tools you can use to explore bigger spaces and examine discursive concepts (as in the POLITENESS example given in class).

And finally, the promised brief “how to”:

How to search for a node’s cooccurrences with a specified collocate within a window of N words to right and left:

Chadwyck:	node NEAR.N collocate
CQPweb:	node <<N>> collocate
EarlyPrint:	"node collocate"~N

You can also perform directional searches in two of the tools. The following instructions will find instances where the collocate occurs to the right of the node:

Chadwyck:	node FBY.N collocate
CQPweb:	node >>N>> collocate

Tip: Where N (the window radius) is greater than 5, it may be more appropriate to refer to it as a cooccurring term or simply “cooccurrence” rather than a collocate.
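For anyone working with their own plain-text files (for example, a Visualizing English Print subset), the same window logic can be sketched in Python. This is an illustration under stated assumptions, not how any of the three tools is implemented: the function name, the `direction` parameter and the toy sentence are all hypothetical, and I am assuming N counts whole words on each side of the node.

```python
def cooccurrences(tokens, node, collocate, n, direction="both"):
    """Return (node_index, collocate_index) pairs within n words.

    direction="both" uses a symmetric window to left and right;
    direction="right" keeps only hits where the collocate follows the node.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = i + 1 if direction == "right" else max(0, i - n)
        stop = min(len(tokens), i + n + 1)
        for j in range(start, stop):
            if j != i and tokens[j] == collocate:
                hits.append((i, j))
    return hits

# Toy text, invented for illustration only.
tokens = "honest men love virtue and virtue rewards honest men".split()
both = cooccurrences(tokens, "virtue", "honest", n=3)
right = cooccurrences(tokens, "virtue", "honest", n=3, direction="right")

print(both)
print(right)
```

The symmetric search finds “honest” on either side of “virtue”, while the directional search keeps only the case where “honest” follows. Widening `n` is the equivalent of raising N in the query syntax above.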

Notes

1 See Tim Hitchcock’s comment: “I have yet to see a piece of academic history that is explicit about its reliance on keyword search and electronic sources. As editors and authors, we accept and write footnotes that misrepresent the research process.” Via Hitchcock, ‘Confronting the Digital: Or How Academic History Writing Lost the Plot’, Cultural and Social History, 10, 1 (2013), 9–23 (18).

2 Jisc negotiated this access on behalf of UK research institutes; it is rather better than the access available at many non-UK institutions, where access is dependent on current subscription options.

3 This is not to decry the value of wildcard searches. The detail of what you enter into the search can have a profound impact on what you get out, and by learning how characters can be used to shape your search, you gain more control over your research questions. (See the Wildcard guide in your handout for more information.)

4 Observing the quantity of texts and their chronological spread, the original version of this post had deduced the composition of the dataset, then designated simply as EEBO by SketchEngine. As of 11.04.2017, and with only slight prompting, SketchEngine have confirmed the composition and given it a new accurate designation and dedicated information page.

5 You can also combat some of the shortcomings of the existing corpus by uploading your own data directly to SketchEngine, though this may feel like an advanced step at present.

6 With the added benefit that this will make your scholarship more credible—and more cutting edge—than much of what is commonly published; on which, see Tim Hitchcock’s comment in note 1 above.

7 We didn’t look at Eighteenth Century Collections Online in class, though you have some information in the handout about the Artemis interface Gale recently introduced. There are some interesting Artemis tools and it is worth using. However, it pays to be cautious about interpreting what you see. In addition to the LDNA recommended reading, check out Patrick Spedding’s article on ECCO and the history of the condom.