
Hæfde ða gefohten foremærne blæd
Iudith æt guðe, swa hyre god uðe,
swegles ealdor, þe hyre sigores onleah.

More Mapping Experiments

In this post, I want to present some experiments with mapping manuscripts containing texts related to Judith that circulated among Anglo-Saxons. Several months ago, I worked up a table of the raw data for mapping, but then set it aside. Recently, I took up the data again, revised it, and decided to start experimenting with different mapping tools to see which ones worked and what issues I encountered, and to think through what exactly I want to map and how. In general, I want to map manuscripts at their supposed origin points (more on that below), as well as their movements in time and space, to be able to visualize clusters of key points overall and at particular points in time. For example, if I were to walk into the library at Worcester in the year 1000, what books containing texts related to Judith might I find on the shelves, or in the scriptorium? It’s this sort of question, as well as larger questions about visualizing manuscript data, that I have in mind.

So here are a few of my visualizations. All of the following maps are based on a much more robust set of raw data, which can be accessed via a Google spreadsheet here. For the dates, places of origin, and places of provenance, I have relied on both N. R. Ker’s Catalogue of Manuscripts Containing Anglo-Saxon (Oxford, 1957) and the more recent bibliographical reference by Helmut Gneuss and Michael Lapidge, Anglo-Saxon Manuscripts: A Bibliographical Handlist of Manuscripts and Manuscript Fragments Written or Owned in England up to 1100 (Toronto, 2014)–references to which I include in the data table. Also notable is the trove of descriptions and suggestions in The Production and Use of English Manuscripts 1060-1220, although Gneuss and Lapidge do account for this project.

After playing with different tools, I decided to focus on CartoDB, because it is very user friendly, it offers several visualization options that work well (enough) for what I want to do, it liked my data, and it led to interesting results. It’s not the perfect tool for what I eventually want to map, but it is a good place for my data to reside for now.

So here is one map using it.

Click to open dynamic map.

This map depicts two sets of data points: 1) the origins of manuscripts containing texts within the project corpus (which should be interpreted as somewhat fuzzy, based on suggested provenances); and 2) the modern-day libraries in which those manuscripts are held. The two sets of data are by default depicted together, but the “Visible Layers” option allows viewers to choose which data to present. Of course, more interactivity would be ideal, but that just is not possible with the (free) version of CartoDB I used for this. Unfortunately, there is no way to link the two data points–to visualize lines drawn from origin points to modern-day library locations for each manuscript–but perhaps that’s possible with another tool, or in the future.
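As a rough illustration of what linking the two sets of points could look like outside CartoDB, here is a minimal Python sketch that pairs each origin with its holding library as a GeoJSON LineString, a format many mapping tools can import. The function name, the shelfmark, and the rounded coordinates (roughly Worcester and London) are assumptions for illustration, not data from the project spreadsheet.

```python
import json

def line_feature(shelfmark, origin, library):
    """Build one GeoJSON LineString from an origin point to a holding library.

    `origin` and `library` are (latitude, longitude) pairs. GeoJSON expects
    [longitude, latitude], so the order is swapped here.
    """
    (o_lat, o_lon), (l_lat, l_lon) = origin, library
    return {
        "type": "Feature",
        "properties": {"shelfmark": shelfmark},
        "geometry": {
            "type": "LineString",
            "coordinates": [[o_lon, o_lat], [l_lon, l_lat]],
        },
    }

# Hypothetical row: approximate coordinates for Worcester and London.
rows = [("Example MS 1", (52.19, -2.22), (51.53, -0.13))]

collection = {
    "type": "FeatureCollection",
    "features": [line_feature(*row) for row in rows],
}
geojson_text = json.dumps(collection, indent=2)
```

Whether a given tool draws these lines well is a separate question; the sketch only shows that the link itself is easy to express once both sets of coordinates sit in the same row.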

Another issue I encountered was the problem with “fuzzy” locations. For example, several of these manuscripts are marked by Gneuss and Lapidge with “?”–naturally indicating that this is a suggestion, but not definite. In the case of the provenance of other manuscripts, new scholarship could certainly shift assessments. So, for some manuscripts, “England” or “France” is indicated, or other regions (“SE England,” “S England,” “NE France,” etc.), but the only way to map those is to provide a generic data point somewhere in the center of England or France–which doesn’t depict accuracy as much as it does an aggregate. Even locally, there is the issue between Christ Church or St. Augustine’s in Canterbury–if we know a manuscript was at Canterbury, how do we determine which of the two specific locations? How do we distinguish? How do we show data that we’re fairly sure about rather than data that is only suggestive? The map, in other words, should be viewed with caution, and the raw data should be consulted for more details.
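One way to keep this fuzziness visible in the data, rather than hiding it behind a single dot, is to store certainty and precision alongside each coordinate. The following is only a sketch under my own assumptions: the field names, the tiny gazetteer, and its rounded coordinates are placeholders, and the trailing “?” convention is simplified from the handlist.

```python
from dataclasses import dataclass

# Approximate placeholder coordinates (assumptions for illustration only).
GAZETTEER = {
    "Canterbury": (51.28, 1.08),
    "England": (52.5, -1.5),  # rough centre point standing in for a whole region
}
REGIONS = {"England"}  # names that denote areas rather than single sites

@dataclass
class Origin:
    name: str
    lat: float
    lon: float
    precision: str  # "site" or "region"
    certain: bool   # False when the source marks the place with "?"

def parse_origin(raw: str) -> Origin:
    """Turn a cell such as 'Canterbury?' into a point plus certainty flags."""
    cleaned = raw.strip()
    certain = not cleaned.endswith("?")
    name = cleaned.rstrip("?").strip()
    lat, lon = GAZETTEER[name]
    precision = "region" if name in REGIONS else "site"
    return Origin(name, lat, lon, precision, certain)

print(parse_origin("Canterbury?"))
print(parse_origin("England"))
```

With flags like these, a map layer could style uncertain or region-level points differently (a lighter colour, say), or leave them out altogether.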

Here is another way to present this data, focused only on the probable or suggested origins of these manuscripts, with clusters indicating the more saturated locations:


Click to open dynamic map.

The clusters do a nice job of quantifying the hot points, although, again, fuzzy points should be kept in mind. For example, some of the clusters in the center of England aren’t for specific libraries (you might be wondering what major manuscript centers you’ve missed out on there!), but they’re generic data points for general origins like “England” or “Mercia.”

Finally, none of these maps allows for visualizing time along with the data. There are tools that will allow that, but I haven’t yet found the right one. For now, these maps help to conceptualize where I want to go. Working with the data in an actual mapping tool has, I hope, helped me sort out some of my future goals and the difficulties of visualizing complex, multi-dimensional data.
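Even without a time-aware mapping tool, the Worcester-in-1000 question can be asked of the data directly. Here is a minimal Python sketch, assuming each manuscript's presence at a place is recorded as a range of years; the record structure and the three sample records are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stay:
    shelfmark: str
    place: str
    start: int  # earliest year the manuscript is thought to be there
    end: int    # latest year

def on_the_shelves(stays, place, year):
    """Return shelfmarks recorded at `place` during `year` (inclusive bounds)."""
    return sorted({s.shelfmark for s in stays
                   if s.place == place and s.start <= year <= s.end})

# Invented sample records, not taken from the project spreadsheet.
stays = [
    Stay("MS A", "Worcester", 950, 1100),
    Stay("MS B", "Canterbury", 980, 1050),
    Stay("MS C", "Worcester", 1020, 1150),
]

print(on_the_shelves(stays, "Worcester", 1000))
```

The hard part is not the query but the ranges themselves, which inherit all the uncertainty of the dating and provenance evidence.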

Discovery: Alcuin’s Judith

Portrait of Alcuin, courtesy of and the Bibliothèque nationale de France.

A few weeks ago, I was shocked (shocked!) to experience another discovery. Not that I expected this project to stagnate, that all discoveries were behind me, that I would only be concerned with analysis from here on out–none of this is true–but it was the nature of the discovery that shocked me: I found yet another text for my corpus, by the Anglo-Saxon scholar Alcuin.

Here’s the story. I have taken another methodological approach to my project, and have been examining how Anglo-Saxons used quotations from the biblical book of Judith, which is slightly different from the compiling, editing, and text-mining that I’ve done so far. Part of this is my desire to look at the subject from a number of different angles, to explore many different facets of the central question about Judith in Anglo-Saxon England. In my recent approach, I’m concerned with the text of the biblical Judith, mainly the versions that Anglo-Saxons consulted and how they used them. For example, if they could read Greek, did they consult the Septuagint? Old Latin translations of the Septuagint? Jerome’s Vulgate? And, since there were various forms circulating throughout the medieval period, which recensions of the Vulgate?

For questions like these, the standard in the field is one book: Richard Marsden’s The Text of the Old Testament in Anglo-Saxon England (Cambridge, 1995). I was rereading this book, and I found that Marsden does briefly discuss Judith in several places, but there is much more room to pursue my subject. While Bede quotes from Judith a few times, this has hardly been noted (only a few entries in Fontes Anglo-Saxonici indicate correspondences); similarly, Aldhelm quotes Judith (from a Latin translation of the Septuagint), and Marsden gives a brief treatment. Some of these quotations are not included in my corpus, since not all of them are concerned with Judith, but only use the words from the biblical book for other aims. This is the difference between Bede and Aldhelm: Bede quotes, but does not cite, and his uses do not show direct interest in Judith but in the words as scripture; Aldhelm provides extended discussion of Judith as a figure and exemplar, embedding his quotation within this in his prose De uirginitate.

Reading Marsden’s work, I discovered another use of Judith, wholly unknown to me before, in a collection by Alcuin known as De laude Dei (probably compiled c.790-93).* This was an exciting discovery. This florilegium of biblical, patristic, poetic, and liturgical texts survives in two manuscripts: Bamberg, Staatsbibliothek, Msc. Patr. 17 (s. xi, Mainz), folios 133r-162r; and El Escorial B.IV.17 (s. ix3/4, S France), folios 93r-108r. Yet it has never been published. This led me to the work of Donald A. Bullough,** who was working on an edition when he passed away, and David Ganz, who seems to have picked up Bullough’s work.*** Fortunately, the Bamberg manuscript is digitized and online here, and I’ve been able to consult it. Alcuin dedicates a section of his collection to selections from Judith (titled “De libro Iudith”), and this has yielded some interesting intersections with my other material for this project (more to come when my article is completed). Although there’s no evidence that Alcuin’s text circulated in Anglo-Saxon England (apparently neither of the two manuscripts originates from or made it to England), this is still a major instance of an important Anglo-Saxon interacting with the biblical book of Judith. I’ve added this text to the corpus, and it now has an item page on the Omeka site, with a transcription of the selections from the Bamberg 17 manuscript.


* Discussed in his Text of the Old Testament, 222-35.

** Alcuin: Achievement and Reputation, Being Part of the Ford Lectures Delivered in Oxford in Hilary Term 1980 (Leiden, 2004).

*** “Le De Laude Dei d’Alcuin,” in Alcuin, de York à Tours: écriture, pouvoir et réseaux dans l’Europe du haut Moyen Âge, ed. P. Depreux and B. Judic, Annales de Bretagne et des Pays de l’Ouest 111.3 (Rennes, 2004), 388-91.

Digital Tools on the Cheap at NERCOMP

I spent yesterday at a workshop on “Doing Digital Humanities on a Shoestring Budget,” hosted by the Northeast Regional Computing Program. Since NERCOMP hosts events with a digital humanities bent to them, I’ve known about it for a while now, but hadn’t attended an event. A while back I was invited to present based on my work on this project, and I was happy to accept. It was a great event, exemplifying what I like about small, regional conferences in digital humanities: collegiality and collaboration. I met a number of people who were interested in my project, wanted to know more, and had suggestions for me.

What I presented was a hybrid of sorts. First, I gave an overview of my project (with which readers of this blog are familiar), talking about where I started and how I got to where I am now. Then I offered some general thoughts about what I’ve learned about digital humanities and digital projects generally, based on my trial-and-error experiences with this project. Here are my slides (in pdf format), which I think stand on their own without my extended discussion of individual points.

Publishing Text: Workshop and Goals

This past weekend, I attended a workshop on Publishing Text for a Digital Age, an NEH-funded Institute for Advanced Technologies in the Digital Humanities in collaboration with the Open Philology Project at the University of Leipzig and the Perseus Project at Tufts University. While there, I presented part of the Judith project, highlighting the Omeka archive and discussing my future goals with textual editing. I had an excellent time, met some great people, and received some extremely helpful feedback. So this post should be considered a public thanks to all involved.

My biggest goal on the textual editing side of the project is to create a single-text edition and facsimile of the single Anglo-Saxon manuscript of Hrabanus’ Commentarius, Arras, Bibliothèque Municipale, 764. Ultimately, I hope to create a dual facsimile and encoded text (XML/TEI) presentation. So far, I’ve been working from microfilm (in the Anglo-Saxon Manuscripts in Microfiche Facsimile series), comparing it to the most recent critical (collated) edition of the text (by Simonetti), but to do a proper job, I need high-resolution images of the manuscript. All of this means that I’ll need to obtain digital photographs as well as permissions to publish them online (under a CC license, open access). While I continue marking up the text that I’ve created from the microfilm, I’ll also pursue my manuscript needs. It’s proving a bit difficult, since the Bibliothèque Municipale in Arras is a smaller archive, and I’m still trying to follow leads to get in contact with the proper people to help me.

Having said all that, here are the slides that I used for my presentation, giving a general outline of what I said:

Those who attended the Publishing Text workshop were very helpful in their suggestions about all of this. So I’m getting back to my work with renewed vigor.

GIS Developments

This post marks a long break in the silence on this blog, although I have been at work on the project in the past few months. One area of development has been with the GIS-related aspect of the project. Since my goal is to develop some robust mapping tools for examining the Anglo-Saxon production and circulation of manuscripts containing texts in my corpus, there is a lot to do here.

Much of this has been facilitated by working with my friend and colleague Megg Goodrich (another graduate student at UConn), who has an excellent background and training in geography and GIS software. (Megg deserves many thanks: I am extremely grateful for the work we’ve accomplished so far, as well as for all that I’ve learned from her about the intricacies of GIS.) What I have learned about using advanced GIS software is this: there is a lot of planning before the map is even begun. We have been busy investigating, playing with, and customizing shape files, sorting the data that I have on the manuscripts, and organizing that data into proper, software-readable schema.

The first plan of attack that Megg and I established is to start with manuscripts still held in modern British libraries: 66 total manuscripts, representative of most (though not all) of the texts in the project corpus. Here is the spreadsheet for this dataset (in Google Drive). In it, we account for a number of elements that we want to incorporate into our mapping:

  • Location (City)
  • Library
  • County
  • Coordinates (lat/long)
  • Modern manuscript number or shelfmark
  • References to catalogues by Gneuss (Handlist of Anglo-Saxon Manuscripts) and Ker (Catalogue of Manuscripts Containing Anglo-Saxon)
  • Number of texts from the corpus contained in each manuscript
  • Titles of those texts
  • Language(s) of those texts
  • Approximate date (or range) for each manuscript
  • Location of origin for each manuscript
  • Coordinates for those origin locations (not completed yet)

This spreadsheet sets up our data for importing and working with in software such as Esri’s ArcGIS, which will be our primary focus.
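To give a sense of what “software-readable” means in practice, here is a small Python sketch that reads a CSV export of such a spreadsheet and converts the numeric columns before handing the rows on. The header names and the sample row are invented for illustration and do not reproduce the actual spreadsheet.

```python
import csv
import io

# Hypothetical CSV export; header names and the sample row are invented.
SAMPLE = """city,library,shelfmark,lat,long,texts,date_range
London,Example Library,Example MS 1,51.53,-0.13,2,s. x/xi
"""

def load_rows(handle):
    """Read rows and convert numeric columns so mapping software gets numbers."""
    rows = []
    for row in csv.DictReader(handle):
        row["lat"] = float(row["lat"])
        row["long"] = float(row["long"])
        row["texts"] = int(row["texts"])
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE))
print(rows[0]["shelfmark"], rows[0]["lat"], rows[0]["long"])
```

A check like this catches the common problems (coordinates typed as text, a stray blank cell) before the data ever reaches a GIS package.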

This is not the place to detail everything we have accomplished in ArcGIS, or everything that needs to be done or that we plan to do, but we are progressing. Instead, I want to present a mapped version of our data in another GIS tool, Google Maps Engine Lite:

Manuscripts Containing Judith Texts in Modern British Libraries, created with Google Maps Engine Lite.


While this tool does not have the same robust capabilities of ArcGIS, it does have some nice features–especially its accessibility and ease of mapping data quickly. So here is a link to a dynamic map of the dataset we have so far. Expect more mapping progress in the future.

Seeking Peer Review

I’ve been thinking about peer review a lot lately. This is likely because evaluation of the “Studying Judith” project is one of the next big steps that I want to pursue–now that I have the archive published, as well as some preliminary results of text-mining. There is obviously a lot of discussion about peer review right now. One prominent work on this subject is Sheila Cavanagh’s article, “Living in a Digital World: Rethinking Peer Review, Collaboration, and Open Access” (JDH 1.4 (Fall 2012)). Cavanagh provides a salient discussion, ranging across many examples of digital projects, and raising many issues regarding evaluation. Especially poignant is her observation that “The web certainly can serve as an electronic vanity press, but it can also facilitate rapid and revisable dissemination of important scholarly material.”

Much of the scholarly discussion about peer review is particularly focused on creating standards for tenure and promotion evaluation. For example, Emory College recently released a memorandum “Regarding the Evaluation of Digital Scholarship,” adopted November 2013 for the purposes of tenure and promotion cases. The Carolina Digital Humanities Initiative at UNC is collecting links to other contributions on this subject here–several of which are documents about reviewing digital projects for institutional tenure and promotion cases.

My own thinking has gone in a different direction. While I do care about tenure and promotion, and about the ways that the system needs to undergo reform (not just because of digital culture), it is not the primary issue I face with this project. Perhaps my concerns differ because I do not yet have a faculty position, nor the hope of tenure to go along with it, nor the subsequent pressure to produce scholarship that will come under evaluation.

Right now, I care about the project for its role in medieval studies and digital humanities, most specifically, what I can contribute to scholarship, teaching, and generally accessible knowledge for a subject about which I care. I am, in other words, more concerned about peer review for the sake of the project. I realize, of course, that any peer review will benefit future tenure and promotion work, or any other institutional evaluations of my scholarship. But at this point I’m most interested in gaining insights for quality control and suggestions to improve the project in future work.

So I’m sending out a formal call for peer review. I’ve put the project up on DHCommons in the hopes for feedback from others there (and to generate interest for collaborating as I move forward), but I’m also casting my net widely. I’m especially excited to gain input from those in medieval studies broadly, Anglo-Saxon studies specifically, and anyone practicing digital humanities. If you’re interested in reviewing the Omeka archive, please contact me–via the comments on this post, the contact page, email, or on Twitter.

Corpus Visualizations, Macroanalysis, and Close Reading

In what follows, I present some more results and thoughts from my text mining[1]—what Franco Moretti calls “distant reading” and what Matthew L. Jockers calls “macroanalysis”—and the data visualizations that have come out of that.[2] In doing so, I want to reflect upon (and tentatively propose some claims about) a set of related issues about using these data and visualizations for analysis: how they help us to understand the corpus of texts behind them through new means of computation; how they represent knowledge hermeneutically, as constructed, mediating, interpretive structures; and how they necessarily require a return to close reading to reach humanistic conclusions about the text corpus. I suggest that being simultaneously aware of these facets of dealing with data and visualizations is closely linked with traditional literary methodologies, which promote close reading alongside critical theoretical discourses. The goal in returning to close reading in our analyses of text-mining, then, is to be aware of and engaged with the interpretive discourses embedded in them, and to return the results of digital methods to the critical eye of humanistic criticism.

Theories of Data and Visualizations

The first facet of considering data and visualizations that I want to address is that of remediation—the crossing of media boundaries that drives and lies at the heart of digital culture. Recent scholarship has sought to conceive of visualizations as not only the results of data but also interpretive in their basic structures, thus mediating through a particular type of representation that should not be taken at face value, as there is an inherent tension between humanistic study and epistemological assumptions about scientific empiricism.[3] Johanna Drucker acknowledges this tension by discussing the notion of “situatedness” for the scholar:

By situatedness, I am gesturing toward the principle that humanistic expression is always observer dependent. A text is produced by a reading; it exists within the hermeneutic circle of production. Scientific knowledge is no different, of course, except that its aims are toward a consensus of repeatable results that allow us to posit with a degree of certainty….[4]

The implication, then, is that visualizations should be read critically. As Drucker has suggested elsewhere, “Visual epistemology has to be conceived of as procedural, generative, emergent, as a co-dependent dynamic in which subjectivity and objectivity are related.”[5]

The second, related facet of considering data and visualizations that I want to address is the way that each act of macroanalysis necessarily prompts a return to close reading. As Trevor Owens points out, “data offers [sic] humanists the opportunity to manipulate or algorithmically derive or generate new artifacts, objects, and texts that we also can read and explore. For humanists, the results of information processing are open to the same kinds of hermeneutic exploration and interpretation as the original data.” In other words, “any attempt at distant reading results in a new artifact that we can also read closely.”[6] A similar line of thinking arises from exploring the implications of the results we gain from the type of macroanalysis promoted by Moretti and Jockers. Throughout his book, Jockers emphasizes that macroanalysis is not a replacement for close reading, but a helpful counterpoint to it. Even more, we need to come to terms with the fact that the interpretation of data generated from text-mining is in itself an act of close reading: to make sense of macroanalysis, humanists necessarily engage in exegesis on a micro-analytical level. Jockers even hints at this idea when he writes that network data from macroanalyses “demand that we follow every macroscale observation with a full-circle return to careful, sustained, close reading.”[7]

The two methods of approaching these data and visualizations that I propose—using digital tools for analyzing the text corpus as well as critically questioning the results—are not mutually exclusive. Instead, I suggest that such critical questioning should lead scholars to consider a continual process of moving between multiple perspectives of reading, including distant as well as close (macro- and micro-analysis), in order to glean productive analyses of texts.

Word Frequencies

In this section, I present the data used as the basis of the visualizations, focused on word frequencies and connections between the most commonly used words. In a previous post I already discussed my methods and some preliminary findings for the Latin Vulgate Judith and Hrabanus Maurus’ Commentarius in Iudith, so these findings form a follow-up.

First, some words about the text corpus. Together, the entire corpus of texts considered in this project contains 232,725 characters and 34,609 words. I have divided the entire corpus of texts (49 in all, available on the Omeka site, most of them with translations) into two separate, though connected, corpora, since I have found that this yields more meaningful results. The reason for this is that the Latin works make up such a significant amount of the total that they skew what we can see about the Old English texts. For the most part, the corpus is split by language, one for Latin and the other for Old English, though the exception is mixed (macaronic?) glosses, which have been included in the Latin corpus since they do not generally contain extensive Old English lexical items. The results of splitting into two corpora are as follows: the Latin corpus contains 199,233 characters and 29,411 words; the Old English corpus contains 33,492 characters and 5,198 words.

In the following word frequency tables are data for the 30 most frequent words (not lemmatized) in the Old English and Latin corpora, excepting stopwords; data include both numbers of instances as well as relative frequencies (percentages of the whole). On the methods I used to generate these data, as well as the Latin stopwords list, see the previous post; the Old English stopwords list that I created is based on frequency data for the entire Old English corpus.
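For readers who want to see the general shape of such a count, here is a minimal Python sketch. It is not the tool I used (see the previous post for that); the function name, the tiny stopword set, the variant map, and the sample sentence are all assumptions for illustration, and I have assumed that “relative frequency” means a percentage of all tokens, stopwords included.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords, variants=None, top=30):
    """Count non-lemmatized word forms, skipping stopwords.

    Returns (word, count, percent_of_all_tokens) tuples. `variants` maps
    alternate spellings onto one form before counting.
    """
    variants = variants or {}
    # \w is Unicode-aware in Python 3, so æ, þ, and ð count as letters.
    tokens = re.findall(r"\w+", text.lower())
    tokens = [variants.get(t, t) for t in tokens]
    total = len(tokens)
    counts = Counter(t for t in tokens if t not in stopwords)
    return [(w, n, round(100 * n / total, 2))
            for w, n in counts.most_common(top)]

# Invented sample sentence, not a quotation from the corpus.
sample = "et filii israhel et filii israel clamaverunt"
table = word_frequencies(sample, stopwords={"et"},
                         variants={"israel": "israhel"})
for word, count, percent in table:
    print(word, count, percent)
```

Note how many decisions are already baked in: what counts as a word, which spellings are merged, and what the percentage is a percentage of.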

Click to enlarge. N.B. Because of typographical variations, I have combined the following frequencies for accuracy: israhel/israel; and olofernis/holofernis.

These frequency data may also be visually displayed in a graph, created in Tableau Public 8.0 (though you’ll have to click through to see the interactive graph): here.

I would like to note that even these data, as scientifically empirical as they appear, and as much as they create a sense of reliability of data analysis for the viewer, are constructed. For example, at the most basic level of my methods, the stopwords lists are themselves interpretive: the lexical items selected (or, more precisely, selected only to be omitted from analysis) were chosen based on linguistic assumptions about which specific parts of language are to be read as “significant” and “insignificant” for scrutiny. Yet even high-frequency, short function words may reveal meaning, as Moretti has demonstrated by tracing the roles of definite and indefinite articles across eighteenth- and nineteenth-century book titles.[8] Just as striking, upon reflection, is the fact that I have been guided by my own training in languages and linguistics; this is not an empirical approach, but a humanistic one. There is, as Drucker has suggested,[9] a seeming tension between the humanistic and scientific assumptions inherent in these methods, but they also allow for productive analysis when used together.

Collocate Clusters

The series of visualizations to follow were created by uploading the Latin and Old English texts into the Links tool in the Voyant Tools suite, in order to create presentations of collocate clusters.[10] This tool uses statistical analysis to find the numbers of lexical collocates—sequences of words occurring together—and to present these occurrences as a network based on the frequencies of such occurrences. Thus, each lexical item is presented as a node (connecting point), with networks drawn between closely associated collocated words. For these visualizations, I have directed the Links tool to ignore stopwords, in order to render networks only of the most frequent terms in the corpus. In all of the visualizations below, the collocate clusters are presented with each node size representing the frequency of that lexical item in the corpus.
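I do not know the exact statistics behind the Links tool, so the following Python sketch substitutes a deliberately simple technique: counting how often two words fall within a small window of each other and keeping the pairs that recur. The function name, the window size, the threshold, and the sample sentence are all my own assumptions.

```python
import re
from collections import Counter

def collocate_edges(text, stopwords, window=2, min_count=2):
    """Count unordered word pairs occurring within `window` tokens of each other.

    Stopwords are dropped before windowing, so words separated only by a
    stopword are treated as neighbours. Returns {(word_a, word_b): count}
    for pairs seen at least `min_count` times; each pair is a network edge.
    """
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if t not in stopwords]
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Invented sample sentence, not a quotation from the corpus.
sample = "filii israhel et omnis populus et filii israhel"
edges = collocate_edges(sample, stopwords={"et"}, window=1)
print(edges)
```

Each surviving pair becomes a line in the network, and a word's node can then be sized either by its overall frequency or by the number of edges it participates in.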

First, I present a visualization of the collocate clusters detected when the Latin text corpus was uploaded as individual files, for which each distinctive text and passage constituted an individual plain-text document.[11]


Collocate clusters in the Latin text corpus, with stops removed and nodes representing lexical frequencies. Click to enlarge.

Second, I present a visualization of the collocate clusters detected when the Latin text corpus was uploaded as a single file, constituting every text compiled into one plain-text document.[12]


Collocate clusters in the Latin text corpus (as one file), with stops removed and nodes representing lexical frequencies. Click to enlarge.

As can be seen, the connections between clusters across texts are more discernible in the second image (when the input is one file), allowing for a different type of analysis that emphasizes intertextuality even at a lexical level. The same issue is also apparent when comparing visualizations of the Old English text corpus, again uploaded as individual files and as a single file.[13]


Collocate clusters in the Old English text corpus, with stops removed and nodes representing lexical frequencies. Click to enlarge.

Collocate clusters in the Old English text corpus (as one file), with stops removed and nodes representing lexical frequencies. Click to enlarge.


It should also be apparent that the arrangements of nodes in these visualizations are significant—in both cases, I have arranged the nodes so that synonymous terms (e.g. deus, dei, dominus, domini, god, godes, etc.) may be viewed in proximity, to highlight the connections between those and other terms. These visualizations are further manipulable by changing node sizes (in the tool options) from representing lexical frequencies (as in these images) to representing numbers of associations each lexical item has to others. These arrangements, then, are not random, but purposeful and hermeneutic; these visualizations of data do present statistical associations between lexical items, but the arrangements of these clusters are presented as already processed through my own humanistic interpretations. While empirical, scientific approaches (on which many digital humanities endeavors rest) value quantifiable, verifiable, and repeatable results, these visualized collocate clusters hardly deliver significant results within these parameters. These claims are not meant to denigrate such visualizations or the work that goes into creating them, but to point out that, in presenting useful ways of seeing the corpus, they do so as interpretive, qualitative, and graphically constituted representations of data.[14]

If we step away from the epistemological assumptions of scientific inquiry, however, the long-standing humanistic tool of close reading offers a method of teasing out significance from these visualizations in ways that do not depend upon empirical notions of certainty.[15] In this way, the significance of these visualizations may be examined despite their being, or perhaps even because they are, interpretive, qualitative, and graphically constituted. What can we say about these clusters?

The most obviously striking element is the confirmation of connections surrounding the notion of collective identity. The centrality of Israhel (and variants) is apparent from its common occurrence across the corpus—it is the fourth most frequent word, with 91 instances—and the collocate clusters only further emphasize this significance: it is associated directly with Iudith, words for the deity (deus, dominus, and their variants), as well as filii (children), populum (people), and omnes (all), the latter creating further indirect associations with ciuitates (cities), omnem (all), and (further out) terram (land). I mentioned the significance of Israelite identity in my previous post, since it is central to the biblical book of Judith, but it is just as important to see these connections upheld when the whole Latin corpus is examined. Furthermore, these same associations occur even in the Old English corpus: ealle (all) is directly associated with Iudith, god and gode (both for god), as well as lande (land), through which further connections are Iudea, folc (people), and (from folc) Israhela. One more set of text-mining results is relevant to these findings. When I analyzed the Latin corpus by using the List Word Pairs tool at TAPoR,[16] the results revealed that the two most frequent word pairs were filii Israhel (18 instances) and omnis populus (15 instances).
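Counting adjacent word pairs is simple enough to sketch in a few lines of Python, which also handles Old English characters without complaint. This is not the TAPoR tool itself, only an illustration of the same kind of count; the function name and the sample strings are assumptions.

```python
import re
from collections import Counter

def word_pair_stats(text):
    """Count ordered pairs of adjacent words and summarize them."""
    # \w is Unicode-aware in Python 3, so þ, ð, and æ stay inside words.
    tokens = re.findall(r"\w+", text.lower())
    pairs = Counter(zip(tokens, tokens[1:]))
    return {
        "total": sum(pairs.values()),
        "unique": len(pairs),
        "once": sum(1 for n in pairs.values() if n == 1),
        "top": pairs.most_common(2),
    }

# Invented sample strings, not quotations from the corpus.
stats = word_pair_stats("filii israhel et filii israhel")
print(stats["top"][0], stats["total"], stats["unique"], stats["once"])

old_english = word_pair_stats("þa wæs Iudith")
print(old_english["top"])
```

Unlike the collocate networks, these pairs are ordered and strictly adjacent, which is why a formula like filii Israhel stands out so clearly.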

Two implications of these analyses may be followed. First, the results of macroanalysis, ranging from data about word frequencies and word pairs to visualizations of collocate clusters, are consistent. Yet, second, they also prompt interpretation that most benefits from contexts provided by a long tradition of humanistic study of the subject at hand. To modern scholars, the issue of collective identity is obviously key to the Israelite peoples, and no less so in the climate of anxiety during the late Second Temple period (c.200 BCE to 70 CE), when the Greeks, Egyptians, Romans, and various other Middle Eastern empires vied for control of their homeland. This is, in fact, thematically at the root of the biblical story of Judith, hence the connections I previously pointed out. Yet the clustering of Latin and Old English texts connected to the Vulgate Judith also shows that medieval people continued to share this concern. In other words, medieval people, including the Anglo-Saxon authors who composed the Old English texts, were no less able to identify a key issue than modern scholars have been. This also raises questions about how and why Anglo-Saxons would have capitalized on Old Testament themes of collective identity, as Bede did in his Historia ecclesiastica when linking Anglo-Saxons to the Israelites as continuations of the chosen people of the Old Testament covenant. While this is not the place for exploring these issues, they do provide provocative avenues for further study that may fruitfully emerge from the critical engagement between digital tools and close reading of the results through a humanistic lens.

[1] Data and visualizations referred to in this article may be viewed and downloaded at the open access GitHub repository for this project, uploaded on October 24, 2013, at

[2] See, most recently, Franco Moretti, Distant Reading (London: Verso, 2013); and Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (Urbana, IL: U of Illinois P, 2013).

[3] See esp. essays included in the special issue of Poetess Archive Journal 2.1 (2010); Johanna Drucker, “Humanities Approaches to Graphical Display,” Digital Humanities Quarterly 5.1 (2011); and “Humanistic Theory and Digital Scholarship,” Debates in the Digital Humanities, ed. Matthew K. Gold (Minneapolis, MN: U of Minnesota P, 2012), 85-95, and online at; as well as Laura Mandell, “How to Read a Literary Visualization: Network Effects in the Lake School of Romantic Poetry,” Digital Studies 3.2 (2012); and Tanya Clement, “Distant Listening or Playing Visualizations Pleasantly with the Eyes and Ears,” Digital Studies 3.2 (2012),

[4] “Humanistic Theory,” 91.

[5] “Graphesis,” Poetess Archive Journal 2.1 (2010), at

[6] “Defining Data for Humanists: Text, Artifact, Information or Evidence?” Journal of Digital Humanities 1.1 (2011),

[7] Macroanalysis, 168.

[8] Distant Reading, 179-210.

[9] “Humanistic Theory and Digital Scholarship.”

[10] Sinclair, Stéfan, and Geoffrey Rockwell, “Collocate Clusters,” Voyant, 2013,

[11] The Links tool with the Judith Latin corpus data already uploaded may be accessed here:

[12] The Links tool with the Judith Latin corpus data (as a single file) already uploaded may be accessed here:

[13] The Links tool with the Judith Old English corpus data already uploaded may be accessed here: The Links tool with the Judith Old English corpus data (as a single file) already uploaded may be accessed here:

[14] Cf. Drucker, “Humanities Approaches to Graphical Display.”

[15] Here I take my cues specifically from Drucker, “Humanistic Theory and Digital Scholarship.”

[16] Taporware Project, McMaster University, 2013, For the Latin corpus, there are 22,097 unique word pairs, with 29,432 word pairs in total; 17,889 word pairs occurred once and 2,990 word pairs occurred twice. With stops removed, there are 14,845 unique word pairs, and there are 17,407 word pairs in total; 12,843 word pairs occurred once and 1,670 word pairs occurred twice. Unfortunately, lack of support for Unicode (UTF-8) precludes using this tool effectively with Old English, due to special typographical characters (Æ/æ, Þ/þ, Ð/ð, Ƿ/ƿ). This is an instance that demonstrates that not all tools play nicely with medieval languages and literature.

Scholars’ Collaborative Presentation

Yesterday I presented my work in progress at a brown bag talk hosted by UConn’s Scholars’ Collaborative, and had a great time with it. The feedback and discussion after my presentation were very useful, and provided some more avenues for my future thinking and work on this project. I thought I’d share the Prezi that I used, so here’s the link. Thanks to all who attended!

Omeka Launch

Today, I officially launched phase one of this project: the Omeka archive (constellation) of texts, hosted by the University of Connecticut Scholars’ Collaborative. I’m excited especially to announce this during Open Access Week, since I want all of this project to eventually be online, open-access–it is, after all, intended for others.

So what else do I want to pursue with this project? The short answer is quite a lot. The long answer I’ll give as a list with explanations (in no particular order):

Exploring peer review options and seeking peer reviewers (formal and informal) to give feedback and to help improve the project as it progresses

Text mining the corpus that I’ve compiled, which I’ve already started in the R programming language (probably the most viable next step)

Mapping the texts as they circulated in manuscripts across time and geography in Anglo-Saxon England (I’m currently playing with Neatline in Omeka)

Tagging the entire corpus of texts in XML/TEI, for long-term posterity as standardized, edited versions

Translating the longer texts–the Old English poem Judith, Ælfric’s Old English sermon De Judith, and Hrabanus Maurus’ Commentarius in Iudith–for addition to the Omeka site

Editing Hrabanus Maurus’ Commentarius in Iudith in an XML/TEI-marked up version for sustained presentation online

Creating interactive, user-based tools for exploration and analysis (corpus search, etc.)

I expect that some of these could take quite a bit of time, so I’ll be busy with this project for the next several years. In the meantime, enjoy the archive, and feel free to contact me if you have any feedback.

Some Text Mining Results


In this post, I want to introduce some data acquired by running text mining on a few documents in my Judith corpus. To set some baselines for future analyses, I first began my work with the primary source at the heart of my corpus, Jerome’s Vulgate translation of the biblical Judith. Because I’ve focused so much on the ninth-century Commentarius in Iudith (composed c.830) by Hrabanus Maurus (c.780-856), and because Hrabanus gives such comprehensive comments on the Vulgate, I then worked out some preliminary analyses on this text. The Commentarius ended up being a surprisingly good text for baseline comparisons that I hope to further extrapolate out to the other corpus texts. I also found it useful, for a secondary round of word frequency analysis on this text, to strip out all of the biblical lemmata that precede Hrabanus’ comments; for this, I eliminated only the first instance of biblical quotation, ignoring subsequent words or phrases that referred back to the biblical passage in the body of Hrabanus’ comments. (For notes about my process, software used, and variance issues raised by these analyses, see the bottom of this post.)

First, I want to outline my process and the software that I used for obtaining the following results. I initially obtained word frequency data using the online Lexos tools provided by the Wheaton College Lexomics project. After using the Lexos tools, I began working with the Python programming language (in Komodo Edit), following tutorials at The Programming Historian 2 and Codecademy. While I found these resources helpful for learning basic coding, I shifted to using the R programming language (in RStudio), following lessons by Matthew L. Jockers in his Text Analysis in R for Students of Literature (forthcoming), which he provided on his website as a draft manuscript for review; I found Jockers’s approach most helpful for beginning basic analysis and gaining results even in the first few lessons. I again obtained (and confirmed) the same frequency data and incorporated them into my subsequent analyses with R. The stopwords list that I used throughout my analyses was created by Paul Evans, derived from a concordance of the most common function words in Latin, as in Wortkonkordanz zum Decretum Gratiani, ed. Timothy Reuter and Gabriel Silagi, 5 vols., MGH, Hilfsmittel 10 (Munich: Monumenta Germaniae Historica, 1990), 1:ix-x; my thanks to Paul for sharing this list with me.
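For readers who want to try something similar, the basic step of counting word frequencies while excluding stopwords might look like the following in Python. This is a rough sketch and not a record of the Lexos or R workflow described above: the tiny stopword set is a hypothetical stand-in for the full Latin list, and the sample line is invented.

```python
import re
from collections import Counter

# Hypothetical stand-in for the Latin stopword list; the real list is longer.
STOPWORDS = {"et", "in", "ad", "est", "non", "cum"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase word tokens, skipping any token in the stopword set."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Invented sample line for illustration, not a quotation from the corpus.
sample = "et dixit iudith ad populum et populum in israhel"
freqs = word_frequencies(sample)

# most_common(30) lists up to the 30 most frequent remaining words.
print(freqs.most_common(30))
```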

Word Frequencies

First, here is some basic data on word frequencies in the Vulgate version of Judith and Hrabanus’ Commentarius. The tables in the image below provide data for the 30 most frequent words in the texts, excepting stopwords.

(N.B. Because of the scribal variation in the manuscript version of Hrabanus' Commentarius, I have combined the following frequencies for accuracy: olofernis/holofernis; ecclesia/aecclasia/aeclesia; and ecclesiae/aecclasiae/aeclesiae.)
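Combining variant spellings of this kind can be done after counting by mapping each variant to a single form. The sketch below assumes a hand-made mapping based on the variants listed in the note; the choice of which spelling serves as the combined form, and the counts in the example, are invented for illustration.

```python
from collections import Counter

# Variant spellings mapped to one combined form (the target forms are an assumption).
VARIANTS = {
    "olofernis": "holofernis",
    "aecclasia": "ecclesia",
    "aeclesia": "ecclesia",
    "aecclasiae": "ecclesiae",
    "aeclesiae": "ecclesiae",
}

def merge_variants(freqs, variants=VARIANTS):
    """Return a new counter in which variant spellings are summed under one form."""
    merged = Counter()
    for word, n in freqs.items():
        merged[variants.get(word, word)] += n
    return merged

# Invented counts for illustration only.
raw = Counter({"holofernis": 3, "olofernis": 2, "ecclesia": 4, "aeclesia": 1})
merged = merge_variants(raw)
print(merged["holofernis"], merged["ecclesia"])
```

Merging after counting, rather than rewriting the text itself, leaves the manuscript spellings untouched in the source files.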


What jumps out at me from these frequencies are two main themes in the Vulgate text: place (Israhel, terram, ciuitatem, Hierusalem) and collective identity (Israhel, omnes/omnibus/omnis/omnem, filii, Assyriorum, populus, ciuitatem). These same terms largely carry over into the medieval commentary. The cultural links between these two notions are also well known–particularly for biblical and medieval writers–further solidifying the significance of these themes represented together through the lexis of these texts.

Hrabanus’ Commentarius also provides some significant points for comparison, especially for showing fissures and connections between the Jewish and Christian characteristics of these texts. Particularly telling in this respect is the emergence of Christian concepts and terms such as Christi, ecclesia/ecclesiae (in all of their variants), uitae, euangelio, and fidei. Even more pronounced is the fact that when the lemmata are omitted from the Christian commentary, Israhel drops out of the top 30 words altogether, eclipsed by the typological interpretation of the people of God as Christian ecclesia/ecclesiae. Looking at only commentary and no lemmata, in fact, we see that instances of the word Israhel drop to only eight occurrences (relative frequency of 0.001097093) in the whole text.
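The relative frequency figure can be unpacked with a little arithmetic. Assuming that relative frequency here means a word's count divided by the total number of tokens counted (an assumption, since the post does not define it), the sketch below shows the calculation in both directions. The function names are invented for illustration.

```python
def relative_frequency(count, total_tokens):
    """Share of all counted tokens taken up by one word."""
    return count / total_tokens

def implied_total(count, rel_freq):
    """Work backwards from a count and a relative frequency to a token total."""
    return round(count / rel_freq)

# Eight occurrences at a relative frequency of 0.001097093.
total = implied_total(8, 0.001097093)
print(total)
print(round(relative_frequency(8, total), 9))
```

Under that assumption, the reported figures would correspond to a total of roughly 7,292 counted tokens; this is an inference from the two numbers, not a total given in the post.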

Distribution Plots

The next data set extrapolates some of these results. For the following, I created distribution plots of single words across these texts. In other words, the graphs below provide data about each word as it appears from beginning to end of each text: each line represents one instance of the word. I thought it would be fruitful to plot the most frequent characters–including Israhel and ecclesiae, which I would argue represent types of collective characters (representatives of God’s people) in these texts. I place these plots here together not only to show their own values but also for comparison.
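The idea behind such a plot can be sketched without any plotting library: record the token position of every instance of a word, then mark those positions along a strip representing the text from beginning to end. This is only an illustrative Python approximation, not the R code used for the plots; the function names and sample text are invented.

```python
import re

def word_positions(text, target):
    """Return the token indices where target occurs, plus the total token count."""
    tokens = re.findall(r"\w+", text.lower())
    positions = [i for i, t in enumerate(tokens) if t == target.lower()]
    return positions, len(tokens)

def strip_plot(positions, total, width=40):
    """Draw a one-line text plot: '|' marks a cell containing at least one instance."""
    if total == 0:
        return "." * width
    cells = ["."] * width
    for p in positions:
        cells[p * width // total] = "|"
    return "".join(cells)

# Invented sample text for illustration only.
sample = "iudith a b c iudith d e f g iudith"
positions, total = word_positions(sample, "Iudith")
print(strip_plot(positions, total, width=10))
```

In a graphical version, each recorded position would become one vertical line on a horizontal axis running the length of the text.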

Distribution plots for Vulgate Judith

[Distribution plots: Nabuchodonosor, Holofernis, Iudith, Israhel]


Distribution plots for Hrabanus’ Commentarius

[Distribution plots: deus, dominus, Nabuchodonosor, Holofernis, Iudith, Israhel, aecclesiae, aecclesia, Christi]

I plan to run further plotting of some of the other terms, especially for comparison of the themes of place and collective identity that I indicated above. What I do want to highlight for the moment is the further emphasis on typology that is visualized in the plots for Israhel and ecclesia/ecclesiae throughout Hrabanus’ Commentarius. Placed together for comparison as they are above, Israhel and ecclesiae especially map well together through the text: they first appear in approximately the same place in the text, their plots are similar up until frequencies are diminished (only two instances of Israhel, none of ecclesiae) in the middle of the text, and sustained instances of the two emerge again in parallel fashion in the last quarter of the text. Again, the link between Jewish Israelite identity and Christian ecclesiastical identity is apparent in Hrabanus’ Commentarius. While this type of typological interpretation is expected in Christian biblical commentary, these computer-derived data go a long way to putting the concept in perspective with a different means of analysis.
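One way to make a visual comparison like this more concrete is to divide the text into equal segments and count the instances of each word per segment. The Python sketch below does this by quarters; the positions and the total are invented numbers for illustration and are not taken from the Commentarius.

```python
def counts_per_segment(positions, total, segments=4):
    """Count how many token positions fall into each equal-length segment."""
    counts = [0] * segments
    for p in positions:
        index = min(p * segments // total, segments - 1)
        counts[index] += 1
    return counts

# Invented token positions in a hypothetical 100-token text.
israhel_positions = [5, 12, 48, 80, 85, 95]
ecclesiae_positions = [6, 14, 78, 90, 99]

israhel_quarters = counts_per_segment(israhel_positions, 100)
ecclesiae_quarters = counts_per_segment(ecclesiae_positions, 100)
print(israhel_quarters)
print(ecclesiae_quarters)
```

Two words with similar per-segment counts are distributed in roughly parallel fashion, although a comparison this coarse is no substitute for looking at the plots themselves.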

A Note on Variance Issues

Running these analyses did reveal some issues with working with Latin texts, especially in variants (as I point out in my notes above, especially concerning variant forms of Holofernis and ecclesia/-ae). Fortunately, the majority of texts in my whole corpus are standardized, though this does present a problem to keep in mind. I expect that even more issues of textual variance will crop up when I start working with Old English texts, since they are far from being uniform in dialects, even when presented in edited, modern, standardized forms. Related to this obstacle, I have yet to come across a satisfactory solution to variance in constructing an Old English stopwords list (see here for some discussion). Surely tackling such issues is a place for some innovation in coding.
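As a small starting point, the sketch below shows one way of tokenizing Old English in Python, which handles characters such as æ, þ, ð, and ƿ as ordinary letters, along with one possible simplification for reducing variance: folding ð into þ, since the two letters are largely interchangeable in Old English spelling. The folding rule is an assumption chosen for illustration, not a settled solution, and it would need to be weighed against what it erases.

```python
import re
import unicodedata

def tokenize_old_english(text):
    """Normalize to NFC, lowercase, and return runs of letters (no digits or underscores)."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.findall(r"[^\W\d_]+", text)

def fold_thorn_eth(token):
    """Treat eth as thorn so that spellings differing only in that letter match."""
    return token.replace("ð", "þ")

# The first line of the verse quoted at the top of this site.
tokens = tokenize_old_english("Hæfde ða gefohten foremærne blæd")
folded = [fold_thorn_eth(t) for t in tokens]
print(tokens)
print(folded)
```

Applying the same folding to both the texts and a candidate stopwords list would at least keep the two consistent with each other.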