Some Text Mining Results
In this post, I want to introduce some data acquired by running text mining on a few documents in my Judith corpus. To set some baselines for future analyses, I began with the primary source at the heart of my corpus, Jerome’s Vulgate translation of the biblical Judith. Because I’ve focused so much on the ninth-century Commentarius in Iudith (composed c.830) by Hrabanus Maurus (c.780-856), and because Hrabanus comments so comprehensively on the Vulgate, I then worked out some preliminary analyses of this text. The Commentarius turned out to be a surprisingly good text for baseline comparisons that I hope to extend to the other corpus texts. For a second round of word frequency analysis on this text, I also found it useful to strip out all of the biblical lemmata that precede Hrabanus’ comments; for this, I eliminated only the first instance of each biblical quotation, ignoring subsequent words or phrases that refer back to the biblical passage in the body of Hrabanus’ comments. (For notes about my process, the software used, and variance issues raised by these analyses, see the bottom of this post.)
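For readers curious about the mechanics of that stripping step, here is a minimal sketch of the idea in Python (not the workflow I actually ran). It assumes the biblical lemmata have already been identified as strings; the sample commentary and lemma below are invented for illustration, not quotations from Hrabanus. Only the first occurrence of each lemma is removed, so later back-references survive.

```python
def strip_lemmata(text, lemmata):
    """Remove only the first instance of each quoted lemma from the text."""
    for lemma in lemmata:
        text = text.replace(lemma, "", 1)  # count=1: first occurrence only
    return text

# Invented example: the opening lemma is removed, but the commentary's
# later echoes of its wording are left in place.
commentary = ("Arphaxad itaque rex Medorum. Hic Arphaxad rex Medorum "
              "significat diabolum, qui rex Medorum dicitur.")
lemmata = ["Arphaxad itaque rex Medorum."]
stripped = strip_lemmata(commentary, lemmata).strip()
```

Python’s `str.replace` takes an optional count argument, which makes “first instance only” trivial to express.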
First, I want to outline my process and the software that I used for obtaining the following results. I initially obtained word frequency data using the online Lexos tools provided by the Wheaton College Lexomics project. After working with the Lexos tools, I began working with the Python programming language (in Komodo Edit), following tutorials at The Programming Historian 2 and Codecademy. While I found these resources helpful for learning basic coding, I shifted to the R programming language (in RStudio), following lessons by Matthew L. Jockers in his Text Analysis in R for Students of Literature (forthcoming), which he provided on his website as a draft manuscript for review; I found Jockers’s approach most helpful for beginning basic analysis and gaining results even in the first few lessons. I again obtained (and confirmed) the same frequency data and incorporated them into my subsequent analyses with R. The stopwords list that I used throughout my analyses was created by Paul Evans, derived from a concordance of the most common function words in Latin, as in Wortkonkordanz zum Decretum Gratiani, ed. Timothy Reuter and Gabriel Silagi, 5 vols., MGH, Hilfsmittel 10 (Munich: Monumenta Germaniae Historica, 1990), 1:ix-x; my thanks to Paul for sharing this list with me.
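As an illustration of the word-frequency step itself, here is a minimal sketch in Python rather than the Lexos/R workflow I actually used; the toy text and stopword set below are placeholders, not my Latin corpus or the MGH-derived stopwords list.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords):
    """Tokenize on runs of letters, lowercase, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

# Invented sample text and a tiny stand-in stopword set
text = "omnes filii Israhel et omnes qui in Hierusalem erant"
stopwords = {"et", "qui", "in"}
freqs = word_frequencies(text, stopwords)
top = freqs.most_common(3)  # most frequent words, stopwords excluded
```

The same count-then-rank logic underlies the frequency tables below, whatever tool produces them.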
To begin, here are some basic data on word frequencies in the Vulgate version of Judith and in Hrabanus’ Commentarius. The tables in the image below provide data for the 30 most frequent words in the texts, excluding stopwords.
What jumps out at me from these frequencies are two main themes in the Vulgate text: place (Israhel, terram, ciuitatem, Hierusalem) and collective identity (Israel, omnes/omnibus/omnis/omnem, filii, Assyriorum, populus, ciuitatem). These same terms largely carry over into the medieval commentary. The cultural links between these two notions are also well known, particularly for biblical and medieval writers, further solidifying the significance of these themes represented together in the lexis of these texts.
Hrabanus’ Commentarius also provides some significant points for comparison, especially for showing fissures and connections between the Jewish and Christian characteristics of these texts. Particularly telling in this respect is the emergence of Christian concepts and terms such as Christi, ecclesia/ecclesiae (in all of their variants), uitae, euangelio, and fidei. Even more pronounced is the fact that when the lemmata are omitted from the Christian commentary, Israhel drops out of the top 30 words altogether, eclipsed by the typological interpretation of the people of God as Christian ecclesia/ecclesiae. Looking at the commentary alone, without lemmata, we see that instances of the word Israhel in fact drop to only eight occurrences (a relative frequency of 0.001097093) in the whole text.
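For anyone wondering where a figure like 0.001097093 comes from: relative frequency is simply a word’s raw count divided by the total number of tokens in the text. Working backwards from the numbers above, eight occurrences at that relative frequency imply a lemmata-stripped commentary of roughly 7,292 tokens; that token total is my own back-calculation for illustration, not a figure taken from the edition.

```python
def relative_frequency(count, total_tokens):
    """Raw occurrences of a word divided by the total token count."""
    return count / total_tokens

# 8 occurrences in a text of ~7,292 tokens (back-calculated, illustrative)
rel = relative_frequency(8, 7292)
```

Relative frequencies make counts comparable across texts of different lengths, which matters when setting the Vulgate and the much longer Commentarius side by side.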
The next data set builds on some of these results. For the following, I created distribution plots of single words across these texts. In other words, the graphs below show where each word appears from the beginning to the end of each text: each line represents one instance of the word. I thought it would be fruitful to plot the most frequent characters, including Israhel and ecclesiae, which I would argue represent types of collective characters (representatives of God’s people) in these texts. I place these plots together here not only to show their individual values but also to allow comparison.
Distribution plots for Vulgate Judith
Distribution plots for Hrabanus’ Commentarius
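For those interested in how such plots are built: a dispersion plot encodes the positions at which a word occurs, from the beginning (0.0) to the end (1.0) of the text, and draws one vertical line per occurrence. Here is a minimal Python sketch of that underlying calculation, with an invented scrap of Latin standing in for the real texts.

```python
import re

def dispersion(text, word):
    """Return the offsets of `word` as fractions of the text's token count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [i for i, t in enumerate(tokens) if t == word.lower()]
    return [i / len(tokens) for i in hits]

# Invented sample: three occurrences of "ecclesia" spread across the text
text = "ecclesia sancta est et ecclesia permanet in fine ecclesia"
positions = dispersion(text, "ecclesia")
```

Plotting a vertical line at each returned offset (with matplotlib in Python, or the equivalent in R) reproduces the kind of distribution plot shown above.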
I plan to run further plotting of some of the other terms, especially for comparison of the themes of place and collective identity that I indicated above. What I do want to highlight for the moment is the further emphasis on typology that is visualized in the plots for Israhel and ecclesia/ecclesiae throughout Hrabanus’ Commentarius. Placed together for comparison as they are above, Israhel and ecclesiae map especially well onto each other through the text: they first appear in approximately the same place, their plots remain similar up until frequencies diminish (only two instances of Israhel, none of ecclesiae) in the middle of the text, and sustained instances of the two emerge again in parallel fashion in the last quarter of the text. Again, the link between Jewish Israelite identity and Christian ecclesiastical identity is apparent in Hrabanus’ Commentarius. While this type of typological interpretation is expected in Christian biblical commentary, these computer-derived data go a long way toward putting the concept in perspective through a different means of analysis.
A Note on Variance Issues
Running these analyses did reveal some issues with working with Latin texts, especially their variants (as I note above, particularly concerning variant forms of Holofernis and ecclesia/-ae). Fortunately, the majority of texts in my corpus are standardized, though variance remains a problem to keep in mind. I expect even more issues of textual variance to crop up when I start working with Old English texts, since they are far from uniform in dialect, even when presented in edited, modern, standardized forms. Related to this obstacle, I have yet to come across a satisfactory solution to variance in constructing an Old English stopwords list (see here for some discussion). Surely tackling such issues is a place for some innovation in coding.