Instant History

These are my notes on Instant History, The Postwar Digital Humanities and Their Legacies: A Day Conference. The conference looked at Father Busa and his legacy.

Note these were mostly written live so they will be full of typos and incomplete thoughts. That's what you get for conference notes.

Geoffrey Rockwell: Tremendous Labour: Busa's Methods

I gave the first talk and talked about the reconstruction of Busa's Index project. I claimed that Busa and Tasman made two crucial innovations. The first was figuring out how to represent data on punched cards so that it could be processed (the data structures). The second was figuring out how to use the punched card machines at hand to tokenize unstructured text. I walked through what we know and talked about our attempts to replicate the key methods:

Simple Punched Card Emulator: https://cdn.rawgit.com/sgsinclair/epistemologica/master/punchcard.html
Busa Operations: http://nbviewer.jupyter.org/github/sgsinclair/epistemologica/blob/master/PunchcardOperations.ipynb
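
To make the two innovations concrete, here is a minimal Python sketch of the general idea: one record per word (standing in for a word card) carrying a reference back to its source line, followed by a sort. This is an illustration only, not a reconstruction of Busa and Tasman's card layout or machine operations; the `WordCard` fields, the function names, and the sample lines are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordCard:
    """Hypothetical stand-in for one punched word card."""
    word: str      # the token itself, lowercased
    line_no: int   # which source line (text card) the word came from
    position: int  # ordinal position of the word within that line


def punch_word_cards(lines):
    """Tokenize each line and emit one WordCard per word."""
    cards = []
    for line_no, line in enumerate(lines, start=1):
        for position, raw in enumerate(line.split(), start=1):
            word = raw.strip(".,;:!?").lower()
            if word:  # skip tokens that were only punctuation
                cards.append(WordCard(word, line_no, position))
    return cards


def sort_cards(cards):
    """Order the cards alphabetically, then by location, like a card sorter."""
    return sorted(cards, key=lambda c: (c.word, c.line_no, c.position))


lines = ["In principio erat Verbum,", "et Verbum erat apud Deum."]
index = sort_cards(punch_word_cards(lines))
for card in index:
    print(card.word, card.line_no, card.position)
```

The point of the sketch is that once every word is its own record with a location, building an index is just sorting and grouping records.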

The respondents (Kyle Roberts and Shlomo Argamon) pointed to some of the interesting contextual issues:

We need to pay attention to the Jesuit and spiritual dimensions of Busa's work
We need to think about the dialectic of those critical of computing and those optimistic about it

Thanks to the Library of the Catholic University of the Sacred Heart, Milan for access to and permission to show materials from the Busa Archives.

Steve Jones: Reverse Engineering the First Humanities Computing Centre

Steve gave a great talk that started with Busa's CAAL (Centre of Automatic Analysis of Language/Literature) in Gallarate. He started with the physical space and showed photos he took when he tried to find the ex-textile factory which has now been renovated. He talked about the space being tied to machinery, to Jesuit culture, to the operators, and philology. He showed images drawn from the Busa Archives. He did a great job of spinning things out from the space including:

He talked about how Busa raised money for the CAAL and links to local business people. We all know about the IBM support, but we don't know about the local network he tapped into.
He talked about the connection with atomic research and cold war funding. Busa connected a group in Georgetown to IBM who developed systems for translating Russian. He also connected with Euratom near Gallarate. Busa was adept at connecting his needs to the agendas of others. I think more needs to be done to track how he wove his agenda into that of other currents.
He talked about the machines in the space and showed images of them. Along with that he talked about the other uses of the space including Busa's office space which also seemed to double as a conference space.

Steve has a web site for his book Roberto Busa S.J. and the Emergence of Humanities Computing with some of the images.

Ted Underwood: Genealogies of Distance

Ted started by talking about how this is the right time to do the work of the history of digital humanities. We have different views of the history, but we are

No one doing "distant reading" has been doing work on the history. And that is what Ted proposed to do. A broadly historical perspective is central, computers are not.

He talked about Amy Earhart and her reading of the history of textual editing. Lauren Klein and Matt Gold also talk about scale as an important recent theme in DH. For Ted scale and distance have been there since the beginning of literary history, before computers.

What is new in the last 40 years is an approach to history based on consciously constructed samples (rather than useful quotes). Literary scholars are starting to have a discussion with quantitative social scientists about sampling and methods. Before, literary scholars would borrow ideas from sociologists, but not methods. The exception was linguistics and stylistics, but even then they worked with smaller collections.

Reading the Romance by Janice A. Radway was an important/influential book that used social science methods to challenge the view that popular literature simply transmitted gender norms. Radway argued that you have to look at the consumption of books (ethnography) and then you see that communities make what they want of books. She used questionnaires, interviews, and quantitative analysis. These methods are not complex; they rely mostly on counting. She did content analysis using coding schemes similar to what is used to study mass media.

Then he shifted to talk about Franco Moretti and how he used similar methods in "Slaughterhouse of Literature" (2000). He called this distant reading - a term that doesn't say anything about computing. The social reality is that those interested in distant reading have come mostly from computing. In effect, a structuralist thread merged with sociology of literature which then appealed to and merged with humanities computing (corpus linguistics.)

Content analysis is often binary. Newer methods can bring in subtleties of probabilistic methods or perspectival methods.

He asked why we are now so focused on computing when the history is so rich. People invent terms or use terms that show their role or their innovation. To make it sound like scale is new is misleading when in fact it has been around. He wants to redress things and wants to emphasize sociological methods.

Social science methods by themselves don't need computers (as they run on surrogates or models). Ted talked about the shift from summary to scene - different relationships between reading time and story time. He described a project where they read random passages in a number of authors and tagged them according to time. The different readers mostly agreed about time getting shorter in the 18th century. Interestingly best sellers led the trend in lots of prose narrating small amounts of time, not Proust or Joyce.

So how could we miss the trend which is out in the open? There is a lot of variation. There are other ways we talk about time as in liking action scenes (which take place in little time.) Ted thinks we must not have known what we knew.

Grasp patterns that are not visible at the scale of reading.

Computational methods alone detached from broad social samples haven't contributed to literary history. Mark Olsen in "Signs, Symbols and Discourses" argued that the methods hadn't really made much of a difference in the 1990s. Ted conceded that this could have been his personal story.

The one-sided emphasis on technical genealogies is creating an unproductive conversation in digital humanities. The big-tent approach may do a disservice as it seems to tie the distant readers to computers. He feels the debate about numbers is unproductive as distant reading really has to do with the social sciences.

Respondents

We then had respondents. The first, Cornelius (?), talked about questionnaires and how they have been used. These have been used in all sorts of ways in libraries and the history of books. Distant reading establishes its own interpretative context. The second respondent, Lynda (?), started by noting how there have been predictions about the demise of the humanities. She asked about the new methods of distant reading and how they are taught. Do we go from traditional to distant methods? The third respondent wanted to hear how computing methods would be integrated into literary historical education. The final respondent asked if the digital hasn't changed the text itself.

Ted admitted that computers add a lot and that in this paper he is trying to rebalance.

Laura Mandell: What Can you do with "Dirty OCR"?: Digital Literary History Beyond the Canon

Laura gave some background on the project and how she is trying to figure out what gets lost and found in dirty OCR. She talked about how we need to try to understand past meanings in their own terms. We need to reconstruct the reasonableness of past ideas rather than just say people are wrong. Kuhn does this with Aristotle. We ask what sorts of culturally bound assumptions must have held for past opinions to have made sense. Now, can we do this with big data? Laura insists that the answer is yes.

Scholars often use the OED to understand the evolution of words. Now we have big data methods.

TextDNA is a tool that analyzes text as a sequence to find when words were first used
Bookworm and Artemis claim to let us study word usage over time - Laura doesn't think they do

Bad data is overwhelming digital tools and rendering them useless. Massive amounts of faulty info could bury the truth. With OCR we have all sorts of problems from the long S to alternative spellings. The quality of the originals, the quality of the surrogates scanned, the OCR methods all introduce noise.

Gale ran multiple OCR algorithms and then voted on the word. Gale's OCR is very good. eMOP OCRed the same stuff as Gale, as did Google. eMOP is not as correct as Gale but in different ways. Google also used Tesseract (as did eMOP), but trained very differently. Google ngram viewer isn't searching books where the OCR was poor. They exclude all sorts of pre-1750 books.
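
The idea of voting on each word can be sketched roughly as follows. This is a hypothetical illustration, not Gale's actual pipeline: it assumes the engines' outputs have already been aligned word by word (in practice alignment is the hard part), and the function name, tie-breaking rule, and sample readings are all made up for the example.

```python
from collections import Counter


def vote_on_words(readings):
    """Pick, for each word position, the reading most engines agree on.

    readings: list of token lists, one per OCR engine, assumed aligned
    so that position i in every list refers to the same printed word.
    """
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    voted = []
    for candidates in zip(*readings):
        counts = Counter(candidates)
        best = max(counts.values())
        # On a tie, keep the reading from the earliest engine in the list.
        winner = next(w for w in candidates if counts[w] == best)
        voted.append(winner)
    return voted


# Made-up readings of the same line from three hypothetical engines.
engine_a = ["the", "paffions", "of", "the", "mind"]
engine_b = ["the", "passions", "of", "tbe", "mind"]
engine_c = ["tlie", "passions", "of", "the", "mind"]

result = vote_on_words([engine_a, engine_b, engine_c])
print(" ".join(result))
```

Voting only helps where the engines make different mistakes; if they all misread the long s the same way, the error survives the vote.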

Laura wanted to figure out how bad Gale and Google were. To do this she needed to find all the things not properly OCRed, but you can compare OCR results. They could also search with fuzziness, using all sorts of spelling variants based on OCR errors. This way you can see the true negatives. The design of research results makes a difference to the humans interpreting results.
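
Searching with spelling variants based on OCR errors might look something like the sketch below. It is only an assumption about the general approach, not the project's code: the `CONFUSIONS` table (here just the long s read as f) and the function names are hypothetical, and a real table would be derived from observed OCR errors.

```python
from itertools import product

# Hypothetical confusion table: each character maps to the strings an OCR
# engine might output for it. Only the long-s/f confusion is modelled here.
CONFUSIONS = {"s": ("s", "f")}


def ocr_variants(word, confusions=CONFUSIONS):
    """Return every spelling of `word` obtainable via the confusion table."""
    options = [confusions.get(ch, (ch,)) for ch in word]
    return {"".join(combo) for combo in product(*options)}


def find_with_variants(word, tokens):
    """Return positions in `tokens` matching `word` or any OCR variant."""
    variants = ocr_variants(word)
    return [i for i, tok in enumerate(tokens) if tok.lower() in variants]


# Made-up dirty OCR tokens.
tokens = "a remarkable circumftance of the cafe".split()
hits = find_with_variants("circumstance", tokens)
print(hits, [tokens[i] for i in hits])
```

Comparing the hits for the plain word against the hits for its variants gives a rough sense of how much a simple search misses. The variant sets also over-generate (for example, "sin" would match a genuine "fin"), so results still need checking.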

If simple searches can't find words, then text mining won't be able to.

Laura is interested in how 18th century thinkers think of information. It was more alive then. Information then had to do with a person informed. Information had to do with shaping of minds. Info wasn't passive in 18th century.

She talked about Ian Hacking's work on changes in how we view information vs testimony. Information has to be deadened for probability to emerge. Laura searched for ngrams that would indicate the shift in information. Gale only finds about half of what one is looking for. She looked for circumstance and remarkable information.

The new Google neural net engine is much better. (And we need access to it.)

The first respondent talked about how "dirty" sounded like labor she didn't want anything to do with. Dirty OCR is rendering the tools incompetent. Truth is reconfigured in these search engines. Laura has given us a way to explore the effects of dirty OCR. She talked about how we have to learn to question the data and engines before we can draw inferences.

A second respondent emphasized how important clean texts are. She mentioned the importance of middleware. We need to look at tools, interfaces, middleware, data, to understand the formation of knowledge.

A third respondent talked about reference books and how they have served as our search engines for centuries. She commented on how the large search engines have changed what we use as reference tools. We use the big data tools rather than human structured information in traditional resources. The traditional reference works work very well.

In the age of full text engines, what is the point of an index? Reference works aren't being used so they are being thrown out. Digitization has dramatically changed the infrastructure of literature.

In some cases the digital searching is actually based on print indexes, not the full text. Further, reference technologies like indexes have a history that can be valuable.

The final respondent found reassurance in how Laura has shown us how to work with dirty texts. She pointed out how we have original texts to check, but in many big data cases there is no original preserved.