Digital Humanities And Computer Science

These are notes on the DHCS colloquium at IIT in Chicago, November 15th and 16th.

"Without thought one can computer infinite nonsense."

An interesting feature of the talks was how the computer scientists apologized for not being humanists and vice versa. Everyone was aware that their talks crossed fields that most of us are not fully comfortable with.

Citation Detection and Textual Reuse on Ancient Greek

Marco Büchler presented on a joint project between classicists and computer scientists. They are trying to find n-grams of between 2 and 5 words that occur multiple times across the TLG corpus.
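
To give a rough sense of the kind of matching involved, here is a minimal sketch (not their actual pipeline) that indexes 2-5 word n-grams and reports those shared across texts; the toy corpus is invented for illustration.

```python
from collections import defaultdict

def ngrams(tokens, n_min=2, n_max=5):
    """Yield all word n-grams of length n_min..n_max from a token list."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def find_reuse(corpus):
    """Map each n-gram to the set of texts it occurs in; keep those shared by 2+ texts."""
    index = defaultdict(set)
    for text_id, text in corpus.items():
        for gram in ngrams(text.lower().split()):
            index[gram].add(text_id)
    return {gram: ids for gram, ids in index.items() if len(ids) > 1}

# Toy example (English stand-ins for Greek passages):
corpus = {
    "A": "know thyself and nothing in excess",
    "B": "the maxim know thyself appears at delphi",
}
print(find_reuse(corpus))  # ('know', 'thyself') is shared by A and B
```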

Their portal is at http://www.eaqua.net/index.php (in German)

Marco demonstrated a visualization that shows language evolution in passages that reappear. This is designed to help philologists look at differences. Then he showed an interesting visualization of a network of words over time. I'm not sure how it was generated.

"Plagiarism is to copy from one but to copy from two is research."

He showed a visualization of which parts of Plato's Timaeus are most cited in which period, which would let users see what interested readers of Plato at different times.

A problem for such citation mining is that multi-word expressions like "Alexander the Great" show up over and over, even though they aren't really citations or reuse.

He concluded with an interesting summary of what different stakeholders want: philologists want a micro view, historians want a macro view, and computer scientists are interested in algorithms.

Mapping Genre Space via Random Conjectures

Patrick Juola gave a great talk about the Conjecturator that he has developed. It is based on the automatic experimentation paradigm: an example is Adam, which makes minor tweaks to genes and generates experiments; another is Graffiti, which generates conjectures about graphs. If a conjecture passes tests then there is something for a mathematician to look at. The Conjecturator generates random hypotheses about a large text corpus. If a conjecture meets certain thresholds then it is passed to humanists for research.

He has a twitter feed: http://www.twitter.com/conjecturator

An example might be "(Male Authors) use (animal terms) more (or less) than (Female Authors)." The parts of the conjecture (in parentheses) can be randomly generated. These are tested against a large corpus where all the texts are tagged by gender and there is a thesaurus of categories like "animal terms".
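
As a very rough sketch of this generate-and-test loop (the categories, corpus, gender tags and threshold below are all invented for illustration, and a real Conjecturator would use a proper significance test rather than a raw difference):

```python
import random
from statistics import mean

# Hypothetical thesaurus categories and tagged corpus, for illustration only.
CATEGORIES = {"animal terms": {"dog", "horse", "bird"},
              "colour terms": {"red", "blue", "green"}}
CORPUS = [  # (author_gender, text)
    ("male", "the dog chased the red bird across the field"),
    ("female", "a blue horse stood by the green river"),
]

def rate(text, terms):
    """Proportion of tokens in a text that fall in a term category."""
    tokens = text.split()
    return sum(t in terms for t in tokens) / len(tokens)

def random_conjecture():
    """Randomly generate a 'group A uses category C more/less than group B' conjecture."""
    return ("male", "female", random.choice(list(CATEGORIES)))

def test(conjecture, threshold=0.01):
    g1, g2, category = conjecture
    terms = CATEGORIES[category]
    r1 = mean(rate(t, terms) for g, t in CORPUS if g == g1)
    r2 = mean(rate(t, terms) for g, t in CORPUS if g == g2)
    if abs(r1 - r2) > threshold:  # stand-in for a real significance test
        return f"{g1} authors use {category} {'more' if r1 > r2 else 'less'} than {g2} authors"

for _ in range(3):
    result = test(random_conjecture())
    if result:
        print(result)
```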

On a handout he had a graph of a number of genres (e.g. late_Victorian_novels) and used multidimensional scaling to see which genres are closest. He has about 8000 differences based on the thesaurus categories. He calculated the difference between every pair of genres and then mapped the differences. He talked about some of the conclusions one might draw from the map for further research.
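
Multidimensional scaling of this sort can be sketched as follows; the genre names and pairwise difference counts are made up for illustration, and scikit-learn's MDS stands in for whatever implementation Juola used.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical pairwise difference counts between four genres (symmetric, zero diagonal).
genres = ["late_Victorian_novels", "gothic_novels", "sermons", "travel_writing"]
differences = np.array([
    [0, 12, 55, 40],
    [12, 0, 50, 38],
    [55, 50, 0, 30],
    [40, 38, 30, 0],
])

# Project the genres into 2D so that distance on the map approximates the difference count.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(differences)
for genre, (x, y) in zip(genres, coords):
    print(f"{genre}: ({x:.2f}, {y:.2f})")
```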

One could imagine running this against the Google Book collection or other large scale corpora. Some questions he concluded with were:

  • What makes an interesting conjecture? Can we eliminate uninteresting ones?
  • What makes interesting differences?
  • How can this be improved?
  • What can we do with a pile of facts?
  • Does this "distant reading" assist research?

There were questions about how the genre categories were generated. There was also an interesting suggestion to generate random genre clusters, test them against each other, and see which random clustering works best (provides the clearest differences between clusters). Then one could ask what tests generated the differences.

On the Origin of Theories: The Semantic Analysis of Analogy in a Scientific Corpus

Devin Griffiths (Rutgers Centre for Cultural Analysis) talked about analogy and Darwin's The Origin of Species.

Analogy started in Greece as a way of talking about mathematical similarities. Aristotle broadened it to epistemological and hermeneutical uses. Nineteenth-century anatomists relied on it. It is a form of pattern recognition or probabilistic reasoning. Cognitive scientists talk about the deep role analogy plays in our experience.

Devin then moved on to Darwin. Darwin noticed that individuals of a domesticated species vary much more than those of wild species. There are many levels of analogy here.

Devin used MorphAdorner to tag The Origin of Species with part-of-speech information. He then searched for patterns of POS tags that might mark analogies. He used Latent Semantic Indexing (LSI) to find analogies using a control sentence, and then looked for sentences that matched both constraints (POS and LSI).
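
This is not his pipeline, but a sketch of the two-constraint idea: a crude surface pattern stands in for the POS filter, and TF-IDF plus truncated SVD stands in for LSI similarity to a control sentence. The sentences and the 0.5 threshold are invented for illustration.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# The control sentence is a known analogy; candidates are checked against both constraints.
control = "as natural selection acts on wild species so man selects among domestic breeds"
candidates = [
    "as the tree branches so the species diverge from a common stock",
    "the geological record is imperfect in many places",
    "as bees build cells so instinct shapes the structure of the comb",
]

def looks_like_analogy(sentence):
    """Crude surface pattern standing in for a POS-based filter: 'as ... so ...'."""
    return re.search(r"\bas\b.+\bso\b", sentence) is not None

# LSI: TF-IDF followed by truncated SVD, then cosine similarity to the control sentence.
docs = [control] + candidates
tfidf = TfidfVectorizer().fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
sims = cosine_similarity(lsi[:1], lsi[1:])[0]

for sentence, sim in zip(candidates, sims):
    if looks_like_analogy(sentence) and sim > 0.5:
        print(f"possible analogy ({sim:.2f}): {sentence}")
```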

His conjecture about what analogies do is that they are used to coordinate expressions or to expose inconsistencies (things that are not analogous). Thus analogies should show where Darwin builds his paradigm and argues against others; they are used to construct new paradigms. He showed a graph of the number of analogies across the chapters.

The Big See: Large Scale Visualization

Geoffrey Rockwell (me) and Garry Wong presented on the Big See project. We have been looking at the design constraints and opportunities offered by large-scale information displays (data walls, CAVEs, or other shared displays). We showed two visual ideas, The Big See (see the site for a literature review) and LAVA.

New Insights: Dynamic Timelines in Digital Humanities

Kurt Fendt (HyperStudio, MIT) presented on visualizations of timelines. He started with the idea that mining and visualization are methodologies in the humanities. He first showed the SIMILE Timeline, which is used widely. He then went back and showed pre-computer timelines, starting with Jacques Barbeu-Dubourg's Carte chronographique (1753). He showed Joseph Priestley's A New Chart of History (1769), an early history timeline and again a very large chart. He also showed an interesting Eugenics Diagram (1913) by Arthur Estabrook and Charles Davenport, which is circular rather than linear.

He then looked at web-based timelines, starting with the Valley of the Shadow. Christopher York has a typology of digital timeline uses. (See Digital Humanities Timelining Report, MIT, HyperStudio, 2009, PDF.) Timelines can be used for rhetorical purposes or constructivist ones (for note taking). He showed a number of museum history and art history timelines. Xtime has a social network timeline; timefo is another.

He talked about timelining as a collaborative activity where groups can author a line, so crowdsourcing can be used. He ended with some timelines they are building (one using SIMILE and another not) for representing US-Iran relations. Emergent Time lets people contribute timelines and then lets you see other timelines with overlapping time periods, so you can see different interpretations of important events. It is set up as a social network.

One thing that is clear is that timelines create an interpretation of time. All of the examples he showed were chronological representations of time. Very few have tried to do what Johanna Drucker attempted with the Temporal Modelling project: to show phenomenological time (the kinds of time that show up in novels, where events are anticipated and so on). It would be interesting to apply Heidegger's ideas about time to the issue.

HyperStudio has a blog entry on the timeline issues, http://hyperstudio.scripts.mit.edu/news/?p=46 , with some important references:

  1. Daniel Rosenberg, “The Trouble with Timelines,” Cabinet 12 (Spring 2004).
  2. Daniel Rosenberg and Sasha Archibald, “A Timeline of Timelines,” Cabinet 12 (Spring 2004).

National Endowment for the Humanities and the Office of Digital Humanities

Michael Hall of the NEH talked about the Digital Humanities initiatives at the NEH.

Humanities Viewed as Information Science

Vasant Honavar of the AI Research Lab at Iowa State gave the keynote after lunch. He started by acknowledging John Vincent Atanasoff. The running theme of his talk was computational thinking, i.e. thinking about things in a computational fashion, as in cognitive science. He argued that questions about truth and argument from Plato on influenced computer science. He mentioned Leibniz, Boole, Peirce and so on up to Turing. This struck me as a story of computing as a way of automating thinking (as opposed to extending it).

He talked about the Turing machine and the idea of effectively computable. "Anything that is describable is describable by a computer program!" Computing therefore offers a universal medium for representing information.

He then talked about descriptions: formal languages and grammars. He then talked about semantics, the processing of descriptions, and mentioned Tarski's work on the theory of reference. He then mentioned Shannon's information theory leading to Kolmogorov complexity.

It was interesting how he regards the humanities. He argued that the difference is that the humanities and sciences use different methods (speculative vs. empirical). I would counter that humanists don't think much about their methods. He pointed out that the methodological difference is a cartoon, because some sciences can't really be experimental (cosmology, since there is only one universe and you can't test it) and the humanities sometimes use empirical methods (authorship attribution).

He gave an interesting example of trying to study the Indus Valley script, which has yet to be decoded. Another example was trying to figure out how languages are related to each other using the different language versions of the Universal Declaration of Human Rights. Cluster analysis found most of the recognized language groups, but English seemed to be a Romance language close to Sardinian. He suggested that all these tools are united by looking at the information complexity of some string.
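
The compression-based intuition behind that last remark can be sketched with normalized compression distance, a standard practical stand-in for Kolmogorov complexity; the three snippets below are short stand-ins for the full Declaration texts.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximating information complexity with zlib."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# First lines of Article 1 in three languages (accents stripped; illustration only).
texts = {
    "english": b"All human beings are born free and equal in dignity and rights.",
    "french": b"Tous les etres humains naissent libres et egaux en dignite et en droits.",
    "spanish": b"Todos los seres humanos nacen libres e iguales en dignidad y derechos.",
}

# Lower distances suggest more closely related texts; a full study would cluster these.
for a in texts:
    for b in texts:
        if a < b:
            print(f"{a} vs {b}: {ncd(texts[a], texts[b]):.3f}")
```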

He closed by saying that we are all interested in descriptions. Is this true? Are humanists only interested in formalized representations that are therefore computable? Anyway, if that is one thing we are interested in, then computer science can help us. Honavar agreed that there are other computing traditions around extending the mind that are also applicable.

Poster and Demonstration Session

At the end of the day there was a poster session.

Monday, November 16th, 2009

Computational Phonostylistics: Computing the Sounds of Poetry

Marc Plamondon (Nipissing University) talked about the problem of analyzing poetic style. He proposes a theory of phonemic accumulation: accumulations of phonemes enhance their effect, so more plosives close together reinforce one another. He then graphs the accumulations of phonemic groups. He discussed how phonemes can interfere with each other, so one can take the graph of plosives and subtract the graph of fricatives. You can then talk about parts of a poem where one type of phonemic group dominates and correlate that with different rhetorical effects.

He uses Fourier transforms to smooth out the graphs. He also maps the phonemic sounds onto a two-dimensional layout of the poem, which is interesting: the visualization looks like coloured blobs wrapped around each other in the shape of the poem's text blocks. He then uses clustering algorithms to graph collections of poems.
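
A crude sketch of the accumulation-and-smoothing idea, using letters as stand-ins for phoneme classes (a real analysis like Plamondon's would work from phonemic transcription, not spelling), with an FFT low-pass filter for the smoothing:

```python
import numpy as np

# Letter-level stand-ins for phoneme classes; illustration only.
PLOSIVES = set("pbtdkg")
FRICATIVES = set("fvszh")

def density(text, phoneme_set, window=20):
    """Sliding-window density of characters from a phoneme class along the text."""
    hits = np.array([1.0 if ch in phoneme_set else 0.0 for ch in text.lower()])
    kernel = np.ones(window) / window
    return np.convolve(hits, kernel, mode="same")

def fft_smooth(signal, keep=10):
    """Low-pass smoothing: keep only the lowest `keep` Fourier coefficients."""
    spectrum = np.fft.rfft(signal)
    spectrum[keep:] = 0
    return np.fft.irfft(spectrum, n=len(signal))

poem = "Tyger Tyger, burning bright, In the forests of the night"
contrast = density(poem, PLOSIVES) - density(poem, FRICATIVES)  # plosives minus fricatives
print(fft_smooth(contrast).round(2))
```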

Ultimately he wants to graph poems from different periods and see how they cluster phonemically. He hopes that the clustering by phonostylistics will map onto the historical periods (with interesting exceptions). In questions he mentioned that he is interested in quantitative aesthetics: being able to quantify beauty.

I've heard Plamondon a number of times now and he has established a unique and very strong program of research into poetry through its phonostylistics.

Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound

C. W. Forstall (SUNY Buffalo) and W. J. Scheirer (U. of Colorado at Colorado Springs) looked at style and repetition. Like Plamondon, they were looking at character and phoneme-level effects rather than words. For their first tests they used a corpus of two English novelists and 11 poets and tested their classifiers. Then they talked about another case, the provenance of the Iliad and the Odyssey.

Character-level n-grams, they argued, work better than word n-grams, especially for inflected languages (like Greek).
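
A minimal sketch of character n-gram features of the sort they describe (the sample text and n-gram length are arbitrary choices for illustration):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Relative frequencies of character n-grams, a common authorship-attribution feature."""
    text = text.lower().replace(" ", "_")  # keep word boundaries visible in the grams
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

profile = char_ngrams("Sing, O goddess, the anger of Achilles")
print(sorted(profile.items(), key=lambda kv: -kv[1])[:5])  # five most frequent trigrams
```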

They showed interesting plots of the n-grams from the books of their corpus. The poetic works (Iliad and Odyssey) had much broader plots.

What Can Be Made Computable in the Humanities?

Stephen Wolfram gave a keynote by videoconference. His mother was a philosopher at Oxford and he joked that he swore he would never go into the humanities. He talked about three projects: Mathematica, A New Kind of Science, and Wolfram Alpha.

Mathematica is, for Wolfram, a tool for investigating questions that use formal representations. He talked about A New Kind of Science and how we model processes: what are the underlying theoretical rules, and why do they follow the constructs of mathematics? With computation we have a paradigm for thinking about possible rules. What if we think of programs in the wild as systems that describe the world? He talked about cellular automata and showed how changing rules can change output. We used to think that to get something complicated you need complicated rules; Wolfram thinks you can get complexity from simple rules.
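
A minimal elementary cellular automaton makes the point concrete; this uses Rule 30, one of Wolfram's standard examples of complexity arising from a simple rule (the grid width and number of steps are arbitrary).

```python
def step(cells, rule=30):
    """One update of an elementary cellular automaton with the given Wolfram rule number."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] * 4 + cells[i] * 2 + cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Start from a single black cell; Rule 30 produces complex, seemingly random rows.
cells = [0] * 31
cells[15] = 1
for _ in range(15):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```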

One of the things that arises from the study of processes is that there are other forms of modeling than traditional scientific mathematical modeling. He also talked about scoring models - those that give a lot for very little input score high.

Models in science are supposed to be predictive; they provide computational reducibility. When we try to predict the results of a computation we are in competition with the computation. Can we predict without just running a system? Wolfram feels that, as fancy as we are, computational irreducibility means that ultimately we can't predict some systems without just running them. He connected this with determinism and free will. Even if we know a process is determined, in many cases we cannot predict its course without running it. Something may be determined but its future is unpredictable, and therefore it is perceived as having free will.

A fundamental study of rules teaches about modeling in different disciplines. You find in simple rules the algorithms of other disciplines. You can mine the computational universe to find processes (rule sets) to use for particular purposes. He showed examples from WolframTones where you can sample the diversity of behaviour to explore musical form.

For Wolfram it is useful to ask to what extent it is possible to bring everything together, represent it, and make it computable. The result is Wolfram Alpha. The idea is to make as much of the world's knowledge computable as possible. Certain types of information are sufficiently available and well curated that Alpha can compute answers to queries; e.g. "what's the weather in chicago?" generates answers based on web resources. He showed a number of queries that work in Alpha. A lot of it depends on well-curated data being available. Alpha also depends on the encoding of all sorts of models, rules and rules of thumb. They are now putting in more historical data, but it still doesn't handle humanities questions that well.

Wolfram then showed Mathematica and how you can use it for interactive exploration. For Mathematica they have done a lot of work on computational aesthetics, encoding heuristics for how to show answers. It is interesting how this works in Alpha: what they choose to show and how they choose to display it.

The language processing problem of Alpha is different from the normal AI one. They are trying to parse very short passages (queries) and then do something interesting with curated data. The short passages people type are not Google keywords, nor are they full human texts. Wolfram speculated that they could be close to the pidgin that people might think in.

With Alpha they can expose all sorts of forms of computation. Anyone can try to poke at it. The challenge for them is to start trying to make it support humanities inquiry. To do that, the question will be whether we can model humanities phenomena formally or whether there are limits. The problems of the humanities offer challenges to computation. They raise questions about what's knowable.

In my own brief testing of Alpha I found it didn't perform that well even on knowable questions like "French for computer". Asking for the "meaning of truth in Plato" generated information about the craters Truth (on Venus) and Plato (on the Moon).

The way they get data is to try to go back to primary sources, which they then try to validate. They use secondary sources, like Wikipedia, for folk information and for fame indexes (which item is the most popular).

They currently have a 25% fall-through rate of queries they don't answer.

Who's Who in Your Digital Collection? Developing a Tool for Name Disambiguation and Identity Resolution

Jean Godby (OCLC) and Patricia Hswe (UIUC) (with Judith Klavans (UMD), Hyoungtae Cho (UMD), Dan Roth (UIUC), Lev Ratinov (UIUC), and Larry Jackson (UIUC)) presented on disambiguation of names. They summarized the challenges of named entity recognition and are hoping to share an open source tool. See Extracting Metadata for Preservation (EMP) Project: http://www.oclc.org/research/activities/nameextract/default.htm .

One thing they have done is to compare their tool, which is based on Dan Roth's work at the Cognitive Computation Group, against other approaches.

Discovering Latent Relations of Concepts by Graph Mining Approaches

Marco Büchler (University of Leipzig) presented on graphs. He talked about the different types used in the humanities, from co-occurrence graphs and citation graphs to social network graphs.
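
One common way to build the co-occurrence graphs he mentioned can be sketched with networkx; the sentences and the little stopword list are invented for illustration.

```python
from itertools import combinations
import networkx as nx

sentences = [
    "socrates questions the poets about virtue",
    "the poets speak of virtue and the gods",
    "socrates speaks of the gods",
]

# Words are nodes; an edge's weight counts the sentences in which two words co-occur.
G = nx.Graph()
for sentence in sentences:
    words = set(sentence.split()) - {"the", "of", "and", "about"}
    for a, b in combinations(sorted(words), 2):
        weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
        G.add_edge(a, b, weight=weight)

for a, b, data in G.edges(data=True):
    print(a, "--", b, data["weight"])
```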

I was running out of steam at this point so my notes don't reflect the last papers.

Round-Table Panel on Computer Science and the Humanities

We ended the conference with a panel on collaboration and research between computer science and the digital humanities. I was on the panel and presented a series of quotes and theses on collaboration:

Artists see science; they don’t understand it; they think it is brilliant. Scientists see art; they don’t understand it; they think it is dumb. (Beyond Productivity, p. 52)
  • Collaboration has to be built on respect, which is not politeness. Respect can come from dialogue, which involves listening to the other as they choose to present themselves.
The marked contrast between compensation levels for computer scientists and for artists, other things being equal, is significant for the intersection between IT and the arts inasmuch as it affects collaboration and education. Across organizations, and even departments in a university, compensation levels affect patterns of time use, expectations for research and for infrastructure, and so on. (Beyond Productivity, p. 53)
  • Collaboration is political. It is no easier than any other gathering and it is rarely between equals.
  • Collaboration is easier when supported externally. What entities on campus support interdisciplinary ventures? In my experience it is administrators above chairs. Chairs try to defend the disciplines, deans try to encourage interdisciplinary activities.
the arts establishment sometimes regards technology suspiciously, as if it lacks a worthy lineage or is too practical to be creative. This attitude was evident in early committee discussions, coming out most strongly in contrasting perspectives on the potential for creative practices within industry. (Beyond Productivity, p. 53)
  • Collaboration takes time. What are you not going to do? How important is it to collaborate? Pick carefully and be prepared to invest over time.
How should the pursuit of knowledge be organized, given that under normal circumstances knowledge is pursued by many human beings, each working on a more or less well defined body of knowledge and each equipped with roughly the same imperfect cognitive capacities, albeit with varying degrees of access to one another's activities? (Fuller, "On Regulating What Is Known", p. 145).
  • Collaboration is easier where the stakeholders share interests and goals. Two areas that seem to me to have promise are visualization and game design.
  • Collaboration is the norm. The question is how to manage it.

General Thoughts

The End of Digital Humanities I can't help thinking (with just a little evidence) that the age of funding for digital humanities is coming to an end. Let me clarify this. My hunch is that the period when any reasonable digital humanities project seemed neat and innovative is coming to an end and that the funders are getting tired of more tool projects. I'm guessing that we will see a shift to funding content-driven projects that use digital methodologies. Thus digital humanities programs may disappear, with their projects shunted into content areas like philosophy, English literature and so on. Accompanying this is a shift to thinking of digital humanities as infrastructure that therefore isn't for research funding, but instead should be run as a service by professionals. This is the "stop reinventing the wheel" argument, and in some cases it is accompanied by coercive rhetoric to the effect that if you don't get on the infrastructure bandwagon and use standards then you will be left out (or not funded). I guess I am suggesting that we could be seeing a shift in what is considered legitimate research and what is considered closed and therefore ready for infrastructure. The tool project could be on the way out as research as it is moved, as a problem, into the domain of support (of infrastructure). Is this a bad thing? It certainly will be a good thing if it leads to robust and widely usable technology. But could it be a cyclical trend where today's research becomes tomorrow's infrastructure, only to be rediscovered later as a research problem all over again?

The standard disclaimer As noted above, an interesting feature of the talks was how the computer scientists apologized for not being humanists and vice versa. Everyone was aware that their talks crossed fields that most of us are not fully comfortable with.
