Dagstuhl on Computer Science and Digital Humanities
These are my notes on the Dagstuhl seminar on Computer Science and Digital Humanities. The Dagstuhl is a retreat centre for computer science. The timetable (behind a password) is at http://www.dagstuhl.de/wiki/index.php/TimeTable_CH and the slides are at http://www.dagstuhl.de/mat/index.en.phtml?14301
These notes are being written live so they are full of problems and so on.
Monday, July 21, 2014
Christiane D. Fellbaum (Princeton University) started us off. In Germany they don't have the same division between the humanities and the sciences, so it is possible to imagine developing a common research agenda. The goal is to develop common research questions and directions. She ended with an interesting question about whether a probabilistic result would be interesting to humanists.
Chris Biemann (TU Darmstadt) led a discussion about failures and the gaps between Computing Science and the Humanities. A common complaint is that humanists just want the CS folk to be software engineers. The problem seemed to be stated in terms of a flow from CS to the humanities. I can't help wondering what CS wants to know from us humanists? Do they want to learn how to study themselves? Do they want to ask about the philosophy of their discipline?
Max Schich and then Crane argued that we can't really distinguish between CS and the humanities - we are back to the 18th century, when it is the questions that bring us together, not training.
Biemann ended with "The workshop will solidify Computational Humanities as a field of its own and identify the most promising directions for creating a common understanding."
We then had a presentation by one of the leaders of the paleography workshop that is running in parallel. He made some useful comments about the absence of a ground truth in paleography compared to computer science. What is it like to work in a field where interpretation is the ground?
Sven ? talked about the role of computational models and whether they would work in the humanities. He also asked if there are ethical issues in data science. He asked what we consider the humanities?
Bettina Berendt talked about what the computational humanities might be. She is working out what the humanities are. It could be the “sciences of the human” or it could be “informatica humanistica” in the sense of humane informatics. She asked whether “the more we know, the better?” If we have more data do we have more knowledge? How to be a knowledge scientist after the Snowden revelations. She is also setting up a MA programme in DH.
Jana Diesner does cool work on ConText or "a) the construction of network data from natural language text data, a process also known as relation extraction and b) the joint analysis of text data and network data."
Anette Frank is doing work on computational analysis of narratives? They are doing alignment across comparable discourses. She is leading a working group on Computational Linguistics and Applied Linguistics in CLARIN-D. They are trying to do computational linguistics for digital humanities. In CL the methodology is much more important than the object of study. In the humanities the text is important.
Cathleen Kantner talked about her work on text analysis in the social sciences. Her work has developed around large collections of newspaper articles and diachronic analysis. She talked about some interesting analyses of how wars are talked about.
Kurt Gaertner talked about a dream of a dictionary network that would bring together different dictionary projects related to Germanic languages.
Szymon Rusinkiewicz talked about projects that are dealing with material culture. They are trying to help archaeologists with matching fragments of frescos from Thera.
Scharloth is interested in what language use reveals about society. He also does research ethics for the eHumanities. He framed a great question: “when is a statistical model an explanation?” And he was interested in the ethical issues of big data.
Tuesday, July 22nd
Alexander Mehler and Andy Lücking: On Covering the Gap between Computation and Humanities
He mentioned an article by Amancio et al., "Identification of literary movements using complex networks to represent texts". Mehler began with the issue of going beyond exploratory text mining. He presented a generic procedure that runs from hypothesis formation through model selection to forms of evaluation of models.
He talked about a semiotic / sign-theoretical gap. Data science operates on strings without assuming that they are signs. The humanities deals with signs or meaning. Mehler then showed how you can get very different networks from the same document by choosing different pre-processing and so on.
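Mehler's point about pre-processing can be sketched minimally (the toy sentence and function names here are my own): the same text yields different co-occurrence networks depending on choices like case-folding and stopword removal.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, window=3, lowercase=True, stopwords=frozenset()):
    """Weighted co-occurrence edges from a sliding window over tokens."""
    tokens = text.lower().split() if lowercase else text.split()
    tokens = [t.strip(".,") for t in tokens if t.lower() not in stopwords]
    edges = Counter()
    for i in range(len(tokens) - window + 1):
        for a, b in combinations(sorted(set(tokens[i:i + window])), 2):
            edges[(a, b)] += 1
    return edges

text = "The king spoke. The king heard the people and the people heard the king."
raw = cooccurrence_edges(text, lowercase=False)
cleaned = cooccurrence_edges(text, stopwords={"the", "and"})
# Different pre-processing gives different node and edge sets for the same text.
print(sorted(raw)[:3], sorted(cleaned)[:3])
```

Even these two small parameter choices change which nodes and edges the network contains, which is exactly the sensitivity Mehler demonstrated.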
Andy talked about an empirical approach. If CH is an empirical science (Poser 2012) then you have a stable theory, experimental laws, observational statements.
I couldn't help thinking that we were hearing another theory of how the humanities could be made into a science. The humanist responds that there is a history of attempts to make the humanities empirical.
The real power of the empirical sciences is deductive-nomological explanation - as I understand it, explaining an event by subsuming it under general laws together with antecedent conditions.
How would this connect to Moretti's discussion of the difference between explanation and interpretation?
He gave an example of the defenestration of Prague. What is the observable event? Does the notion of "general law" hold? How far back into the past do the antecedent conditions reach?
He then mentioned the idea of idiographic and nomothetic. The humanities deals with singularities that serve understanding, but don't lead to general laws. By contrast CH methods are nomothetic, since they are designed to detect something general.
Andy summarized by listing a number of gaps:
They proposed some ways forward:
Alexander Mehler made an important point about how many people are using data methods without looking closely at them. Minor parametric changes can have major change in the results.
Max made an interesting point about "event disciplines" and "law disciplines". He argued that the idiographic and nomothetic distinction doesn't make sense any more - we all have too much data. I'm not sure that is the point of the idiographic: it is both the discipline of unique events and the discipline for unique events. An interpretation of a novel is often about what is unique about the novel, but it is also about making the novel unique.
I argued that we have to have a meta-theory about the difference between science and the humanities. This is another gap - the understanding of what the humanities are from within the humanities is different from the view from the outside. Within the humanities we have a dialogical view of what we are doing: our research is always an intervention in a conversation, and the rhetoric matters. We have very different rhetorical traditions, right down to what we call our research.
Listening to this and other talks leads me to the following hypothesis about what scientists think the humanities are:
Put this way, there are lots of problems with this characterization:
Gerhard Heyer: Challenges for Computational Humanities
Heyer started by talking about the research process:
Then the question is whether we can just take computer science off the shelf. The case is different if we need innovative science. Heyer suggested that we have the eHumanities, which bring together the Computational Humanities and the Digital Humanities, and which together are doing something new. He had a nice Venn diagram. CH is to computer science what DH is to the humanities.
Heyer discussed what it would mean for Computational Humanities to be a new discipline. We have to develop new methods. We have big data issues. We have issues around research infrastructures.
He talked about salience and "weak signals". We often want, in the humanities, the weak signals or the anomalies. He showed a tool for tracking low frequency terms through the news.
The dangers of people applying technologies they don't understand were mentioned, along with the sorry story of quantitative history in the 1970s. When methods are poorly applied, a field gives up on them. This creates a catch-22: we can't use the methods well unless we understand them, and we can't understand them without using them. Perhaps it isn't so bad, but we do have a problem with colleagues who are not going to invest in methods unless they can see useful results on their materials.
Anette Frank: Linguistics, Machine Learning and Digital Humanities
Distributional semantics = a word is characterized by the company it keeps. One can build statistical models of semantics based on the distribution of words rather than … She walked through how we can do different things:
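As a toy illustration of the distributional idea (the tiny corpus and the window size are invented for the example), one can count each word's neighbours and then compare words by the cosine similarity of their context vectors:

```python
import math
from collections import Counter, defaultdict

def word_vectors(tokens, window=2):
    """Characterize each word by counts of the words around it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

tokens = ("the cat sat on the mat the dog sat on the rug "
          "the cat chased the dog").split()
vecs = word_vectors(tokens)
# 'cat' and 'dog' keep similar company, so their vectors are similar.
print(round(cosine(vecs["cat"], vecs["dog"]), 2))
```

Real distributional models work the same way in principle, just over millions of tokens with weighting and dimensionality reduction on top.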
We then went back to short talks.
Max Schich started with a point about how humanists and technologists should work together. Sort of like pair programming. He argued that we need to study culture the way biology studies life. He wants to do cultural science. He argued it wasn't a matter of a scrum, but of a cloud. You can't engineer the solution because we don't know what the solution is.
Susan Schreibman talked about doing digital editions. Now she is working on a project "Downstream for the Digital Humanities" which tries to track discussions of certain issues in DH.
David Smith is at Northeastern University's NULab for Texts, Maps, and Networks. He is in computer science. He is looking at documents and trying to find representations from which to map a relational structure. He is working on machine learning that is based on what people actually need to do and doesn't depend on massive tagging and POS annotation. He applies this to tracking poetry through newspapers and to tracking policy ideas.
U. Schmid is at the University of Bamberg, where they have a BA and MA in applied computer science. She talked about white-box learning. In black-box learning you don't know how the machine learned; in white-box learning you build a system that learns and can tell you what and how it learned. They build both black and white boxes and try to see what is lost in the white box. She wants to see inductive functional programming.
Caroline Sporleder is at Trier and raised questions about what the research goals are in the humanities. What aspects of humanities research can be formalised or automated? How do we ensure the trustworthiness of results?
Manfred Thaller talked about the Clio system that he has been extending to support historians. The idea is to insert software between the historian and their sources. He is concerned with the way we have to keep reinventing the same wheels.
Claire Warwick talked about how to link humanists and computer scientists. She is interested in user centred design and community engagement.
Katharina Zweig does graph theory. She analyzes big networked data. She watches people playing games to generate networks of paths. She has found that there is no established way to discuss successful paths.
We then had our working groups and a meeting where the groups shared ideas. In my group we practiced a close reading of the CSEC slides. We used a pad to keep notes at: https://pad.systemli.org/p/impSoc5
Some of points that we came up with include:
Chris Biemann: Machine Learning for Uncomputer Scientists
Chris' talk was just at the right level for me and he tolerated my random questions. It covered:
General concepts of Machine Learning: Machines don't learn the way we do. They can be taught very specific, narrowly defined tasks. They don't have motivations. For a machine, finding a giraffe and finding a non-giraffe are the same thing.
Features: Features characterize instances. A computer can decompose the instances provided to it to find features that are shared. Features can be complex. The secret in ML is in the features, not the algorithm. A feature-function takes an input, like a word, and produces an output, like "true" when the word matches.
Example: He gave the example of "Dagstuhl is located in Germany" and how different features could be applied to its words. A feature-function looking for a city or country would come back true for the first and last word. He gave a number of examples of cool things one can do with words. One can use pre-processing steps as functions that produce, for example, a lemma; the output of another ML system can itself become a feature-function. What matters is that the function is reliable - that it is deterministic.
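A sketch of such feature-functions in Python (the tiny gazetteer is a stand-in I made up; real systems use large lexical resources):

```python
# Toy gazetteer: a stand-in for a real list of places.
PLACES = {"Dagstuhl", "Germany", "Trier"}

def is_place(token):
    """Feature-function: True when the token names a known place."""
    return token.strip(".,") in PLACES

def is_capitalized(token):
    """Feature-function: True when the token starts with a capital letter."""
    return token[:1].isupper()

sentence = "Dagstuhl is located in Germany.".split()
features = [(w, is_place(w), is_capitalized(w)) for w in sentence]
# The place feature fires for the first and last word only.
print(features)
```

Each function is deterministic, which is the reliability Biemann asked for; a learner never sees the sentence itself, only the feature values these functions emit.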
He talked about distributional semantics - we can know a word by the company it keeps. For disambiguation, for example, one might want to look at the surrounding words. You can also characterize a data point by its contents (i.e. a document by its word frequencies - TF*IDF). This is used for documents. It is much easier to learn from vectors than from other types of representation.
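A minimal TF*IDF sketch (the three toy "documents" are invented for the example):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Represent each document as a sparse TF*IDF vector over its words."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [d.split() for d in (
    "the fall of the wall",
    "the defenestration of prague",
    "the machine learning of features",
)]
vecs = tf_idf(docs)
# 'the' and 'of' occur in every document, so their weights are zero;
# distinctive words like 'wall' keep positive weight.
print(vecs[0])
```

The resulting vectors are exactly the kind of representation that is "much easier to learn from": every document becomes a point in the same space, comparable by standard distance measures.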
Machine Learning Algorithms: A ML algorithm needs to be told the features. When learning a classifier, every feature introduces a parameter that has to be learned, and the more parameters must be estimated, the more training data we need to estimate them reliably. There are also other problems, like complex features or sparse features (features that are almost never observed, even if they are great). Then there are features that are dependent on one another.
A neat thing is that one can also get the computer to pick or generate features. In feature selection you have too many features and the machine selects those that do the heavy lifting. Or you can induce features by randomly generating them and genetically selecting those that work. This struck me as an interesting direction.
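A minimal filter-style sketch of feature selection (the feature names and labels are invented; real systems score features with measures like information gain or chi-square, or search genetically as described):

```python
from collections import Counter

def select_features(instances, labels, k=2):
    """Keep the k features whose presence differs most between two classes."""
    pos, neg = Counter(), Counter()
    for feats, label in zip(instances, labels):
        (pos if label else neg).update(feats)
    score = {f: abs(pos[f] - neg[f]) for f in set(pos) | set(neg)}
    return sorted(score, key=score.get, reverse=True)[:k]

# Toy instances: sets of binary features, with class labels 1 and 0.
instances = [{"ends_in_ly", "short"}, {"ends_in_ly"}, {"short"}, {"capitalized"}]
labels = [1, 1, 0, 0]
# 'ends_in_ly' separates the classes; 'short' occurs equally on both sides.
print(select_features(instances, labels))
```

The machine keeps the features that do the heavy lifting and discards the ones that carry no signal, without a human deciding in advance which is which.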
He has a cool annotation tool for training called WebAnno: http://www.lt.informatik.tu-darmstadt.de/de/software/
There is no task that a computer or a human can solve error-free. Before we complain about ML we should realize that we aren't that reliable either. It was also pointed out that psycholinguists are discovering that a lot of children's language learning seems to be unsupervised learning (with motivation).
What is ML bad at:
What ML is good at (with training) includes:
In general ML is good at capturing the ordinary. It shows you what you know.
Where do decision trees come in?
Wednesday: July 23
Greg Crane: Wissenschaft, Humanism and the potential of Computational Humanities
Greg Crane started his talk by hoping that he wouldn't say anything new, but framing it differently.
Geisteswissenschaften - in German this covers all sorts of activities without separating out the humanities. Administratively, the big German granting agencies make a point of funding the humanities. They are also willing to fund fields across the humanities and sciences. He argued that Germany is, for this reason, the best place to fund things. (I would argue that Canada is also great, but that is another issue - he is specifically comparing things to the US.)
He talked about how STEM deprioritizes the arts and humanities. Even in Germany they have MINT (the German analogue of STEM), which deprioritizes them in the same way.
He argued that "the Humanities exist only insofar as they are internalized in human minds." By this I think he is suggesting that the humanities are not just about human expression but are also sustained or defined by people. There would be no humanities without people, while scientific laws should continue without people to think them. We need public engagement for humanism to thrive.
One of the big problems we deal with in the humanities is "What do you do with imperfect information?" He talked about Tiananmen Square and the fall of the Berlin Wall. We have imperfect information about Tiananmen (June 1989) and perfect information about the Leipzig demonstrations that led to the collapse (October 1989). The DDR folk knew what had happened in China and how the Chinese stopped it. (See Mary Sarotte, "The Collapse: The Accidental Opening of the Berlin Wall".)
To some extent knowing everything about what happened in Leipzig doesn't necessarily help. Crane argued that the fall is the end of a process that starts in Trier in the 4th century with the first Christians martyring other Christians. One of the reasons we have the humanities is to mediate a dialectic with the past. Successful countries reinterpret the past or use the past to understand their choices. It is "To remember, to critique, to praise, to blame and (insofar it is possible) to redeem."
What are the limits of information? What would it mean to have powerful formal models representing culture if it was only available to specialists? Again, I think Greg is leading to the importance of public engagement. We need not only information, but "augmented consciousness". He wants to develop people's understanding of human actions and expressions. One way of thinking of our mission is to let people look at a manuscript and help them understand it.
I wonder if we are seeing two ways forward for the humanities. One is to become a science and one is to become a public. Computational humanities, as described by Alex and Andy, would develop a scientific humanities. Greg's view is to crowdsource the humanities. I suspect they are going to be reconciled by the end of the week.
Crane talked about Edward Everett and his talk justifying European colonialism. Everett is what happens when you don't have humanistic criticism. Likewise, where were the humanists during WWI? Humanists and classicists were part of the problem, teaching about the
He then asked us "What do you imagine for 2019 or 2114? What values do you bring? Are our institutionalized humanities advancing national or nationalist identity? Does your work advance an interest group? Do you have a state religion?"
Humanist education requires:
Then there are things to worry about like academic scholasticism where we are attached to our knowledge (and not interested in new ways of knowing) and where we are attached to virtuoso performance.
We need to model and explain the transactional costs of getting information. We need to drop the barriers to access and techniques like topic modeling need not outperform an expert to be useful.
We need to stop focussing on virtuoso performance that is inaccessible. He made an ethical argument about the professoriat. If we can't teach our research there is a problem. It is a scholastic fallacy that experts decide what is and is not important. We are all public servants and we need to maintain a public consensus about the value of our work. Our internal challenge is how to do more research, but there is dramatic change.
Don't expect really successful humanists to do really new work - if they are successful, why would they change?
Think in terms of what matters and fight for that!
I asked about how his vision of Computational Humanities matches the other visions I've heard. He argued that he is presenting the ends for which other visions are means and tactics.
Bettina asked a good question about the incentives of money. The money doesn't want citizen science or teaching kids - they want applied research that has economic impact. Criticism doesn't pay.
Manfred Thaller was critical of Crane's vision. He feels that the humanities lack a vision of what they are doing. The millions of classics students don't really drive things.
I can't help worrying about the connection between developing a critical public and crowdsourcing the humanities. Criticism is seen now as political and not public work (at least in North America.)
Excursion and Dinner
We had a nice excursion to Trier and dinner in a winery. Loretta Auvil gave me all sorts of good ideas about text mining tools:
Chris Biemann also pointed me to the neat tools his team has developed at http://www.lt.informatik.tu-darmstadt.de/de/demos/
Thursday, July 24th
David Mimno: Transparency
David Mimno talked about the need for different types of transparency. It is important that tools show their workings in different ways that can help humanists. They can expose their objectives, their implementations, the variables and so on.
I asked about how we can show the iterative processes. He talked about how topic modelling is intensely iterative as you edit your vocabulary. He went further and said that for humanists the iterative trimming of words is interesting as a way of thinking about the text, even if the results are not spectacular.
He showed three really neat examples starting with a simple bit of code that fits a line to generated points. The line is the model.
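Mimno's first example was along these lines (my reconstruction, with invented data): generate noisy points around a line and fit one back by ordinary least squares - the fitted line is the model.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points) /
             sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

random.seed(0)
# Points scattered around y = 2x + 1 with a little Gaussian noise.
points = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in range(20)]
slope, intercept = fit_line(points)
print(round(slope, 2), round(intercept, 2))
```

The transparency lies in how little is hidden: you can see the data generated, the objective (least squares), and every variable of the model.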
Szymon Rusinkiewicz: Reassembling the Thera Wall Paintings
Rusinkiewicz talked about the neat Thera project where they have lots of fresco fragments that they want to document and reassemble. They used different scanners - a 3D laser range scanner. They also used a flatbed in an innovative way to get a 3D scan. They created software that automated a pipeline. He talked about a ribbon matching process for finding fits and he showed an interface for trying to match fragments.
Brian Joseph: The Web of Language
He started by pointing out that language(s) drive the web which is why linguistics is so important. What could the web mean from a language perspective:
He reminded us of a seminar paper by Christiane Fellbaum on "What can Computational Linguistics (not) do for the Humanities?"
I would add that the web can also be a place of play and that there are hybrid places (if place is the right metaphor) which are both web and real.
Christopher Brown: Center for Digital Research in the Humanities
Christopher gave us a survey of digital humanities at Ohio State. They have discovery themes that include data analytics, and there is a lot of government funding in big-data analytics. There is an Advanced Computing Center for the Arts and Design. They have a Global History of Health project. There is an emphasis on e-learning - they are creating open learning materials and MOOCs.
We had a joint session with the other seminar on paleography and computer vision. Some issues that came up from the Paleography group:
The Reflection group discussed the CS/DH relationship. What are the roadblocks?
What might work?
Typology of projects:
The Mixed Bag group discussed:
I presented on behalf of the Impact group. I presented (among other things) an outline of a discussion of ethics and big data:
There seemed to be a lot of anxiety about the relationship between computer scientists and humanists. It is great to hear CS folk worrying about this. I think we (humanists) worried about it alone - now there is a dialogue.
Susan Schreibman: 1916 Project
Susan talked about various projects related to the upcoming centenary of the Easter Uprising of 1916 in Ireland.
Contested Spaces is part of a Mellon project trying to model how virtual worlds can be used for research. They therefore need to work in common with others in the consortium. They are trying to find light-weight tools for doing the work that others could use. They are modeling a block of Dublin where a battle took place so that questions can be asked about the event.
She sees herself as doing something like a digital edition where you have the model/edition and the evidence used to make it.
Another project is the Letters of 1916 project, which is trying to show what life was like in that year. Here they crowdsourced the project to get letters donated and encoded. Crowdsourcing changes how you do the project - among other things, you build the tool at the beginning.
She ended with some examples of social network analysis and text mining.
Greg Crane and Manfred Thaller Discussion
It was clear at the end of Greg's talk that there was a disagreement between Greg and Manfred so a discussion was organized.
Greg started by making a few points. He commented on how they might agree violently. His one central idea is that we are looking at a radically new and deeply traditional form of education that happens to be 200 years old. The German tradition is about instructor and student serving Wissenschaft, not each other. The American idea is one of class formation. He went further and argued that without classics the other humanities are exposed. He pointed out that if research is not integrated into pedagogy then people will ask about the need for research. He talked about how students in STEM are integrated into labs and get to do pseudoresearch. In the humanities they are told they have to wait until they have been stuffed with information.
This seems very different than the teaching critical thinking argument he presented last time. This argument is more about bringing undergrads into research.
Manfred agreed with the value of putting stuff up, but feels what Greg describes is a side-effect - a welcome one. Up to the 1960s the humanities provided an implicit vision that by looking at the past one could help the nation. The humanities used to study very small amounts of information - now we can look at much larger data sets. Someone should look into how changes in scale change different fields. We may have a Kuhnian paradigm change.
He hopes that the humanities can establish that they are those that understand what happens in 1 million books. The humanities will have a better future if we can explain a mission/vision that goes beyond the incomprehensible. He believes that a part of that vision should be that the humanities teach people not just to answer certain questions, but to ask new ones.
Tony Davies argues in Humanism that "All humanisms, until now, have been imperial. They speak of the human in the accents and the interests of a class, a sex, a race, a genome. Their embrace suffocates those whom it does not ignore." I wonder if their arguments are really for the humanities or for the ethical pursuit of knowledge. Do we need to save the humanities or ditch it for something better.
David argued something similar for CS - should we be trying to save CS or ditch it for something else.
Friday, July 25th
Last night I read the position paper of one of the organizers, Christiane Fellbaum. In the paper she says:
[A] political scientist colleague believes that we can help him to better understand the concepts of “self-determination” and “nationhood” and their changes throughout time and space by means of sophisticated textual analysis. I have not (yet) managed to persuade him that such lofty goals are beyond our computational wizardry and that we are not yet prepared to make meaningful contributions to the deep understanding of these difficult notions, which political theorists have struggled with for a long time.
While she is right that we can't automatically generate analyses of such concepts, I think this is the sort of ambitious challenge that we need to work towards, and her questions point the way. I also think the bar for interpreters is lower than for computer scientists. A tool for "better understanding" doesn't have to solve a problem or prove/disprove a hypothesis - it needs to suggest lines of inquiry. I would call this the hermeneutic challenge: how can computers help us better understand concepts in the process of thinking them through? I don't expect computing to end discussion; I want it to assist discussion. My sense of the questions we need to answer (based on Fellbaum's) is:
I later talked with Cathleen Kantner who pointed me to a neat project she is part of at the University of Stuttgart on tracking complex concepts through newspapers in different languages. The site for the project is here (in German). There is a paper in English that explains it titled Towards a Tool for Interactive Concept Building for Large Scale Analysis in the Humanities (PDF).
Page last modified on July 25, 2014, at 10:22 AM