Dagstuhl on Computer Science and Digital Humanities
These are my notes on the Dagstuhl seminar on Computer Science and Digital Humanities. The Dagstuhl is a retreat centre for computer science. The timetable (behind a password) is at http://www.dagstuhl.de/wiki/index.php/TimeTable_CH and the slides are at http://www.dagstuhl.de/mat/index.en.phtml?14301
These notes are being written live so they are full of problems and so on.
Monday, July 21, 2014
Christiane D. Fellbaum (Princeton University) started us off. In Germany they don't have the same division between the humanities and the sciences, so it is possible to imagine developing a common research agenda. The goal is to develop common research questions and directions. She ended with an interesting question about whether a probabilistic result would be interesting to humanists.
Chris Biemann (TU Darmstadt) led a discussion about failures and the gaps between Computing Science and the Humanities. A common complaint is that humanists just want the CS folk to be software engineers. The problem seemed to be stated in terms of a flow from CS to the humanities. I can't help wondering what CS wants to know from us humanists? Do they want to learn how to study themselves? Do they want to ask about the philosophy of their discipline?
Max Schich and then Crane argued that we can't really distinguish between CS and the humanities - we are back to the 18th century, when it is the questions that bring us together, not training.
Biemann ended with "The workshop will solidify Computational Humanities as a field of its own and identify the most promising directions for creating a common understanding."
We then had a presentation by one of the leaders of the paleography workshop that is running in parallel. He made some useful comments about the absence of a ground truth in paleography compared to computer science. What is it like to work in a field where interpretation is the ground?
Sven ? talked about the role of computational models and whether they would work in the humanities. He also asked if there are ethical issues in data science. He asked what we consider the humanities?
Bettina Berendt talked about what the computational humanities might be. She is working out what the humanities are. It could be the “sciences of the human” or it could be “informatica humanistica” in the sense of humane informatics. She asked whether “the more we know, the better?” If we have more data do we have more knowledge? How to be a knowledge scientist after the Snowden revelations. She is also setting up a MA programme in DH.
Jana Diesner does cool work on ConText or "a) the construction of network data from natural language text data, a process also known as relation extraction and b) the joint analysis of text data and network data."
Anette Frank is doing work on computational analysis of narratives? They are doing alignment across comparable discourses. She is leading a working group on Computational Linguistics and Applied Linguistics in CLARIN-D. They are trying to do computational linguistics for digital humanities. In CL the methodology is much more important than the object of study. In the humanities the text is important.
Cathleen Kantner talked about her work on text analysis in the social sciences. Her work has developed around large collections of newspaper articles and diachronic analysis. She talked about some interesting analyses of how wars are talked about.
Kurt Gaertner talked about a dream of a dictionary network that would bring together different dictionary projects related to Germanic languages.
Szymon Rusinkiewicz talked about projects that are dealing with material culture. They are trying to help archaeologists with matching fragments of frescos from Thera.
Scharloth is interested in what language use reveals about society. He also does research ethics for the eHumanities. He framed a great question: “when is a statistical model an explanation?” And he was interested in the ethical issues of big data.
Tuesday, July 22nd
Alexander Mehler and Andy Lücking: On Covering the Gap between Computation and Humanities
He mentioned an article by Amancio et al., "Identification of literary movements using complex networks to represent texts". Mehler began with the issue of going beyond exploratory text mining. He presented a generic procedure that runs from hypothesis formation through model selection to forms of evaluation of models.
He talked about a semiotic / sign-theoretical gap. Data science operates on strings without assuming that they are signs. The humanities deals with signs or meaning. Mehler then showed how you can get very different networks from the same document by choosing different pre-processing and so on.
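Mehler's point about pre-processing can be sketched minimally (the toy sentence and function names here are my own): the same text yields different co-occurrence networks depending on choices like case-folding and stopword removal.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, window=3, lowercase=True, stopwords=frozenset()):
    """Weighted co-occurrence edges from a sliding window over tokens."""
    tokens = text.lower().split() if lowercase else text.split()
    tokens = [t.strip(".,") for t in tokens if t.lower() not in stopwords]
    edges = Counter()
    for i in range(len(tokens) - window + 1):
        for a, b in combinations(sorted(set(tokens[i:i + window])), 2):
            edges[(a, b)] += 1
    return edges

text = "The king spoke. The king heard the people and the people heard the king."
raw = cooccurrence_edges(text, lowercase=False)
cleaned = cooccurrence_edges(text, stopwords={"the", "and"})
# Different pre-processing gives different node and edge sets for the same text.
print(sorted(raw)[:3], sorted(cleaned)[:3])
```

Even these two small parameter choices change which nodes and edges the network contains, which is exactly the sensitivity Mehler demonstrated.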
Andy talked about an empirical approach. If CH is an empirical science (Poser 2012) then you have a stable theory, experimental laws, observational statements.
I couldn't help thinking that we were hearing another theory of how the humanities could be made into a science. The humanist responds that there is a history of attempts to make the humanities empirical.
The real power of the empirical sciences is deductive-nomological explanation - as I understand it, explaining an event by subsuming it under general laws together with antecedent conditions.
How would this connect to Moretti's discussion of the difference between explanation and interpretation?
He gave an example of the defenestration of Prague. What is the observable event? Does the notion of "general law" hold? How far back into the past do the antecedent conditions reach?
He then mentioned the idea of idiographic and nomothetic. The humanities deals with singularities that serve understanding, but don't lead to general laws. By contrast CH methods are nomothetic, since they are designed to detect something general.
Andy summarized by listing a number of gaps:
They proposed some ways forward:
Alexander Mehler made an important point about how many people are using data methods without looking closely at them. Minor parametric changes can have major change in the results.
Max made an interesting point about "event disciplines" and "law disciplines". He argued that the idiographic and nomothetic distinction doesn't make sense any more - we all have too much data. I'm not sure that is the point of the idiographic: it is both the discipline of unique events and the discipline for unique events. An interpretation of a novel is often about what is unique about the novel, but it is also about making the novel unique.
I argued that we have to have a meta-theory about the difference between science and the humanities. This is another gap - the understanding of what the humanities are from within the humanities is different from the view from the outside. Within the humanities we have a dialogical view of what we are doing: our research is always an intervention in a conversation, and the rhetoric matters. We have very different rhetorical traditions, right down to what we call our research.
Listening to this and other talks leads me to the following hypothesis about what scientists think the humanities are:
Put this way, there are lots of problems with this characterization:
Gerhard Heyer: Challenges for Computational Humanities
Heyer started by talking about the research process:
Then the question is whether we can just take computer science off the shelf. The case is different if we need innovative science. Heyer suggested that we have the eHumanities, which bring together the Computational Humanities and the Digital Humanities, and which together are doing something new. He had a nice Venn diagram. CH is to computer science what DH is to the humanities.
Heyer discussed what it would mean for Computational Humanities to be a new discipline. We have to develop new methods. We have big data issues. We have issues around research infrastructures.
He talked about salience and "weak signals". We often want, in the humanities, the weak signals or the anomalies. He showed a tool for tracking low frequency terms through the news.
The dangers of people applying technologies they don't understand were mentioned, along with the sorry story of quantitative history in the 1970s. When methods are poorly applied, a field gives up on them. This creates a catch-22: we can't use the methods well unless we understand them, and we can't understand them without using them. Perhaps it isn't so bad, but we do have a problem with colleagues who are not going to invest in methods unless they can see useful results on their materials.
Anette Frank: Linguistics, Machine Learning and Digital Humanities
Distributional semantics = a word is characterized by the company it keeps. One can build statistical models of semantics based on the distribution of words rather than … She walked through how we can do different things:
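As a toy illustration of the distributional idea (the tiny corpus and the window size are invented for the example), one can count each word's neighbours and then compare words by the cosine similarity of their context vectors:

```python
import math
from collections import Counter, defaultdict

def word_vectors(tokens, window=2):
    """Characterize each word by counts of the words around it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values())) *
            math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

tokens = ("the cat sat on the mat the dog sat on the rug "
          "the cat chased the dog").split()
vecs = word_vectors(tokens)
# 'cat' and 'dog' keep similar company, so their vectors are similar.
print(round(cosine(vecs["cat"], vecs["dog"]), 2))
```

Real distributional models work the same way in principle, just over millions of tokens with weighting and dimensionality reduction on top.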
We then went back to short talks.
Max Schich started with a point about how humanists and technologists should work together. Sort of like pair programming. He argued that we need to study culture the way biology studies life. He wants to do cultural science. He argued it wasn't a matter of a scrum, but of a cloud. You can't engineer the solution because we don't know what the solution is.
Susan Schreibman talked about doing digital editions. Now she is working on a project "Downstream for the Digital Humanities" which tries to track discussions of certain issues in DH.
David Smith is at Northeastern University's NULab for Texts, Maps, and Networks. He is in computer science. He is looking at documents and trying to find representations from which to map a relational structure. He is working on machine learning that is based on what people actually need to do and doesn't depend on massive tagging and POS annotation. He applies this to tracking poetry through newspapers and to tracking policy ideas.
U. Schmid is at the University of Bamberg, where they have a BA and MA in applied computer science. She talked about white-box learning. In black-box learning you don't know how the machine learned; in white-box learning you build a system that learns and can tell you what and how it learned. They build both black and white boxes and try to see what is lost in the white box. She wants to see inductive functional programming.
Caroline Sporleder is at Trier and raised questions about what the research goals are in the humanities. What aspects of humanities research can be formalised or automated? How do we ensure the trustworthiness of results?
Manfred Thaller talked about the Clio system that he has been extending to support historians. The idea is to insert software between the historian and their sources. He is concerned with the way we have to keep reinventing the same wheels.
Claire Warwick talked about how to link humanists and computer scientists. She is interested in user centred design and community engagement.
Katharina Zweig does graph theory. She analyzes big networked data. She watches people playing games to generate networks of paths. She has found that there is no established way to discuss successful paths.
We then had our working groups and a meeting where the groups shared ideas. In my group we practiced a close reading of the CSEC slides. We used a pad to keep notes at: https://pad.systemli.org/p/impSoc5
Some of points that we came up with include:
Chris Biemann: Machine Learning for Uncomputer Scientists
Chris' talk was just at the right level for me and he tolerated my random questions. It covered:
General concepts of Machine Learning: Machines don't learn the way we do. They can be taught very specific, narrowly defined tasks. They don't have motivations. For a machine, finding a giraffe and finding a non-giraffe are the same thing.
Features: Features characterize instances. A computer can decompose the instances provided to it to find features that are shared. Features can be complex. The secret in ML is in the features, not the algorithm. A feature-function takes an input, like a word, and produces an output, like "true" when the word matches.
Example: He gave the example of "Dagstuhl is located in Germany" and how different features could be applied to its words. A feature-function looking for a city or country would come back true for the first and last word. He gave a number of examples of cool things one can do with words. One can use pre-processing steps as functions that produce, for example, a lemma; the output of another ML system can itself become a feature-function. What matters is that the function is reliable - that it is deterministic.
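A sketch of such feature-functions in Python (the tiny gazetteer is a stand-in I made up; real systems use large lexical resources):

```python
# Toy gazetteer: a stand-in for a real list of places.
PLACES = {"Dagstuhl", "Germany", "Trier"}

def is_place(token):
    """Feature-function: True when the token names a known place."""
    return token.strip(".,") in PLACES

def is_capitalized(token):
    """Feature-function: True when the token starts with a capital letter."""
    return token[:1].isupper()

sentence = "Dagstuhl is located in Germany.".split()
features = [(w, is_place(w), is_capitalized(w)) for w in sentence]
# The place feature fires for the first and last word only.
print(features)
```

Each function is deterministic, which is the reliability Biemann asked for; a learner never sees the sentence itself, only the feature values these functions emit.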
He talked about distributional semantics - we can know a word by the company it keeps. For disambiguation, for example, one might want to look at the surrounding words. You can also characterize a data point by its contents (i.e. a document by its word frequencies - TF*IDF). This is used for documents. It is much easier to learn from vectors than from other types of representation.
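A minimal TF*IDF sketch (the three toy "documents" are invented for the example):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Represent each document as a sparse TF*IDF vector over its words."""
    n = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # term frequency within the document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

docs = [d.split() for d in (
    "the fall of the wall",
    "the defenestration of prague",
    "the machine learning of features",
)]
vecs = tf_idf(docs)
# 'the' and 'of' occur in every document, so their weights are zero;
# distinctive words like 'wall' keep positive weight.
print(vecs[0])
```

The resulting vectors are exactly the kind of representation that is "much easier to learn from": every document becomes a point in the same space, comparable by standard distance measures.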
Machine Learning Algorithms: A ML algorithm needs to be told the features. When learning a classifier, every feature introduces a parameter that has to be learned, and the more parameters must be estimated, the more training data we need to estimate them reliably. There are also other problems, like complex features or sparse features (features that are almost never observed, even if they are great). Then there are features that are dependent on one another.
A neat thing is that one can also get the computer to pick or generate features. In feature selection you have too many features and the machine selects those that do the heavy lifting. Or you can induce features by randomly generating them and genetically selecting those that work. This struck me as an interesting direction.
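A minimal filter-style sketch of feature selection (the feature names and labels are invented; real systems score features with measures like information gain or chi-square, or search genetically as described):

```python
from collections import Counter

def select_features(instances, labels, k=2):
    """Keep the k features whose presence differs most between two classes."""
    pos, neg = Counter(), Counter()
    for feats, label in zip(instances, labels):
        (pos if label else neg).update(feats)
    score = {f: abs(pos[f] - neg[f]) for f in set(pos) | set(neg)}
    return sorted(score, key=score.get, reverse=True)[:k]

# Toy instances: sets of binary features, with class labels 1 and 0.
instances = [{"ends_in_ly", "short"}, {"ends_in_ly"}, {"short"}, {"capitalized"}]
labels = [1, 1, 0, 0]
# 'ends_in_ly' separates the classes; 'short' occurs equally on both sides.
print(select_features(instances, labels))
```

The machine keeps the features that do the heavy lifting and discards the ones that carry no signal, without a human deciding in advance which is which.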
He has a cool annotation tool for training called WebAnno: http://www.lt.informatik.tu-darmstadt.de/de/software/
There is no task that a computer or a human can solve error-free. Before we complain about ML we should realize that we aren't that reliable either. It was also pointed out that psycholinguists are discovering that a lot of children's language learning seems to be unsupervised learning (with motivation).
What is ML bad at:
What ML is good at (with training) includes:
In general ML is good at capturing the ordinary. It shows you what you know.
Where do decision trees come in?
Wednesday: July 23
Greg Crane: Wissenschaft, Humanism and the potential of Computational Humanities
Greg Crane started his talk by hoping that he wouldn't say anything new, but framing it differently.
Geisteswissenschaften - in German this covers all sorts of activities without separating out the humanities. Administratively, the big German granting agencies make a point of funding the humanities. They are also willing to fund fields across the humanities and sciences. He argued that Germany is, for this reason, the best place to fund things. (I would argue that Canada is also great, but that is another issue - he is specifically comparing things to the US.)
He talked about how STEM deprioritizes the arts and humanities. Even in Germany they have MINT (the German analogue of STEM), which deprioritizes them in the same way.
He argued that "the Humanities exist only insofar as they are internalized in human minds." By this I think he is suggesting that the humanities are not just about human expression but are also sustained or defined by people. There would be no humanities without people, while scientific laws should continue without people to think them. We need public engagement for humanism to thrive.
One of the big problems we deal with in the humanities is "What do you do with imperfect information?" He talked about Tiananmen Square and the fall of the Berlin Wall. We have imperfect information about Tiananmen (June 1989) and perfect information about the Leipzig demonstrations that led to the collapse (October 1989). The DDR folk knew what had happened in China and how the Chinese stopped it. (See Mary Sarotte, "The Collapse: The Accidental Opening of the Berlin Wall".)
To some extent knowing everything about what happened in Leipzig doesn't necessarily help. Crane argued that the fall is the end of a process that starts in Trier in the 4th century with the first Christians martyring other Christians. One of the reasons we have the humanities is to mediate a dialectic with the past. Successful countries reinterpret the past or use the past to understand their choices. It is "To remember, to critique, to praise, to blame and (insofar it is possible) to redeem."
What are the limits of information? What would it mean to have powerful formal models representing culture if it was only available to specialists? Again, I think Greg is leading to the importance of public engagement. We need not only information, but "augmented consciousness". He wants to develop people's understanding of human actions and expressions. One way of thinking of our mission is to let people look at a manuscript and help them understand it.
I wonder if we are seeing two ways forward for the humanities. One is to become a science and one is to become a public. Computational humanities, as described by Alex and Andy, would develop a scientific humanities. Greg's view is to crowdsource the humanities. I suspect they are going to be reconciled by the end of the week.
Crane talked about Edward Everett and his talk justifying European colonialism. Everett is what happens when you don't have humanistic criticism. Likewise, where were the humanists during WWI? Humanists and classicists were part of the problem, teaching about the
He then asked us "What do you imagine for 2019 or 2114? What values do you bring? Are our institutionalized humanities advancing national or nationalist identity? Does your work advance an interest group? Do you have a state religion?"
Humanist education requires:
Then there are things to worry about like academic scholasticism where we are attached to our knowledge (and not interested in new ways of knowing) and where we are attached to virtuoso performance.
We need to model and explain the transactional costs of getting information. We need to drop the barriers to access and techniques like topic modeling need not outperform an expert to be useful.
We need to stop focussing on virtuoso performance that is inaccessible. He made an ethical argument about the professoriat. If we can't teach our research there is a problem. It is a scholastic fallacy that experts decide what is and is not important. We are all public servants and we need to maintain a public consensus about the value of our work. Our internal challenge is how to do more research, but there is dramatic change.
Don't expect really successful humanists to do really new work - if they are successful, why would they change?
Think in terms of what matters and fight for that!
I asked about how his vision of Computational Humanities matches the other visions I've heard. He argued that he is presenting the ends for which other visions are means and tactics.
Bettina asked a good question about the incentives of money. The money doesn't want citizen science or teaching kids - they want applied research that has economic impact. Criticism doesn't pay.
Manfred Thaller was critical of Crane's vision. He feels that the humanities lack a vision of what they are doing. The millions of classics students don't really drive things.
I can't help worrying about the connection between developing a critical public and crowdsourcing the humanities. Criticism is seen now as political and not public work (at least in North America.)
Excursion and Dinner
We had a nice excursion to Trier and dinner in a winery. Loretta Auvil gave me all sorts of good ideas about text mining tools:
Chris Biemann also pointed me to the neat tools his team has developed at http://www.lt.informatik.tu-darmstadt.de/de/demos/
Thursday, July 24th
David Mimno: Transparency
David Mimno talked about the need for different types of transparency. It is important that tools show their workings in different ways that can help humanists. They can expose their objectives, their implementations, the variables and so on.
I asked about how we can show the iterative processes. He talked about how topic modelling is intensely iterative as you edit your vocabulary. He went further and said that for humanists the iterative trimming of words is interesting as a way of thinking about the text, even if the results are not spectacular.
He showed three really neat examples starting with a simple bit of code that fits a line to generated points. The line is the model.
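Mimno's first example was along these lines (my reconstruction, with invented data): generate noisy points around a line and fit one back by ordinary least squares - the fitted line is the model.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = (sum((x - mx) * (y - my) for x, y in points) /
             sum((x - mx) ** 2 for x, _ in points))
    return slope, my - slope * mx

random.seed(0)
# Points scattered around y = 2x + 1 with a little Gaussian noise.
points = [(x, 2 * x + 1 + random.gauss(0, 0.1)) for x in range(20)]
slope, intercept = fit_line(points)
print(round(slope, 2), round(intercept, 2))
```

The transparency lies in how little is hidden: you can see the data generated, the objective (least squares), and every variable of the model.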
Szymon Rusinkiewicz: Reassembling the Thera Wall Paintings
Rusinkiewicz talked about the neat Thera project where they have lots of fresco fragments that they want to document and reassemble. They used different scanners - a 3D laser range scanner. They also used a flatbed in an innovative way to get a 3D scan. They created software that automated a pipeline. He talked about a ribbon matching process for finding fits and he showed an interface for trying to match fragments.
Brian Joseph: The Web of Language
He started by pointing out that language(s) drive the web which is why linguistics is so important. What could the web mean from a language perspective:
He reminded us of a seminar paper by Christiane Fellbaum on "What can Computational Linguistics (not) do for the Humanities?"
I would add that the web can also be a place of play and that there are hybrid places (if place is the right metaphor) which are both web and real.
Christopher Brown: Center for Digital Research in the Humanities
Christopher gave us a survey of digital humanities at Ohio State. They have discovery themes that include data analytics, and there is a lot of government funding in big-data analytics. There is an Advanced Computing Center for the Arts and Design. They have a Global History of Health project. There is an emphasis on e-learning - they are creating open learning materials and MOOCs.
We had a joint session with the other seminar on paleography and computer vision. Some issues that came up from the Paleography group:
The Reflection group discussed the CS/DH relationship. What are the roadblocks?
What might work?
Typology of projects:
The Mixed Bag group discussed:
I presented on behalf of the Impact group. I presented (among other things) an outline of a discussion of ethics and big data:
There seemed to be a lot of anxiety about the relationship between computer scientists and humanists. It is great to hear CS folk worrying about this. I think we (humanists) worried about it alone - now there is a dialogue.
Susan Schreibman: 1916 Project
Susan talked about various projects related to the upcoming centenary of the Easter Uprising of 1916 in Ireland.
Contested Spaces is part of a Mellon project trying to model how virtual worlds can be used for research. They therefore need to work in common with others in the consortium. They are trying to find light-weight tools for doing the work that others could use. They are modeling a block of Dublin where a battle took place so that questions can be asked about the event.
She sees herself as doing something like a digital edition where you have the model/edition and the evidence used to make it.
Another project is the Letters of 1916 project, which is trying to show what life was like in that year. Here they crowdsourced the project to get letters donated and encoded. Crowdsourcing changes how you do the project - among other things, you build the tool at the beginning.
She ended with some examples of social network analysis and text mining.
Greg Crane and Manfred Thaller Discussion
It was clear at the end of Greg's talk that there was a disagreement between Greg and Manfred so a discussion was organized.
Greg started by making a few points. He commented on how they might agree violently. His one central idea is that we are looking at a radically new and deeply traditional form of education that happens to be 200 years old. The German tradition is about instructor and student serving Wissenschaft, not each other. The American idea is one of class formation. He went further and argued that without classics the other humanities are exposed. He pointed out that if research is not integrated into pedagogy then people will ask about the need for research. He talked about how students in STEM are integrated into labs and get to do pseudoresearch. In the humanities they are told they have to wait until they have been stuffed with information.
This seems very different than the teaching critical thinking argument he presented last time. This argument is more about bringing undergrads into research.
Manfred agreed with the value of putting stuff up, but feels what Greg describes is a side-effect - a welcome one. Up to the 1960s the humanities provided an implicit vision that by looking at the past one could help the nation. The humanities used to study very small amounts of information - now we can look at much larger data sets. Someone should look into how changes in scale change different fields. We may have a Kuhnian paradigm change.
He hopes that the humanities can establish that they are those that understand what happens in 1 million books. The humanities will have a better future if we can explain a mission/vision that goes beyond the incomprehensible. He believes that a part of that vision should be that the humanities teach people not just to answer certain questions, but to ask new ones.
Tony Davies argues in Humanism that "All humanisms, until now, have been imperial. They speak of the human in the accents and the interests of a class, a sex, a race, a genome. Their embrace suffocates those whom it does not ignore." I wonder if their arguments are really for the humanities or for the ethical pursuit of knowledge. Do we need to save the humanities or ditch it for something better.
David argued something similar for CS - should we be trying to save CS or ditch it for something else.
Friday, July 25th
Last night I read the position paper of one of the organizers, Christiane Fellbaum. In the paper she says:
[A] political scientist colleague believes that we can help him to better understand the concepts of “self-determination” and “nationhood” and their changes throughout time and space by means of sophisticated textual analysis. I have not (yet) managed to persuade him that such lofty goals are beyond our computational wizardry and that we are not yet prepared to make meaningful contributions to the deep understanding of these difficult notions, which political theorists have struggled with for a long time.
While she is right that we can't automatically generate analyses of such concepts, I think this is the sort of ambitious challenge that we need to work towards, and her questions point the way. I also think the bar for interpreters is lower than for computer scientists. A tool for "better understanding" doesn't have to solve a problem or prove/disprove a hypothesis - it needs to suggest lines of inquiry. I would call this the hermeneutic challenge: how can computers help us better understand concepts in the process of thinking them through? I don't expect computing to end discussion; I want it to assist discussion. My sense of the questions we need to answer (based on Fellbaum's) is:
I later talked with Cathleen Kantner who pointed me to a neat project she is part of at the University of Stuttgart on tracking complex concepts through newspapers in different languages. The site for the project is here (in German). There is a paper in English that explains it titled Towards a Tool for Interactive Concept Building for Large Scale Analysis in the Humanities (PDF).
Page last modified on July 25, 2014, at 10:22 AM