
Digital Humanities Concepts 2015

These notes are about the conference at TU Darmstadt on Key ideas and concepts of Digital Humanities. They are being written live and therefore include all sorts of distortions, gaps and so on. Thanks to Wilhelm Ott for corrections on his talk.

The Twitter hashtag is #dhconcepts

Andrea Rapp: Welcome

Rapp and others from TU Darmstadt started the conference and talked about the Digital Humanities initiative and the Institute of Linguistics and Computational Humanities. TU Darmstadt has a Masters in Linguistic and Literary Computing and is just introducing a BA.

Michael Sperberg-McQueen: What does descriptive markup contribute to digital humanities?

Michael thinks that descriptive markup is an important (but not necessarily the most important) concept of DH. Markup is a tool, and tools are not value-neutral, so it requires critical examination. He believes that it provides a compelling account of the nature of text. His slides are here: http://blackmesatech.com/2015/10/KIaCiDH/#(2)

By descriptive markup he means a set of practices and a complex of ideas that is broader than SGML/XML, though those technologies are very important to it. He presented a number of arguments:

  • Documents have structure that is worth exposing to software - this is not new or surprising; every text we look at has internal structure.
  • There is no global vocabulary adequate to everything we want to do.
  • Documents can be made reusable by representing them in an application-independent and vendor-neutral form. We in DH want data longevity, as software doesn't last very long while data does.
  • It makes more sense to encode structure in an open way. That leads us to ontology, as it becomes an issue of what things we consider structure.
  • Declarative semantics may allow us to reason about representations in ways that imperative semantics don't.

He then talked about validation and verification. He quoted Johanna Drucker to the effect that XML imposed a single hierarchical structure on texts. For Michael that is a half-truth that is itself only partly true: there are ways in XML to represent a directed graph. He went further to say that there is no structure to XML. This struck me as true, but not true enough - the way XML is implemented strongly encourages hierarchies. He admitted as much when he discussed how it can be difficult to use secondary hierarchies. He made an engineering claim for the value of XML, and on this I think he is right.
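
To make the idea concrete for myself, here is a minimal sketch (my own, not Michael's; the element names are simplified and invented, not real TEI) of what descriptive markup gives you: the tags name what things are rather than how they look, so any XML-aware tool can recover the structure.

    # A toy descriptive markup example: tags describe what things are
    # (speech, speaker, line), not their appearance. Element names are
    # invented for illustration and simplified from real TEI.
    import xml.etree.ElementTree as ET

    sample = """<play>
      <speech>
        <speaker>Hamlet</speaker>
        <line>To be, or not to be, that is the question:</line>
        <line>Whether 'tis nobler in the mind to suffer</line>
      </speech>
    </play>"""

    root = ET.fromstring(sample)
    # Because the structure is explicit and application-independent,
    # a few lines of code in any XML-aware tool can recover it.
    for speech in root.iter("speech"):
        speaker = speech.findtext("speaker")
        lines = [line.text for line in speech.iter("line")]
        print(speaker, "speaks", len(lines), "lines")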

He then talked about COCOA as an alternative, but neglected to mention some of the useful aspects of COCOA.

It would be nice to have better data structures, better ways to document tag sets, and better validation mechanisms. Can we make markup better in our real work?

His last question gets at an important issue. I would argue that XML has (had?) become a sort of interface for digital humanists that gave them control over the editing of the text. Now the demands of rich encoding make much XML too complex for humanists to read or use as an interface. It is becoming a hidden data standard, gestured to more than used. What difference does that make?

George Landow: It's all Google's fault!

Landow talked about how technological terms like "boot up" have come and gone out of "analogical utility." The technologies of the old generation don't make sense to the next.

The first law of media: "no free lunch". Every medium has costs and advantages. Writing supports asynchronous communication - you don't need to be in the same place at the same time. Everyone thinks they want immediacy, presence. Do we really want unmediated access? Think of all the things print text is good at.

He showed an animation by Paul Kahn of how the Memex might have worked. Landow pointed out that everything said about hypertext today was true of the microfilm ideas of Vannevar Bush. He then showed a screen of Intermedia - a great networked (LAN) hypertext tool with features still not found elsewhere. He talked about Stretchtext, which looked like a notebook that could expand and collapse.

Then along came Google, and now we are stuck with the web mediated through them. What made the world wide web the killer app? He talked about how the web is limited and how it is a fancy version of gopher. How can we get back some of the great ideas? How can some of the key ideas be recovered?

The issue he is interested in is that we not only have to develop electronic media that enrich the reading experience, but that we have to show people how to use media. There is no unmediated reading.

Marco Passarotti: Great expectations seeding forests of trees. Some key ideas of Digital Humanities in Father Busa's own words

Passarotti is the Director of the Index Thomisticus. He was the pupil, friend, and collaborator of Father Busa. He is now taking the Index in the direction of tree banks.

Passarotti started by joking about the T-shirts that have our (the speakers') names on the back. He then talked about Father Busa. Busa's first paper was from 1962, which was in the middle of the Italian economic miracle and the cold war. Busa talks about automation and the anxieties of the times, and about the dialogue between technologists and humanists. Busa used the metaphor of collecting flowers - the humanists pick the choicest ones, the automators mow the whole field. Passarotti went on to show the types of questions Busa was asking and how many of them can be tackled with computational linguistics methods. Busa concluded that the activities of production, trade and defense demand automation of information retrieval, something that calls for the expertise of the humanities.

He then jumped 30 years to a Busa paper that reflects back on linguistic informatics. Busa felt that nothing much had moved forward - it was all the production of concordances without doing the difficult work of computational linguistics. He felt we need ways to connect pronouns to nouns, to formalize the semantics of words, and so on. He felt that linguistics would explode into an information industry. He sounds like he was thinking about big data.

Passarotti then talked about the documents one can get access to. He is editing, with Nyhan and Ciula, a collection of Busa papers. Then there is the Busa Archive at the Università Cattolica del Sacro Cuore in Milan. Passarotti showed some documents, like an early letter talking about the project. They have a lot of materials from the Index, like punch cards.

Lastly Passarotti talked about the current projects of the Index Thomisticus. In 2006 they started the Index Thomisticus treebank. This is the largest Latin treebank available.

Manfred Thaller: Automation on Parnassus - Clio / κλειω and social history

Thaller started with a bit of history. He got interested in statistics in Vienna and wondered if one could apply stats to history. He learned to program and got roped into all sorts of cool projects dealing with everything from the 9th century to 1972. This led him to imagine a system that could support all sorts of things one wants to do to historical sources. He described an architecture which deals with different data types. He noted that a lot of what historians do is work with sources and they do this through text tools.

He then showed some neat image annotation tools.

He joked about how "workstations" were then as sexy as the "cloud" is today. His workstation ideas were based on the assumption that databases could be treated as source publications.

He is imagining an environment that lets all methods be applied to historical sources. For that one needs a conceptual model translated into a high-level technical model which is underpinned by low-level software. Why not teach programming instead? The challenge of supporting all methods is that there is a close connection between data forms and methods. One needs to change the source data form regularly as one wants to support new methods.

Sources may be commented, but never corrected!

Day 2

Susan Hockey: Perspectives on some key developments in text-based applications

Susan talked about her background and how in humanities computing we are doing what we do in the humanities - teaching critical thinking.

She talked about computing at the time which was mainframe based. A key problem then was how to represent character sets on the computer. There were all sorts of problems around how to represent sources.

She talked about how the engagement with computing forced humanists to think about their practices and formalize them.

She talked about her first work to output Arabic on a graph plotter (screens weren't yet a standard interface). This forced her to think about typesetting and efficiency.

She talked about the web and the pros and cons of it. There was too much information and emphasis switched from analysis to presentation. She felt there was a lot of reinventing wheels.

She talked about the issue of courses - what do you teach? Encoding? Programming?

She talked about the challenges of growth. It used to be a small community that had to deal with all sorts of new issues when things grew. How to stay in touch? How to collaborate? How to work with people in other disciplines? How to deal with funders? Funding streams had to be justified and used credibly.

She summarized achievements including how to maximize investment in the data.

Higher Education in the UK mostly speaks to itself. There is a huge community out there that is very interested in all sorts of things. There are lots of volunteers in the cultural sector who are helping with all sorts of digital projects.

She talked about how we get a lot of credibility from having a reputation for thinking ahead. She invoked Busa to the effect that we should "imagine the future and aim for it".

Elizabeth Burr: Digital Humanities - The long way to teaching and learning a new

She talked about her background and how languages were researched when she was a student; they tended to be studied in the abstract. She wanted to use real examples - building a corpus of newspaper language usage. She talked about finding the ALLC/ACH community. Her first conference was in Siegen, as was mine.

In 1998 she used TACT-web to put the corpus online for teaching which led her to talking about teaching.

She talked about issues of women and gender. She mentioned a group called CHiME. There are all sorts of stereotypes about gender and the humanities. CHiME's objectives were to overcome the stereotypes and to address women's needs and goals.

In 2007 she started a Digital Humanities project and began to set up a European summer school. The Manifesto of the Neogrammarians had divided linguists, so the manifesto needed to be digitized and documented. In 2009 she started the summer school; she struggles to get funding for it every year. The summer school builds community and addresses the gender gap. She has designed a programme that is convivial and that is about more than just learning tools.

Wilhelm Ott: Designing humanities computing tools: insights from a 49-year trip from assembler programming to an XML-based toolbox

Wilhelm Ott started by talking about how his first course on computing was in Darmstadt in 1966, working on the IBM 7090 at the Deutsches Rechenzentrum, which was the first one in Germany for research. The FORTRAN programming language wasn't very good with characters, so the Darmstadt team had developed a set of subroutines for handling texts. The exercises were not that interesting, so during the course he started to develop a program for the metrical analysis of Latin hexameter. The University of Tübingen, where he was at this time a student of classics, also created one of the first humanities computing positions, which he took in October 1966.

In Tübingen, where he ended up, there was a Control Data 3200 computer and he had to recreate the basic subroutines. Then he had to help with a lemmatized concordance which had to deal with different character sets and linguistic features. These projects were supported by special FORTRAN tools. It became impossible to support all the projects without teaching programming. All this led to a general tool, TUSTEP (later also TXSTEP), one of the great text analysis and typesetting tools.

From 1973, he organized the well known series of Colloquia on data processing in the humanities that not only helped computing humanists to exchange ideas but also gave feedback for the developers of TUSTEP.

Ott had fabulous slides showing the concrete steps and printouts of concording with mainframes. One gets a sense of how TUSTEP grew out of a wealth of real projects. There were concording programs out there, like COCOA (1965), but these didn't do lemmatized concordances. They developed a number of modules that made for a system that is not a black box but a scripted environment. That about 1600 reference works have been developed using TUSTEP is a testament to its usefulness. In particular it has modules for going all the way to publication, something few other tools can do.

The new TXSTEP, the XML version of TUSTEP, is an XML scripting language making it more accessible to an international audience.

Fotis Jannidis: Using large text collections for text analysis

Jannidis read a paper about using large collections in literary studies. He talked about what we hope to do; the history of these methods is also a story of fantasies about what it would be like.

He talked about a history of large text collections in text analysis. He specifically focused on a trajectory of method. Crane in 2006 asked "What to do with a million books?" What he described then seemed visionary, but today seems weaker. Crane discussed information extraction.

More recently the discussion has been around distant reading. One of the methods that changed our fantasies is LDA or topic modeling. He talked about the explosion of articles about and using topic modeling.

He talked about the real or imaginary fields of application of topic modeling. Jannidis commented on the rhetorical move in the Blei article about topic modeling: the article just shows the topics and lets them speak for themselves. The problem is that it is hard to evaluate. Every time you run LDA you get different results. There are also assumptions in LDA, like the bag-of-words assumption: neither the order of words nor the order of documents matters. It also assumes the number of topics (chosen as a parameter) is the right number, whatever it is.
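
To see what these assumptions look like in practice, here is a minimal sketch (mine, not Jannidis's; the toy corpus is invented) using scikit-learn: the texts are reduced to bags of words, the number of topics is a parameter you must choose in advance, and different random seeds give different topics.

    # A toy LDA run illustrating the assumptions discussed above:
    # bag-of-words input, a fixed number of topics chosen in advance,
    # and run-to-run variation controlled only by the random seed.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the whale and the sea and the ship",
        "the ship sailed the stormy sea",
        "love and marriage in the country house",
        "the country house and the inheritance",
    ]

    # Bag of words: word order (and document order) is thrown away.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)
    words = vectorizer.get_feature_names_out()

    # n_components fixes the number of topics; changing random_state
    # shows that each run can produce different topics.
    for seed in (0, 1):
        lda = LatentDirichletAllocation(n_components=2, random_state=seed)
        lda.fit(counts)
        for i, topic in enumerate(lda.components_):
            top = [words[j] for j in topic.argsort()[-3:][::-1]]
            print(f"seed={seed} topic {i}: {top}")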

Newman talks about how the labeling of the topics is a subjective task. Topic modelling doesn't discover real topics, but the illusion of topics. The method hits the text and it sparkles.

Hall, Jurafsky, and Manning (2008) use LDA to look at the history of ideas. They reran LDA over and over, taking the topics they liked from each run.

For Jannidis LDA spawned lots of experiments and lots of introductions like Ted Underwood's. The success comes from asking people to just look at the topics.

Underwood argues for LDA at scale. Jannidis points out that it is not surprising that in large collections there is a "topic" about just about anything. Underwood also makes a move about LDA for literary study: it is not about topics, but about "a discourse or a kind of poetic rhetoric." The ambiguity of a topic is an advantage for a literary scholar, as we can always make something of it.

Then Jannidis talks about Jockers 2013. For Jockers clusters and clouds are "self-evidently thematic" - they are themes.

Jannidis concluded by talking about how LDA as a method was adopted and shaped by humanists. Leonhardt 2014, "Mining large datasets for the humanities", suggests two tasks:

  • Looking for what you think is there - Google ngram viewer
  • Exploratory analysis where you let the data organize itself - topic modeling

Geoffrey Rockwell: Thinking Through Things Like Analytical Tools

I gave a talk tackling the question I believe we have to ask after Snowden: How can we ethically think through analytics?

Hans Walter Gabler: Digital challenges to scholarly editing

Gabler talked about the opening up of the editing environment, specifically to the public so that they can contribute. The constant dialogic dynamic of the digital is what turns online editorial environments into research environments. The research is about the ongoing editing and negotiation.

A scholarly edition is common ground for other research. Scholarly editing is a mode of research - a distinct way of doing research that isn't based on a theme. It incorporates its own history.

He talked about the essential openness of the work of art. He talked about the work's potential to mean and how the editorial process tries to maintain that.

It is the power of the language through which a text triggers dialogic and hermeneutic processes.

The response to the text responds to the processuality of the text.

Kurt Gärtner: Editions printed and/or digital: Toward an open critical edition

Gärtner started by talking about how we are haunted by the death of the book and the end of copyright. Gärtner feels that there are a lot of anxieties around issues of copyright. The University of Darmstadt was recently a defendant in a case about digitizing a book against the wishes of a publisher.

The founding fathers of German studies are the brothers Grimm and Karl Lachmann. Gärtner discussed the early history of scholarly editing. He talked about an important story, Poor Heinrich, that the brothers Grimm edited, which marks the beginnings of critical editing of medieval German texts.

He ended by thanking us for listening to philological krimskrams (knick-knacks) rather than big ideas, but was charming in his path through ideas.

Joachim Veit: Outside - inside: Two aspects of the digital turn in musical editing

Veit nicely took us away from text and towards music. Veit told a story of how he had a dream of researching a clarinet performance that he wanted to deliver to us. This led to talking about the online edition.

He talked about the principles of convenience and transparency behind digital musical editions. The convenience is that one can compare all sorts of things on the screen. He showed how his digital editions evolved over time.

Then he talked about the neat stuff they can do with musical annotations, from rendering to analytics. With the online version people can save their annotations.

Veit is behind MEI (the Music Encoding Initiative). One of the promises is that with a standardized encoding model they can do analytics or pattern searching. Unlike with text, without standardized markup they can't do any searching or pattern matching.

A difference between the music information retrieval world and text world is that there isn't really any big data yet in music.

Of course, the whole tour was faked as he is not a clarinet player, but a bassoon player.

Peter Robinson: Changing the world, one angle bracket and one license at a time

Robinson started by quoting Alan Liu to the effect that thinking critically about metadata should be able to scale to thinking critically about power, finance and governance.

Robinson started at Oxford and fell among folk like Susan Hockey, Nancy Ide, Lou Burnard and Michael Sperberg-McQueen. Text encoding and the TEI allowed him to develop the scholarly edition of Dante's Commedia edited by Shaw. He has built phylogenetic software to show the relationships between witnesses. The use of methods from evolutionary biology and of the entire corpus is what makes this edition special.

Some scholars argue that we aren't doing anything new, just travelling faster. Robinson suggests that this is not true: speed gives freedom and digital methods show new things.

Then he gave a quick history of editing. The Alexandrine Consensus was the old way of doing things that had inscribed academic authority. The internet threatens the consensus. If everything is online (rather than locked in a vault) everyone can see it rather than the lucky few. Also, the public can write their own.

Robinson then complained about how libraries are constraining things and about digital humanities centers acting as publishers. He gave as an example the Jane Austen's Fiction Manuscripts project. The need for funding constrains digital editions in a world where most information is more fluid and open.

It need not be that way. What has to happen?

  1. Fundamental materials (images, witness descriptions and metadata) should be made available freely on the web under a Creative Commons Free Cultural Works license. We need to speak to the importance of cultural licenses.
  2. This means you (us). We need to give up on copyright. We tend to be happy when others give up copyright, but are unwilling to do it ourselves.
  3. We shouldn't just say we will give stuff away; we have to really do it. We have to make sure stuff is not locked up behind an interface. Only when we really give it up will we be able to merge it into new aggregations.
  4. We need to agree how we name documents and the texts they contain so data can be merged.

He then talked about textual communities, a tool that makes it easy to give away your work.

He finished by returning to Liu and arguing that we need material to think critically about and we have that in the issue of shared metadata.

There was an interesting question about how people could misuse our materials. Peter answered by pointing out how we can use moral rights to handle this.

Podium Discussion

They had a really neat closing session to the day when they invited a number of members of the audience to respond as a panel to the talks. Some ideas from the podium:

  • The hack is the yack - the conference was about concepts, but there is a lot of theory in tools
  • The MA here (Darmstadt) nicely merges a number of different threads, from math, to history, to language teaching. It sounds like Darmstadt has created a great intellectual culture with weekly talks.
  • The multidisciplinarity of the conference stands out
  • How old the field is - how many projects there were before it was institutionalized
  • We need to be open to yet more disciplines
  • We need crazy ideas and the library could be a place for them - the library could connect all books with one another
  • One of the things that is striking is the way the conference historicizes the field - partly because so many of us have been around
  • Is the field repeating itself? Historicization should help us evolve rather than repeat
  • Is there a feeling among people new to the field that the history matters? Or is it just about learning the new stuff
  • There was a question about the definition of the digital humanities
  • Does each generation reshape the field? Would we be interested in the wider question of how digitality has been deployed in the humanities? How would a survey work to answer that question?
  • Some forms of DH that used to be important are missing here - the digital arts and language learning. Likewise
  • What can we learn from computational linguistics that isolated itself as a way of building the discipline?
  • Has there been a "digital science" phase comparable to digital humanities? Bioinformatics and computational science?
  • In institutional disciplines the questions are narrow. In DH one can play with others and ask bigger questions.

Day 3

Julia Flanders: Looking for Gender in the History of Digital Humanities

Julia Flanders started by apologizing that she is working on this subject and therefore will weave a lot of ideas together.

How do you look for gender in the digital humanities? She started by looking at statistics. ACH has had female presidents for the majority of its years and (I think) a majority of female officers.

What she couldn't tell is whether the gender of the officers made any difference. She went on to position herself and talk about how her background gave her all sorts of advantages.

She talked about working for the Women Writers Project. She started at the WWP at the time when access (through the web) was becoming important. The circumstances activated a politics of inclusion animated by ideas of rectifying the canon. Many early projects started this way. In all of these women's writing efforts gender is visible as a cultural category and they intensify the category. They combat misconceptions about what is there - they show what people didn't think was there. They establish marked spaces, but leave gender unmarked in general collections. Many of the major resources, like Google Books and EEBO, don't have gender information.

What is striking is that the only gender one can study is the female. Attention to gender is attention to women. Women are now visible, but gender is hard to study. The projects create distinctive research spaces that are valuable but limited. This challenges the idea that if you just add women back into culture and stir you solve everything, while ignoring other categories.

She then started talking about identity. Does this require that we are epistemologically committed to physical difference? Do we need ideas about performing female? What do we do with texts where the author may not have been female, but presented as such? Is the WWP collecting women writers or writers who position themselves as women?

Gender as identity - should we pay attention to what authors think their identity is? Is gender a self-evident category? Should gender be treated as a category of identity when for many authors it is not? What can be important is how gender shapes access to power.

Hypertext theory argued that hypertext had a structure and a politics - that hypertext re-architected the relationships between authors and readers. A gender politics could be embedded in our ideas about textuality. Authority traditions in scholarly editing can construe the world hierarchically in terms of masculine universals and feminine specificities. Gender becomes a way of explaining or illustrating other structures that don't really have gender. Given the centrality of editorial theory to the digital humanities, the work on the gendering of such theory has implications to the digital humanities.

She wants to pay attention to the ways that gender politics influence the architecture of thought. (Check out "The Power to Name.") Information systems based on difference might carry gendered ideas because gender is seen as primary difference. Gender becomes a way of explaining other things, but this is based on simplistic readings of gender in order to ground explanations. This has the effect of making attention to gender difficult.

An attention to gender is assumed to be a feminist agenda - but why is that? Why shouldn't everyone be interested in the gendering?

Cataloging is a work process that takes place at scale. Could that create stresses on the categories and encourage simple categories?

The inventory of female and male bodies belonging to those given power has gotten a lot of attention. (I think she was referring to Deb Verhoeven's challenge at DH 2015.) Shouldn't the content of those minds also be important?

She personally feels her cultural determinism far more strongly than gender determinism. She, like many of her colleagues, is working hard, belatedly, to overcome the limitations of her privilege and upbringing. She ended with the saying:

"Try again, fail again, fail better"

Claus Huitfeldt: Philology and text technology

He started by talking about philology. In a narrow sense philology is the careful study of words. In a larger sense it includes all sorts of studies of human traces or acts, past or present. To a large extent we are talking about the study of documents.

The digital has influenced things so much that he wondered if we need to talk about "digital" humanities at all or just the humanities.

Huitfeldt talked about joining the Wittgenstein project and the TEI. They worked for years and were able to publish an edition on CD-ROM. Once the project was finished he left. What did he learn from that project?

  • When running a big project, plan well ahead because once you have done a lot of work it is hard to revisit early decisions
  • Be prepared that 80% of your time will be spent negotiating copyright and getting funding
  • The Wittgenstein project is still going - not because of manuscripts, but because of the community
  • They are re-encoding
  • The facsimiles are being redone - this is not a problem, it means that people are using the materials and find they need better facsimiles, which is good

Then he talked about formal methods in the humanities and sciences. He talked about how there was so much energy in the debate around AI among folk that don't work with computing in the humanities. That debate has since dissipated in philosophy.

Huitfeldt mentioned "A competitive framework for studying the histories of the humanities and sciences." It discusses how disciplines could be studied in terms of formal methods. The author talks about methods like stemmatics, formal grammars, source criticism, and philological methods. Huitfeldt made sense of this claim where I would have dismissed it as having little to do with the realities of disciplines. It is interesting to see how methods move across disciplines.

He then talked about whether NLP could help with automated markup. He asked: what do we want to do? What are we interested in? If we start from what we want to do and try to formalize the methods, then we can ...

We have gone from printed editions of written source texts to digital editions of printed source texts. What about when we want to create digital editions of digital source texts? John Lavagnino has asked this question. Do we have any answers?

Julia Flanders asked about the digital critical edition of the digital critical edition.

Nancy Ide: The TEI Legacy: Where we have gone from there

Ide started by talking about the TEI. It started in 1987 with funding from the NEH. She and Michael SM organized a Vassar Workshop with about 35 people.

The DH and TEI communities were interested in reusability long before others. It wasn't until the 1990s that the computational linguistics community got interested.

She talked about the split between digital humanities and computational linguistics. The two split in the late 1960s as the CL folk got into logic-based processes and toy applications. In the late 1980s CL got into statistical analyses.

Humanists are focused on details and thoroughness, perfection, skepticism, and circumspection. CL folk, on the other hand, want to get the job done, work with generalities, don't need perfection, are OK with believing things, and don't show much circumspection. Very different fields, or ways of thinking.

Why didn't CL adopt the TEI? At the time they didn't see it as doing something that served their purpose. They saw their purpose as adding linguistic info to texts to develop models and learn things about language. The CL community wanted only one way to do things; a single way of encoding makes software design easier. So the CL community went off and developed their own standard. Some even thought they didn't need standards, as they could write scripts.

EU MULTEXT was a TEI application in 1994 - a streamlined core set with lots of linguistic phenomena not in the TEI. (X)CES introduced standoff markup - then called remote markup. The idea started with a TEI group whose work was abandoned. The idea in CL was that the text is inviolable; standoff allows all sorts of different layers to be added to a text. With multiple and alternative annotation layers for the same thing, the user can pick and choose layers.
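
Here is a minimal sketch of the standoff idea (my own illustration, not (X)CES itself; the text and labels are invented): the base text stays untouched and each annotation layer points into it by character offsets, so alternative layers can coexist and users can pick the ones they want.

    # Standoff annotation sketch: the base text is never modified;
    # layers refer to it by (start, end, label) character offsets.
    text = "Busa founded the Index Thomisticus in Milan."

    layers = {
        # Two alternative analyses of the same untouched text can coexist.
        "pos_layer_a": [(0, 4, "PROPN"), (5, 12, "VERB")],
        "pos_layer_b": [(0, 4, "NOUN"), (5, 12, "VERB")],
        # A named-entity layer added independently by someone else.
        "entities": [(0, 4, "PERSON"), (17, 34, "WORK"), (38, 43, "PLACE")],
    }

    def spans(layer_name):
        """Resolve a layer's offsets against the base text without changing it."""
        return [(text[start:end], label) for start, end, label in layers[layer_name]]

    print(spans("entities"))
    # [('Busa', 'PERSON'), ('Index Thomisticus', 'WORK'), ('Milan', 'PLACE')]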

Two influential efforts for encoding linguistically annotated data:

  • Annotation graphs (Bird and Liberman) - time stamped for speech
  • ISO Linguistic Annotation Framework (LAF)

Both were models, not specifications for labels, as it is hard to specify labels. The graph model was very helpful, and the model was separated from the serialization. This graph model has served for linguistic annotation ever since. Thus people can use other people's software.
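
As I understand the graph idea, it boils down to something like the following sketch (my own, much simplified from Bird and Liberman or LAF; the words and timings are invented): nodes are anchors into the signal (time stamps for speech, character offsets for text), labeled edges between anchors carry the annotations, and serialization is a separate decision.

    # A tiny annotation-graph sketch: anchors are nodes, annotations are
    # labeled edges between anchors. How this graph is serialized
    # (XML, JSON, ...) is a separate, independent choice.
    anchors = {0: 0.00, 1: 0.42, 2: 0.97, 3: 1.50}  # node id -> time in seconds

    edges = [
        # (from_node, to_node, layer, label)
        (0, 1, "word", "key"),
        (1, 2, "word", "ideas"),
        (2, 3, "word", "matter"),
        (0, 2, "phrase", "NP"),  # a second layer over the same anchors
    ]

    for start, end, layer, label in edges:
        print(f"{layer:7s} {label:7s} {anchors[start]:.2f}-{anchors[end]:.2f}s")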

The new mantra was interoperability. The goal was something that works. The graph model defines a syntax; semantics - the annotation labels - are not specified, as they are difficult to define in linguistics.

Descriptive markup is definitions of specific terms - a dictionary of categories.

What is happening in 2015? The big trend is linguistic linked open data. Semantic web technologies have matured to the point where we can begin to use them to represent annotated language data and relations. There is a linguistic linked data open cloud. The efforts now are on interoperability.

People are also trying to work towards a global network of tools and data. There are open language grids so that tools and datasets can all be used together.

She then talked about how CL and DH can collaborate. CL frameworks of tools could be useful to humanists. DH has some great datasets.

Where does the TEI fit into this? The TEI took on both syntactic and semantic aspects. The CL world has disentangled the various aspects and treated them separately. Now that the TEI is introducing stand-off markup, we should be able to join DH and CL layers.

She showed a neat visual programming environment. Tools rarely interoperate and data formats are not compatible - these big grid projects can help.

And at this point I had to leave.
