These are notes on the Conference on Data in Discourse Analysis held at the Technische Universität Darmstadt (TU Darmstadt).

This conference is, in some ways, the celebration of a new MA in Data and Discourse Studies. This MA looks like a really interesting programme that brings humanities and social sciences together.

As is always the case, these notes are being written during and after and even before the talks so they will be full of misunderstandings. Please contact me with corrections.

Tuesday, February 18th, 2020

Lou Burnard: Language as Data in the Humanities

Lou gave a great talk surveying how language is represented as data. His slides are here. He began the talk with Blake's image of Newton. By adding some metadata he changed our view of the image.

He then talked about how data is omnipresent but not omniscient. I think this is one of the gifts of the humanities, that we bring a skepticism to the evidence as given - the data handed over.

He talked about how much there is that we might want to mark up using XML. He challenged the idea that there could be a "pure" transcription. He argued that all transcription and markup is interpretation. In this view, which I agree with, markup is scholarship.

He argued about what a text is and how documents may be only bearers of the text which is an abstract notion in our head. This sounds like a Platonic theory of forms, and I'm not sure it solves the problem he wants it to.

He ended by arguing, along with Eco, that all is interpretation, but some interpretations are more generative and produce more new interpretations than others.

He concluded with "Text is not a special kind of data: data is a special kind of text". This seems a matter of semantics.

Geoffrey Rockwell: Ethics and Datafication

I then talked about the ethics of the Google Duplex demonstration Pichai delivered in May 2018. I wanted to unpack what was unethical or not in the Google demonstration of a bot that could call and, without identifying as a bot, book an appointment. The outline of my paper was to:

• Convince that the debates erupting about the ethics of different AI systems are relevant to the work we do with data in the social sciences and humanities.
• Second, look at some of the ethical problems raised.
• Third, ask what AI principles might apply to this case.
• And end by discussing how an ethics of care might apply.

Obviously I couldn't take notes on my own paper.

Thomas Stacker: The Digital Edition as a Data Resource

Stacker started by asking what a digital edition might be. He talked about how editing is linked to a critical process that involves a transformation. In his paper he focused on what it means to transform an existing text into a digital medium. Making a digital edition is a process of datafication.

Stacker made the interesting point that an edition is neither the XML code nor the interface, but a conjunction of functions over the code and output. (I probably didn't get this right.)

In the print era we knew what an edition was. The components were clear, but in the digital it isn't clear what the components are. In addition you have phenomena like transclusion. In the digital edition there are all sorts of views. There are functions like interfaces.

Stacker talked about the problems with digital editions. They are hard to quote as they change. There can be difficulties with authenticity. We don't have processes for long term archiving.

He then talked about how a digital edition can be judged. A digital edition cannot be judged by its interface. It is assembled from other sources. A digital edition is data, but not a digitized version of another edition.

We can form some hypotheses:

A DE is different from a printed one.
A DE can be used and quoted by an interface.

He asked an interesting question: Is a digital edition really an edition?

He talked about text being a sequence of codepoints and fonts/glyphs. Then we have the structural model built on top of that. Structural analysis shows us the entities and so on that get added to the plain text. Then he talked about interface and how with digital editions the interface is how they communicate with users.
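To make the lowest layer concrete, here is a minimal Python sketch of a text as a sequence of codepoints. This is my own illustration, not something Stacker showed. The sample word and the use of Unicode normalization (NFC and NFD) are assumptions chosen to show that two different codepoint sequences can render as the same glyphs.

```python
import unicodedata

def codepoints(text):
    """Return the Unicode code points of a string as 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Hypothetical sample word; any string with an accented letter would do.
word = "Universität"

composed = unicodedata.normalize("NFC", word)    # 'ä' stored as one code point
decomposed = unicodedata.normalize("NFD", word)  # 'a' plus a combining diaeresis

print(codepoints(composed))
print(codepoints(decomposed))

# The two strings normally look identical on screen but are different sequences.
print(composed == decomposed)
# Normalizing both to the same form makes them comparable again.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Anything built on top of the plain text, such as structural markup or an interface, inherits whichever codepoint sequence was stored, which is one small reason comparing two versions of an edition is harder than it looks.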

He asked about transferability of editions. How do we ensure the "sameness" over time and systems? This is especially true if the interface is flexible. It is also a matter of the sustainability of the edition.

Tony McEnery: Data and Discourse - Three Observations, Three Case Studies

Tony McEnery asked about investigation of the past through corpora. He likes to use corpora to go into the deep past. The resources are amazing now. The scale becoming available is changing dramatically. What of the advantages of getting all this data? One advantage is avoiding artificial surprise. Studying discourse about Muslims after 9/11 they found that people reached back into the past to recover ways of talking about Muslims.

Some of the key elements are:

Pattern discovery
Resource interpretation
Discourse analysis
Triangulation
Close reading is a key

He talked about how close reading is still important.

Language is not static. Words are not epistemologically stable. This means measuring over time can be misleading because the meaning of a word changes so it isn't really the same word. Too much research assumes language is stable. He covered a number of problems.

He gave an example of the Mahometan Berry panic. All sorts of people talked about how there was a panic around the Mahometan Berry (coffee) corrupting the English, but there was actually not really an issue. When you have a billion words you can test a theory about panic and it turns out to be false. He talked about building theories up from the ground.

Nancy Ide: Data and Annotation: The Impact of Big Data on Discourse Annotation

Ide started by pointing out how big data has changed computational linguistics. Now all they do is machine learning. What they need is annotated data for training. Most of what has been annotated is structure. She showed a great timeline of what was happening. Before 1990 it is mostly theoretical. After 1990 there are some larger amounts of data. After 2000 we get large annotated databanks.

After 2000 is the golden age of discourse annotation. People start by trying to adapt existing theories. Later people needed theory neutral annotation.

There are now lots of different theories. Most define hierarchical structures by constructing complex discourse units from elementary ones.

There is a lot of disagreement about what the relations are. Some argue for intention and some for information. She gave us a tour through theories and models.

The idea now is to try to get a data-driven, emergent theory of discourse that is free of subjectivity.

She concluded by saying that it is all fairly subjective. Machine learning trained with these annotated corpora will betray the theories that went into the annotation. That said, some concrete applications that use treebanks still work fairly well.

Evelyn Gius: Data and Annotation in Literary Studies: Some Methodological Remarks

Evelyn Gius wants to think about what it means for literary studies when we think about thinking about data. She had a neat image of the Christmas tree and the tree rendered as different substances. She then quoted the passage in Swift's Gulliver's Travels about the literary calculating machine of Lagado. Is that what

Evelyn theorized what it is that scholars do in literary studies. She broke it into input, analysis and output. She made the point that scholars think about more than texts. It isn't just analyzing text. They might be looking at context or the author or the reception. The question is how to bring things like context in? Does one add metadata to the text? It isn't as simple as ingesting texts and then outputting interpretations.

Analysis can focus on different aspects or features. It can include intra or extra textual aspects. It can bring together a given corpus with annotations for further analysis. The issue with annotations is getting intertagger agreement. To do that one needs guidelines and training and testing. Then one needs to see if you get any useful analysis using annotations. Sometimes you don't get what you wanted. For this reason there is an iterative annotation process.
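As an aside, one common way to put a number on agreement between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. Evelyn did not say which measure she uses, so the following Python sketch is only an assumption about how such a check might look. The tag names and the six labelled passages are invented.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Share of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical tags from two annotators for six passages.
annotator_1 = ["narration", "speech", "speech", "narration", "speech", "narration"]
annotator_2 = ["narration", "speech", "narration", "narration", "speech", "speech"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A low value would be a signal to go back to the guidelines and training, which is the iterative loop described above.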

Evelyn then discussed the distinction between explanation and understanding. She talked about how literary studies people are starting to critique algorithms. She then made a very interesting move when she argued that the ethical principles from the IEEE like transparency and accountability could serve as guides for literary interpretation.

She concluded with the minimal theory of computational literary studies:

Text concept
Theory of interpretation
Method of interpretation

Panel on Data, Discourse, Fieldwork

We ended the day with a panel:

Jens Steffek is the chair
Audrey Alejandro (Political Science, London, UK)
Nele Kortendiek (Political Science, Friedrichshafen, Germany)
Claudia Mitchell (Educational Research, Montreal, Canada)

Claudia started by talking about working in participatory visual culture. She works with young people in different countries whose voices are not making policy. The youth make images to talk about issues like gender-based violence that can communicate to others. Claudia is interested in the site of production and audiencing. She talked about what it would look like to have 18 year olds develop metadata for what matters to them.

Nele talked about a recent project she worked on where she went into the field to Kios to do interviews with people involved with search and rescue of migrants. She was also a volunteer. There weren't many texts and what was available was stylized reports. Going into the field you discover the reality has little to do with the public documents. She wanted to see how different actors cope with what is happening.

Audrey works on naturally occurring texts, but also on elicited texts. She thinks about the conditions of emergence of texts. She is also interested in developing reflexive methods so that we can learn the sites of production. She was very interesting on field work. She talked about how interviewing someone and taking notes can be violence or it could feel spiritual.

I was intrigued by the discussion of reflexivity and how one needs to be honest about the perspective one brings to an issue.

I asked about silences and deletions which elicited fascinating answers describing different reasons or situations where the panelists have not reported something. Claudia admitted that the ethics are changing, especially around images, and there are things she now regrets publishing.

I'm reminded of the field of "ignorance studies" that asks about what is not known or deliberately left unknown. Don't all fields choose not to deal with things?

Wednesday, Feb. 19, 2020

Moacir P. de Sá Pereira: Data Visualisation in Literature Studies

Moacir started with the lovely visualization in Tristram Shandy where the narrator, Shandy, shows how his story digresses. He pointed out that in two editions of Shklovsky's book the visualization was printed differently. The first time it was reproduced upside down. Does it matter? Moacir thinks not.

He then asked seriously what the visualization means. He had hoped that he could reproduce these from the text, but the text is complex. He showed a visualization that draws the first four volumes through a random walk. It has a control that can change the randomness. This random walk doesn't reproduce Shandy. Shandy's lines are expressive. What does it mean for Shklovsky to say the lines are more or less accurate?

He then showed a later edition of Tristram Shandy which has a different visualization. The lines now are not continuous.

Moacir showed a neat space-time cube from Kraak's The Space-Time Cube Revisited. He then talked about Matt Jockers's plot trajectory graphs that show up in The Bestseller Code and moved to Johanna Drucker's Graphesis. Moacir was critical of Jockers's graphs to the effect that they don't generate any new knowledge. They simply confirm what we know.

He then asked, How do we do data in literature? The mere capture of data is a problem. He showed Voyant's Dreamscape and how the named entity recognition finds all sorts of locations in Shakespeare that are in North America. I'm not sure he read the documentation where we talk about how the data shouldn't be trusted and how one can disambiguate the locations.

He then showed his own geospatial representations of a novel. He had an interesting one that showed the "mean center" at any one point in the text. By mean center I think he meant a center of attention calculated by taking the named locations (and their location in the text), the location of attention in reading, and working out a geographic mean between the locations. This showed the attention of the book moving between Ireland and North America. He had two animations, one in which the plot moves and one in which the text moves. He is working with the difference between plot and fabula. If I understand the distinction, there is a visualization line for the order of what happens in the text and another one for what happens in the time of the novel.
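Here is a rough Python sketch of how I imagine such a mean center could be computed. It is my guess, not Moacir's code. The window size, the linear weighting by distance in the text, the function names, and the sample coordinates are all assumptions. Places are averaged as 3D unit vectors so that the mean behaves sensibly on a globe.

```python
import math

def mean_center(points):
    """Geographic mean of (lat, lon, weight) triples, averaged as 3D unit vectors."""
    x = y = z = 0.0
    for lat, lon, weight in points:
        phi, lam = math.radians(lat), math.radians(lon)
        x += weight * math.cos(phi) * math.cos(lam)
        y += weight * math.cos(phi) * math.sin(lam)
        z += weight * math.sin(phi)
    if x == 0.0 and y == 0.0 and z == 0.0:
        raise ValueError("no weighted points to average")
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon

def center_at(mentions, position, window=500):
    """Mean center of places mentioned within `window` tokens of `position`.

    Mentions closer to the reading position get a larger weight (linear decay).
    Returns None when no place is mentioned inside the window.
    """
    weighted = []
    for offset, lat, lon in mentions:
        distance = abs(offset - position)
        if distance < window:
            weighted.append((lat, lon, 1 - distance / window))
    return mean_center(weighted) if weighted else None

# Hypothetical mentions: (token offset, latitude, longitude).
mentions = [
    (100, 53.35, -6.26),    # Dublin
    (300, 53.35, -6.26),    # Dublin
    (1200, 40.71, -74.01),  # New York
    (1400, 40.71, -74.01),  # New York
]

for position in (200, 750, 1300):
    print(position, center_at(mentions, position))
```

Stepping `position` through the text gives one track. Ordering the mentions by story time instead of by text offset would give the other, if I have the plot and fabula distinction right.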

He ran out of time, but ended with some intriguing "mushroom at the end of the talk":

Overwhelmed by the single aesthetic encounter
Making do with non-hierarchical scale
Affecting and affected by the object, situated
Noticing the worlds-in-the-making, grasping at moments

I must find out what this list of suggestions means.

His talk is here.

Noah Bubenhofer: Visual Linguistics: Fundamentals of Visualizing Language Data

Noah started by joking about the difference between an armchair linguist and a corpus linguist. A data-driven corpus linguist has lots of data and calculates patterns rather than collecting pre-determined features. He showed Ramelli's reading machine and then images of cards and card indexes - technologies used in lexicography. He talked about Busa and the Index Thomisticus. He jumped to the Mother of All Demos by Douglas Engelbart and showed some of it where Engelbart manipulates lists and then maps them. Noah summarized what we do as working on statements and doing operations on them and then constructing different views. Computers allow us to split, sort, decontextualize, recontextualize, and link data.

Corpus linguistics uses diagrams (Peirce) to get new knowledge out of data. He started with Key Word in Context (KWIC) displays. The idea is to destroy the entity of the text and partition it into *loci* (matches) that can then be manipulated. I'm reminded of Jack Goody's "What's in a list?" and how he sees the list as the first form of visualization.
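For readers who haven't seen one, a KWIC can be sketched in a few lines of Python. This is a minimal illustration with an invented sample text and a deliberately naive tokenizer, not the tool Noah used.

```python
import re

def kwic(text, keyword, width=4):
    """Return (left context, match, right context) for each hit of `keyword`."""
    # Naive tokenization: lowercase word characters only, punctuation dropped.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text, invented for illustration.
sample = (
    "Children are at risk when the river floods. "
    "The risk of flooding rises every year. "
    "Jobs are also at risk in the valley."
)

hits = kwic(sample, "risk", width=3)
for left, match, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>25} | {match} | {right}")
```

Once the text has been broken into these rows, they can be sorted by left or right context, which is where the list starts to behave like a visualization.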

Noah then talked about Plato's Meno and the way Socrates leads Meno's slave through a diagram to recapitulate a mathematical proof. A cool example of imagined visualization being used to understand something.

Diagrams can do different things:

Present results
Explore data
Show theoretical concepts

He talked about the different genres of visualizations in linguistics and showed examples.

Lists
Maps
Graphs
Partitura (comes from music and is used in linguistics to see the timing of utterances and gestures)

He then showed some of his projects starting with a corpus of thousands of birth reports. He showed various insights into this data. He showed a very interesting visualization of informal dialogue inspired by tree rings. You see who speaks the most.

He ended on visualization and thought styles. He had a neat collage of visualizations from a journal - mostly a lot of trees. There are traditions in the discipline of how things get visualized. What effect do traditions of visualization have on the interpretation of

He talked about Ludwik Fleck who talked about different instruments in thought styles. The diagram is a form of instrument for linguistics. If these diagrams are instruments then the code/tools are themselves bearing knowledge. We need to look at the coding cultures. I'm reminded of Baird's Thing Knowledge.

He closed with a playful poem generator called Geottherina. It is built on a large corpus and has a face-like interface.

He made a point that I agree with, but which has to be problematized, and that was that we should know a bit about coding to understand how we get the visualizations. The general idea is that you should know how the sausage is made if you are going to eat it. The problem is that there is no limit to what you should know. You probably should know how the operating system works, and how the computer works, and how the chip was designed and the physics of silicon.

His talk is here.

Marina Bondi: Linguistic Data and Domain Specific Language - Or: How specific is specific

Marina started by talking about the distinction between domain-specific and general-purpose language studies. There is no clearcut distinction, but in certain communicative situations people use general language and specific language.

She then talked about what "domain" means. It can refer to an area of knowledge or an area of life. An area is often talked about spatially. She talked about cartography and then "textography" where you take a space and study the language use in there. Register is the specific lexical and grammatical choices as made by the speakers in a specific situation (to paraphrase Halliday).

She talked about how specialized corpora have driven the field for decades. Specialized corpora link the corpus and contexts in which the text was produced.

She talked about the centrality of variation. Variation defines domain specific work, but variation is also one of the interests in domain-specific studies. People study a domain to talk about the variation of the domain which raises the question of how much variation is within the domain. This led to comparability and how we are always comparing apples and pears. She talked about comparing discourses on corporate social responsibility across English and Italian.

She concluded by reflecting on the research process. Research is an interaction between the analyst and different types of data and methodological tools. She goes back and forth between hypotheses and data.

Paul Baker: Linguistic Data and Social Groups

Paul started by asking what a social group is. One can use self-identification, shared identity characteristics, or inter-dependence. He talked about how sociolinguistics (which deals with social groups) and corpus linguistics don't have much to do with each other. Sociolinguists are much more interested in non-standard varieties of language while corpus linguists are trying to get "conventional" language.

Why analyze the language of social groups? Researchers will justify social group research as a way of benefiting a group that might be stigmatized. He then talked about some examples. If we get well designed "small" social corpora we should be able to do comparative studies.

He talked about a "cohort effect" where there are language effects related to a generation and the circumstances of their growing up. With corpora from different years one can look at cohorts and look at how things change over time. He showed how different generations use the word "may." He talked about a corpus of women's adverts and how you can compare women seeking men vs women seeking women. When you do these studies you often find obvious things, but also some non-obvious things. For example he looked at "me" and "we". Women seeking men often used "me" and positioned themselves as the object - "Looking for man who will make me feel like a woman ..." The women seeking women are more likely to use "we" and talk about a collaboration - "we can motivate each other..."

He then talked about how age is one of the forms of difference most often exploited in talk. He led a study that grabbed lots of health communication and searched for age patterns. Why do people mention their age or gender?

He ended with a caution about the tendency to look for difference and to overlook similarity. He showed a Venn diagram with all the words that men and women share. This shows how similar they are. You can also have one or two people in a corpus that skew results for a group. For that reason he always uses the concordance to check.

He concluded with some recommendations along the lines of thinking intersectionally. To just compare simple categories (men and women) misses the complexity of the social grouping.

Felicitas Macgilchrist: Data and Education

Felicitas started with the observation that we live in an age of datafication. She gave us some examples of the gathering of educational data, like that of a company, Learning Economy, that consolidates information about students using blockchain.

She is interested in discourse about education, discourse in education, and discourse encoded into education.

When datafication of education is talked about it is about how there are large-scale assessments like PISA or about the use of data analytics in teaching tools. There are new forms of marketization and psychologization of education with new actors like IBM. There is a lot of talk about the promise of data analytics. These discourses are often connected to the language of efficiency. There are these networks of "policy experts" being brought into education.

Discourse around data in education is changing as there are more and more targets and teachers are talking about what the data (that they get evaluated on) means. Teachers focus on the students close to making the bar that has been set (the D+ students as opposed to the A students.) School practices are changed by data targets and software (dashboard). The borders of the school become more porous.

Felicitas is looking at predictive analytics in a project called Datafied. Usually attendance, behaviour, course performance data is used. Increasingly learning systems data is woven in.

Entextualization is a process that makes a bit of text extractable so it can be used as a unit. She told a story that tied all sorts of things together starting with a PR text about Sudha Ram at the U of Arizona using big data and student data to help identify at risk students. There is a logic of risk assessment - there isn't anything about success. There are ethics issues around the amount of data gathered on the students.

Then she talked about the D2L Brightspace dashboard. It is supposed to be the future of education. Again there is risk assessment. There is anonymization so views can be shown. There is a social network graph that identifies the social aspects of a student. Social learning becomes part of the predictive model.

She talked about how messy the data is and yet the dashboards in the schools acquire an authoritative status and the messiness is forgotten.

She then talked about ethical issues. "If we know a student is going to fail, is it ethical that we accept him in the first place." This is from an interview. This kind of statement reinforces.

The president of St. Mary University wanted to use predictive analytics to "weed out students unlikely to be retained." Predictions are embedded into rankings and economic logic. She compared this to a best practice where students get discussed and helped. She talked about what happens when a specialist is in tension with the prediction from the system. Will people back off and trust the system by default?

She closed on how code should be studied, quoting Marino on code studies. I'm not sure the code will generate what she wants, but we certainly need people who can reverse engineer and study software. That should not prevent us from doing what she did with manuals, documentation, interviews, and interfaces, which may be more useful.

Jens Zinn: Contested Knowledge - Contested Meaning: The Mutual Constitution of Data and Risk

Jens started by asking how data about language connect to other aspects/data of the world. What forces change language?

He talked about the defining of risk. Various sociologists and philosophers of science like Ian Hacking have developed thinking about risk and studied the history of how risk is dealt with. The idea was that with rationalization we began to manage issues and risk calculation was part of that.

In the 1970s risk became pervasive in the UK and the global north. Why is that? One idea is that there are now catastrophic risks like nuclear war or the environment.

Jens uses newspapers to study the discourse, specifically all the articles in the London Times. He looks at words that collocate. Phrases like "at risk" are interesting. That phrase takes off in the 1960s. He talked about the usefulness of using one word, "risk", rather than a network of words. Threat calls up different content.
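A bare-bones version of counting collocates around a single node word might look like the Python sketch below. It is not Jens's method. It uses raw counts in a fixed window, ignores sentence boundaries, and computes no association statistics, and the three headlines are invented.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens of each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            # Words to the left and right of the node word, excluding the node itself.
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical headlines, invented for illustration.
headlines = (
    "Children at risk from flooding. "
    "Jobs at risk as factory closes. "
    "Flooding puts homes at risk."
)

counts = collocates(headlines, "risk", window=2)
for word, n in counts.most_common(5):
    print(word, n)
```

Running the same count on slices of the corpus by decade would be one simple way to see what is described as being "at risk" at different times.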

He showed how the language of what was at risk changed over time. Battering was one word and he showed how concern about the battering of children became an issue. Flooding was another.

Jens talked about the way not-for-profits that focus on certain issues start campaigning to promote different risks that need to be addressed. "At risk" conveys the urgency of the issue.

He then showed a table of the words of what is vulnerable (at risk) like children and jobs. Women and children are at risk, but men are not.

Day 3: Thursday, Feb. 20, 2020

John Flowerdew: The Role of Data in the Analysis of Academic Discourse

He started by talking about research on academic discourse starting with Bourdieu's Academic Discourse. I wonder how this was tied to Homo Academicus. Bourdieu would test ethnographic materials statistically and that would feed back into ethnography.

Then John talked about a book he edited on Academic Discourse. There is an applied goal in the study of academic English in that this has to be taught to people who want to be academics in fields where English is the lingua franca. Finally he talked about Hyland.

John then argued that studies of academic discourse can focus more on the text or more on the context. "Scientific discourse is traditionally seen as impersonal transmission of knowledge." Now we focus more on the persuasive rhetoric of academic discourse.

He talked about signalling nouns like case, evaluation, procedure. These nouns signal to something else in the sentence that realizes the noun. E.g. "The idea that god is dead appeals." The "idea" is the signalling noun, and "god is dead" is the realization. These nouns also act as labels to that which realizes the noun. Could we say that they act as metadata?

He talked about different methodologies. He moved on to multi-perspectival approaches which combine methodologies and theoretical perspectives.

There is an imperative to publish in English. This is due to massification of education, globalization of scholarship, national prestige, university league tables, career development. Most scholars trying to write in English don't have English as their first language. This has led to a demand for research and learning on how to publish academic English. Flowerdew worked with a number of academics in Hong Kong and got drafts, final texts, feedback from editors/reviewers, and so on. He also interviewed people and tried other methods.

Flowerdew talked about a project that was published in 2000 on "Discourse community, legitimate peripheral participation, and the non-native English speaker..." He used different data sources, different methods, different theories and different perspectives.

He then talked about data-driven learning where the learner works directly with data and the teacher provides strategies for the learner to learn. Students interact with a corpus. The student becomes a language detective. Flowerdew introduced this approach to PhD students in Hong Kong.

He concluded by talking about how in a grounded-theory approach you don't start with a theory but you let the data help you develop a theory.

Wolf Schünemann: Data and Discourse around Europe

Wolf talked about data and computational methods in political science. He talked about Chris Anderson's "The End of Theory" article in Wired. The challenge is whether we need theory or can we just generate theory from data. The social sciences are being datafied as large datasets become available where before there had to be expensive polling/surveys.

Can computational methods help us understand the relations between actors and how power and influence are deployed? Does social network theory or discourse network analysis work? There are, of course, a lot of problems with the data-driven enthusiasm. He framed the important question for computational social sciences:

How can we answer meaningful questions in explainable ways?

He then shifted to concrete questions starting with EU politics and the question of whether European politics play a part or if national politics are more important. Wolf is particularly interested in referendum issues. He uses Twitter data and talked about the biases of Twitter. He feels it is a good measurement of transnational communication. He also collected texts around Yes and No sides on EU referenda. He works with theories about argumentation - a political argument will ask people to vote X and provide arguments and subarguments.

One of the techniques he applied was topic modelling which to him felt like getting a Borges list of animals. That said, he felt he was seeing something. He is now trying structural topic modelling. See https://www.structuraltopicmodel.com/ .

Closing Panel

Our host Dr. Marcus Müller closed the conference by asking questions of all of us.

What is the minimal curation of data that we need to do? I argued that we need to add metadata. I also argued that we need to be careful about trying to document too much, because then we won't get anything done. We should archive what we can document. We also have to think about the ethics and probably should have an ethics statement.
How will big data change discourse analysis? We discussed whether we should be cleaning big data. Big data also raises issues around reproducibility of results. One feature of big data is that it will almost always be exhaust that doesn't come with consent. We therefore need to think about the ethics. Of course, we need to learn statistics and visualization.
What standards of interpretation should there be? I would say that it is all interpretation. The scale of the data doesn't change that. I should add that a number of speakers talked about data-driven interpretation. The interpretation should be driven by the data or at least the phenomena, not the other way around. Someone argued for transparency, consistency, reflexivity. Evelyn Gius mentioned the iterative hermeneutic circle. Audrey Alejandro wondered whether we can automate interpretation.

I believe there is a tension between computer assisted interpretation that builds from the text, which is typical of the humanities, and approaches that diagnose the text (as in sentiment analysis).

I rather liked the format of questions we could all answer.

Data In Discourse Analysis