American Association for Corpus Linguistics

I was invited to participate in the AACL conference hosted by the University of Linguistics.

Note: This is being written on the fly, so my notes may have typos and certainly will not represent the richness of the conference.

Brian MacWhinney - "TalkBank - Reintegrating the disciplines"

The first plenary was given by Brian MacWhinney, who talked about the TalkBank resource. He described different types of integration:

Greek Integration - Aristotle and Plato
Renaissance - Da Vinci, Descartes, Bacon and Leibniz
Modern Integration - Systems Theory, AI, Emergentism, Bayes - A central role for data-driven mechanisms.

The amount of data challenges us to integrate. We can integrate data and approaches to data.

He talked about Principles and Goals:

Data-sharing - we have to share our data
Multimedia - we need to add multimedia
Open access - where possible data should be open
Interoperability - data should be capable of being used in different tools
Integration of disciplines

Core ideas:

Human communication is a single unified process that is looked at by 20 or more disciplines. Different approaches have different time scales.
All the processes have their effect in The Moment, which we can capture on video.

They are trying to integrate projects to build a dream database that can be useful to all sorts of disciplines. See ComNet proposal at http://talkback.org/dreams

TalkBank is a web-accessible multimedia database. There is also a list of tools.

Some of the other large databases are:

Human Genome project
Sloan Digital Sky Survey
Alzheimer's Neuroimaging
fMRI Data Centre
The Human Connectome - a map of all the white matter activity in the brain

Big Science has been proposing different large-scale databases. Can we do it with the messy data of the humanities and social sciences?

Mark Davies - "Using robust corpora to examine genre-based variation and recent historical shifts in English"

This talk wasn't part of the conference, but was a subset given for the Humanities Computing Research Colloquium.

Mark, from BYU, has created a number of corpora in English, Spanish and Portuguese. See http://corpus.byu.edu/ . For example, he has a corpus of Time magazine articles.

He talked about COCA (http://www.americancorpus.org), a balanced monitor corpus of English. It is balanced across 5 genres and across years, and it is designed for looking at recent changes in the language. How can the corpus show evidence of change without coughing up thousands of items?

A monitor corpus has to be big. It has to be recently updated (COCA is updated yearly). It has to have an architecture that allows comparisons across genres and time.
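A minimal sketch of what comparing across genres and time can look like in practice: raw hit counts are normalised to occurrences per million words, so that slices of different sizes can be compared. The counts below are invented toy numbers, not COCA data, and the function names are hypothetical.

```python
def per_million(hits, slice_size):
    """Normalise a raw hit count to occurrences per million words."""
    return hits * 1_000_000 / slice_size


# Hypothetical toy numbers, invented for illustration only (not COCA data).
# Keys are (genre, period); values are (hits for some word, words in the slice).
slices = {
    ("spoken", "1990-1994"): (120, 2_000_000),
    ("spoken", "2005-2009"): (300, 2_500_000),
    ("academic", "1990-1994"): (40, 2_000_000),
    ("academic", "2005-2009"): (45, 1_500_000),
}


def frequency_table(slices):
    """Return normalised frequencies keyed by (genre, period)."""
    return {key: per_million(hits, size) for key, (hits, size) in slices.items()}


table = frequency_table(slices)
for (genre, period), freq in sorted(table.items()):
    print(f"{genre:10s} {period}  {freq:8.1f} per million")
```

Because each slice is normalised by its own size, a rise in one genre over time can be read directly off the table without paging through the individual hits.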

He made an interesting point about the problem with XML and TEI in implementation, to the effect that XML doesn't scale for very large corpora. You have to use databases to get adequately fast access to large corpora. I asked him if XML might have a place for archiving and interchange of data, and he agreed.
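To illustrate the general idea (not Davies's actual architecture, which these notes don't describe), here is a small sketch that loads tokens into a relational table with an index on the word column, so that a lookup doesn't have to scan every document. The table layout, function names and sample texts are all assumptions made for the example; it uses Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3


def build_corpus_db(docs):
    """Load (genre, year, text) triples into an in-memory SQLite token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (word TEXT, genre TEXT, year INTEGER)")
    for genre, year, text in docs:
        conn.executemany(
            "INSERT INTO tokens VALUES (?, ?, ?)",
            [(w.lower(), genre, year) for w in text.split()],
        )
    # The index is what makes word lookups fast as the table grows.
    conn.execute("CREATE INDEX idx_tokens_word ON tokens (word)")
    conn.commit()
    return conn


def count_by_genre(conn, word):
    """Count occurrences of a word, grouped by genre."""
    rows = conn.execute(
        "SELECT genre, COUNT(*) FROM tokens WHERE word = ? "
        "GROUP BY genre ORDER BY genre",
        (word.lower(),),
    )
    return dict(rows.fetchall())


# Hypothetical sample texts, invented for illustration.
docs = [
    ("spoken", 2008, "well the corpus is big"),
    ("academic", 2008, "the corpus the method the data"),
]
conn = build_corpus_db(docs)
counts = count_by_genre(conn, "the")
print(counts)
```

The same data could still be archived and exchanged as XML, with the database built from it as a derived index for querying.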

Stefan Gries - "Corpus Linguistics and Theoretical Linguistics: A Love-Hate Relationship?"

What is corpus linguistics? Some say it is a theory or philosophical approach. For some it is a methodology with theoretical implications.

The corpus linguists see it as a way of building a theory.

Gries thinks it is a methodological paradigm named after its major research tool and data source. What would it mean to have a truly bottom-up corpus-driven approach? Many projects start with words, which are a pre-theoretical commitment. How would the data truly drive method? The corpus-based and corpus-driven approaches are not really that different because there isn't really anything like corpus-driven.

Gries feels validation of methods needs to be done. We have lots of measures of collocation, but little comparison and validation of these.
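A small sketch of why such comparison matters: two common collocation measures, pointwise mutual information (PMI) and the t-score, can rank the same word pairs in opposite orders. The talk notes don't say which measures Gries had in mind, so this choice is an assumption, and the counts below are invented toy numbers.

```python
import math


def expected(f1, f2, n):
    """Expected co-occurrence count if the two words were independent."""
    return f1 * f2 / n


def pmi(observed, f1, f2, n):
    """Pointwise mutual information: log2 of observed over expected."""
    return math.log2(observed / expected(f1, f2, n))


def t_score(observed, f1, f2, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (observed - expected(f1, f2, n)) / math.sqrt(observed)


N = 1_000_000  # hypothetical corpus size in tokens

# Hypothetical toy counts: (co-occurrences, freq of word 1, freq of word 2).
pairs = {
    "rare pair": (5, 10, 10),
    "frequent pair": (400, 50_000, 2_000),
}

by_pmi = sorted(pairs, key=lambda p: pmi(*pairs[p], N), reverse=True)
by_t = sorted(pairs, key=lambda p: t_score(*pairs[p], N), reverse=True)

print("ranked by PMI:    ", by_pmi)
print("ranked by t-score:", by_t)
```

With these numbers PMI puts the rare pair first while the t-score puts the frequent pair first, which is the kind of disagreement that makes validation of the measures worth doing.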

He feels that we have to turn to cognitive science to help explain how people process information. He wants to bring together the humanistic corpus-based approaches and the cognitive approaches. This went by too fast, but he seemed to be assuming that we cognitively process discourse. When corpus linguists talk about patterns, they are describing something very similar to what happens cognitively. Corpus linguistics and psycholinguistics should work together and test their results against each other. I wonder if there is a similar correlation between social linguistics and distributed cognition.

The model he likes is an exemplar-based approach. He suggests that we have an n-dimensional space in our mind where we store all examples of language traces, updating it with each new input. We don't remember everything, but forget things and retain some sort of statistical space that guides language production and against which we compare new language we hear. Speakers and listeners store large amounts of probabilistic information, which they refine over time and use. I'm not sure I got this part right.

He showed a multi-dimensional cloud space, but what does it mean that we have such spaces in our mind? Do we have such clouds in our mind? The suggestion is that the mind has a rich concordance that it mines for patterns.

He closed by quoting Halliday to the effect that corpus linguistics is a highly theoretical pursuit.

His site is at http://tinyurl.com/stgries