
Digging Into Data Challenge 2011

I'm at the Digging Into Data Challenge conference that is bringing together the investigators of the first round of the Challenge. I was the Canadian lead on the Datamining with Criminal Intent project. What follows are my conference notes. They are therefore incomplete and rough. You can also follow the twitter feed by searching for #DiD11 or look at Jen Howard's notes at . A group photo of the Canadian respondents, grant council folk and investigators is at .

Before the conference proper started we had a meeting with CLIR who are evaluating the programme. Some of the points that resonated with me are:

  • Gender representation is an issue. In the Challenge, and in the digital humanities in general, we need to work harder to involve women researchers, especially as leaders. We run the risk of DH being seen as the last bastion of old men in the humanities.
  • Representation by new scholars is also an issue. The Challenge should bring together graduate students and new faculty; they need to be encouraged to meet up, and they need the validation of attention from the research councils.
  • Supporting international research. One of the innovations of Digging is that it has one review process that crossed national boundaries. If your project was approved all the national partners got funded. We should see this model generalized beyond the digital humanities.
  • Encouraging research mashups. Another benefit of Digging is that it encouraged established projects to interoperate. The project I'm on (Datamining with Criminal Intent) built interoperability between the Old Bailey project, Zotero and Voyeur.

The conversations circled around some of the cultural challenges of the digital humanities and how Digging could (or should) make a difference. We discussed issues of credit, team work, and the relationship between projects like those being supported by Digging and the "traditional" humanities. Is big data transformative or the return of old approaches quantified and on another scale?

Brett Bobley: Introduction

Brett started the event talking about some of the questions and challenges that motivated the Digging Into Data Challenge.

  1. Our ability to digitize materials has outstripped our methods for analyzing them.
  2. Simply having data for reading is not enough - we need computational access to run new methods.
  3. What can funders do to encourage the development of new methods on large data sets?

Not having a lot of funding, Brett and others developed the idea of making it a contest or challenge. The competition was so popular that for the second round there are now 8 agencies supporting the Challenge.

Railroads and the Making of Modern America

Richard Healey started the project presentations on the Railroads and the Making of Modern America project.

Why railroads? They were an early global process that transformed the landscape and economy of the US. The project is studying the effects of new and global infrastructure on people, the environment and economic life.

An important part of the project was data integration, standardization, and quality. They had to structure data to fit it together. They then developed a series of online case studies that draw on a data warehouse. Richard and others focused on many of the data difficulties faced. This is a problem in the humanities where each dataset is unique and hard to merge with others. On the other hand the data is robust and meaningful. This is one of the challenges of large scale digital humanities - we have deep data that is about real people, but it is too complex for the simple mining tools from other fields.

The Aurora Project is the underlying software infrastructure they built for this project. The idea is that "apps" can be built on top of Aurora for particular research questions. Aurora seems to be a sort of scholarly middleware.

Will Thomas closed this presentation talking about how he is using the tool to help him read newspapers differently. He stressed how we can now visualize data spatially. Will compared his project to one from the 70s/80s that focused on railroads in Vermont. That project was the work of a solitary scholar and it left behind a PDF of a table of data. The Railroads project is a team project that has far more data and it is presenting it back in multiple ways. Will talked about having students learn through enriching the data. Will argued that we now face a social change in disciplines like history of recognizing team work. Historians until recently used space for illustration and they used the power of narrative to bridge gaps. Now the interactive map is reshaping scholarly practice.

Respondent: Peter Baskerville

Peter Baskerville was the respondent on the Railroads project. He pointed out how hard it is to review changing targets like scholarly datasets. He commended the authors for their frankness on the tedious work of structuring databases. We may be overwhelmed by an abundance of data, but we have to avoid being drowned. We have to find ways to integrate and crosswalk data. The assumption of abundance might close down interesting research that adds to the stockpile of abundant data.

Peter talked about the mistake of calling for a revolution. Traditional humanities work like data collection, careful coding and interpretation is still needed. Big data needs both human and automated handling. It is not a revolution, but leads to a cyborg of human and automated method.

Does this represent a change in the form of history? Information visualization is a hot topic, but what does it show? Peter is critical of the rhetoric of the digital humanities, especially around visualization. What does it include and exclude? Why the focus on visualization? Peter is interested in understanding and causation. Visualization is not understanding, though it might lead to it. Understanding is more than seeing patterns and shapes. Causation can't be seen, only inferred. (I would counter that causation can be represented in text or visualization.) Regression analysis is rarely discussed in the digital humanities even though it is a way of testing causal inferences.

Harvesting Speech Datasets for Linguistic Research on the Web

Mats Rooth started the Harvesting presentation, which takes advantage of the fact that you can search the internet for strings and get sites that have audio. This allows them to harvest audio for linguistic research. There are lots of media sites that will give you the time offset for an audio passage and let you download the audio. This lets them harvest hundreds or thousands of tokens of a word sequence. He gave as an example audio clips of people saying "than I did".
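The harvesting step can be sketched roughly as follows. This is an illustrative toy, not the project's pipeline: I'm assuming transcripts arrive as word/time-offset pairs (as some media sites expose them) and the goal is to cut out clips around each occurrence of a target phrase.

```python
# Hypothetical sketch: given a word-level transcript with time offsets,
# find clip boundaries for every occurrence of a target phrase.

def find_phrase_clips(transcript, phrase, pad=0.25):
    """transcript: ordered list of (word, start_seconds) tuples.
    Returns padded (start, end) offsets for each occurrence of phrase."""
    words = [w.lower() for w, _ in transcript]
    target = phrase.lower().split()
    n = len(target)
    clips = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            start = transcript[i][1]
            # End at the start of the word after the phrase, or pad past the last word.
            if i + n < len(transcript):
                end = transcript[i + n][1]
            else:
                end = transcript[i + n - 1][1] + pad
            clips.append((max(0.0, start - pad), end + pad))
    return clips

transcript = [("she", 0.0), ("sang", 0.4), ("better", 0.8),
              ("than", 1.3), ("I", 1.5), ("did", 1.6), ("yesterday", 2.0)]
print(find_phrase_clips(transcript, "than I did"))  # → [(1.05, 2.25)]
```

The real work, as Mats noted, is in finding the sites, downloading the audio, and hand checking the results.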

Linguistics has been going from armchair experiments (sit in an armchair and ask yourself how you would say it) to lab experiments (record others saying it) to large web experiments (harvest lots of examples of others saying it.) The web harvesting, however, takes hand checking by an RA.

There is a machine learning component to this project. Jonathan Hull talked about the classification experiment. Too complex to describe here (I would have to listen again carefully), but fascinating.

Michael Wagner talked about trying to validate the outcomes from the study of harvested clips. You can use labs, but language behaviour in the lab is sometimes different from that generated spontaneously. Harvested (and spontaneous) experiments and lab experiments can complement each other.

Respondent: Jennifer Cole

Jennifer was the respondent. She asked about the value of speech (audio) datasets (over all the text available.) Prosody research is one case of where you need audio. She argued that the Harvesting project has reduced the costs for researchers who need access to such audio datasets, though it has problems.

Digging into the Enlightenment

Dan Edelstein and Chris Weaver presented on their project visualizing enlightenment correspondence. It is one of the Mapping the Republic of Letters projects.

Chris Weaver talked about visualization. He argued that viz is not just a static artefact - it is a verb too. It is a process. He also argued that a visualization is not a representation. He recommended the book Illuminating the Path which talks about visual analytics which he feels is a methodology that crosses disciplines.

Chris asked why you can't manipulate data in a visualization. A visualization is essentially a browser, so why can't we annotate, manipulate, and cycle stuff back into it? All data is annotation. So Chris has come up with "ampliation", which combines annotation and interpretation. It means to enlarge and extend.

He showed some demos of very cool visual tools that tended to combine multiple panels that interact much like Voyeur does, but customized for the data. See his Improvise site for the tools and applications.

Then they presented a visualization design that was developed by a design student from Milan. This allows one to layer filters into a query that controls a visualization.

Respondent: Stephen Nichols

Stephen addressed the issue of how we deal with correspondence projects. Letters can be trivial or important. One needs to "ampliate" the data with context and interpretation.

Day 2

Alastair Dunning, JISC

Alastair talked about what is happening in the UK and what JISC is supporting. He started with studies about utilization, beginning with the work from UCL on log analysis (the LAIRAH project) that showed disappointing use of online resources. This 2006 study was followed by Splashes and Ripples (2011) which showed significant improvements.

Structural Analysis of Large Amounts of Musical Information (SALAMI)

Ichiro Fujinaga started by talking about the SALAMI project, which gathered "ground truth" data about what educated listeners think is the structure of a musical work. This is to test music recognition algorithms. They double-key annotated 1000 songs of various genres. They used Sonic Visualizer (an open source tool) for the annotation.

Ichiro showed a screencast of a PhD student doing the annotating. It was impressive to see the student parsing a rock song.

David De Roure talked about the structured analysis that they ran on the deluge of data available. Using the student-sourced ground truth they could create a linked data repository to support scholars in a sustainable way. The idea is to share standardized linked data so that it can be mashed up with other data. Their philosophy is that the web is a content management system and their website is an API. To do this they developed an ontology of music segmentation.

David talked about the interesting relationship with the broader community that is interested in music. It strikes me that the musical community rivals the genealogists in their engagement in citizen research.

Steve Downie talked about MIREX - the music information retrieval evaluation exchange. They agree on challenges and then compete to generate the best algorithms. The algorithms are then run on a large music database to compare them. Thus Steve's team was able to compare segmentation algorithms against the ground truth. He showed a visualization/sonification that compares the ground truth segmentation to the different algorithms.
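Comparing an algorithm's segmentation to ground truth typically comes down to boundary matching. The sketch below is my own illustration of the general idea (not the MIREX code): a predicted boundary counts as a hit if it falls within a tolerance window of a not-yet-matched true boundary, and precision/recall/F are computed from the hits.

```python
# Score predicted section boundaries (in seconds) against human ground truth.
# MIREX-style evaluations use tolerance windows of around half a second.

def boundary_f_measure(truth, predicted, tolerance=0.5):
    matched = set()   # indices of truth boundaries already claimed
    hits = 0
    for p in predicted:
        for i, t in enumerate(truth):
            if i not in matched and abs(p - t) <= tolerance:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

truth = [0.0, 12.3, 45.1, 80.0]      # annotator's boundaries
predicted = [0.2, 13.0, 44.8, 60.0]  # algorithm's boundaries (13.0 is a miss at 0.7 s off)
p, r, f = boundary_f_measure(truth, predicted)
```

Here two of four predictions land within tolerance, giving precision, recall and F of 0.5.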

Respondent: David Huron

David talked about what has happened in Genetics. They have repeatedly hired people from very different backgrounds in order to tackle the key questions. They have hired computer scientists to bring new thinking into the field.

The humanities have lost several disciplines like psychology and linguistics to science. The questions in linguistics have stayed the same, but the methods have changed.

Disciplines should be defined by questions, not practices. If we define the humanities as a set of close reading and interpretative practices then it will never change. If we think of it as the disciplines addressing questions about the human and human expression then we should be willing to adopt new methods and hire people out of new areas.

David went on to talk about how tools need audiences. Good tools don't succeed on their own. You need to build an audience. Focus on the questions and discovery not technology. Email in the late 1970s was useless because few others had it. Now it is useful because everyone you want to correspond with has it. (Of course the spammers have also found a way to make it problematic again.)

In the arts and humanities we supposedly put a premium on community and human interaction and yet, paradoxically, we don't do it. David argued for collaborative practices. Central to collaboration is confessing ignorance. In the humanities we don't dare confess not knowing anything. We overvalue the pedantry of knowing everything (or pretending). We socialize our students to mask their ignorance.

David then talked of the danger of exploratory tools. He gave the example of the theory of continental drift as a case of double use of data, where the data used to generate a hypothesis is then used to prove it. We need to build tools that only show a subset of data so that you can then test on the full dataset.
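David's suggestion of hiding data from yourself can be sketched as an ordinary holdout split: explore on one slice of the data, then confirm any hypothesis on the slice you never saw. This is a generic sketch of the idea, not any project's actual protocol.

```python
import random

def holdout_split(items, explore_fraction=0.5, seed=42):
    """Split items into an exploration set and a held-out test set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))
explore, held_out = holdout_split(records)
# Form hypotheses on `explore`; only once fixed, test them on `held_out`.
assert len(explore) == 50 and len(held_out) == 50
assert set(explore) | set(held_out) == set(records)  # nothing lost, nothing doubled
```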

There were questions about this idea of double use in cases where there isn't more than one instance (one universe, one Shakespeare). You can't really test results against multiple other phenomena when there is only one. David seemed to think that hiding data from yourself is a way to have control sets. This seems artificial. Once you have adapted your hypothesis to fit the full dataset you are back in the same situation of using one instance to form a theory.

Data Mining with Criminal Intent

I was part of the presentation on the Criminal Intent project so I couldn't take notes, but you can see the slides at along with instructions on how to do it yourself.

Respondent: Stephen Ramsay

Steve reminded us of the history of text analysis and visualization. He reminded us of our call to have more playful experimentation. Steve talked about how we are indebted to science in the project. He drew attention to how we argued that we would use scientific tools to tell new stories. Human stories are what the digital humanities are about.

He redeployed a question from before: "Is it not art?" While that was asked of visualizations before in a sarcastic fashion, Steve asked it again with respect. Is what we are doing telling new stories an art?

You can read the full text of his paper at

Tom Jenkins: Bringing Humanity to Data to Create Meaning

Chad Gaffield introduced Tom Jenkins from Open Text. Tom is the Executive Chairman and Chief Strategy Officer of Open Text Corporation, which evolved out of the New Oxford English Dictionary project at Waterloo. He is now on the SSHRC Council. Tom talks eloquently about the importance of the humanities in the information revolution.

Tom distinguished between tool makers and tool users. The STEM community are the tool makers. The social sciences, humanists, and artists are the users. One wants to have both in society.

Tom then talked about the beginnings of Open Text and how they dominated the web search business for about 3 years. Most of the web is now behind firewalls (the dark or deep web.) Open Text builds technologies for the deep web.

He then switched to talking about the impact of the cloud. He argued that Web 3.0 is the move from the social cloud to the semantic web. The cloud is making rich mobile and social devices available. We are amazed when we first access them and then a year later they are outdated. Tom argued that the amazing thing about the shift from Web 1.0 to Web 2.0 is the rise of Facebook. We are social animals so it makes sense that Facebook would challenge Google. He talked about the tension between transparency (Facebook) and privacy (Wikileaks). This tension is a social science and ethics issue, not a technology issue.

Tom talked about the role of the humanities. One role is to bring critical voices that question the bullshit. Another role is to talk about governance. Another is to think about how the media are being changed. The cloud drives disintermediation. There is no longer a single big media channel for businesses to use to get to everyone.

An interesting fact he mentioned is the explosion of rules and regulations world-wide. Large corporations need to deal with these rules world wide which can be a nightmare.

The impact of the cloud is probably slowing down. Now is when the social and human innovations will start to kick in. He ended by talking about the Waterloo campus at Stratford where they are developing programs that teach technology, creative arts, and business together. They are also putting on an annual conference, Canada 3.0.

Mining a Year of Speech

John Coleman talked about the Mining a Year of Speech project, which dealt with the challenge of very large audio corpora. An audio corpus is going to be hundreds of times bigger than a corresponding annotated text corpus. John talked about the challenge of linking the audio to annotations by various types of people. They think of their year's collection as a grove of corpora (where each corpus is a tree).

John reflected on large data and the deluge of humanities data coming. Compare these big science projects:

  • Human genome: 3 GB
  • Hubble space telescope: 0.5 TB/year
  • Sloan digital sky survey: 16 TB

To these one can compare some humanities projects:

  • DASS audio sampler: 350 GB
  • Year of Speech: >1 TB
  • Beazley Archive of ancient artifacts: 25 TB

Our datasets show that the humanities have comparable if not larger collections. Our data is also messier and more interesting (at least to us).

Mark Liberman then showed some of their results, but they see this project as being of interest to people beyond linguists.

Respondent: Dan Jurafsky

The respondent started by reflecting on what happens with "micro-revolutions"? Large datasets can lead to research micro-revolutions. He gave a survey of what can be done across disciplines with lots of data. With very large datasets you can look at "lopsided scarcity" where in a long tail situation you want to look at sparse items. You can look at patterns that in a normal dataset would appear so infrequently that statistical inferences can't be made.

This project has advanced research on forced alignment tools (which align transcripts to the audio). Another technical problem that they tackled is anonymization.

He closed by talking about the collaboration of humanities and computer science. It is hard in both fields to get tenure for this type of work. Humanities scholars feel under attack. How can we entice folk in CS to take the humanities more seriously? At U of Alberta we are lucky that we have about 6 CS faculty interested in our work. It seems to be a cultural thing - a department with lots of people working with humanists will have a climate that is welcoming.

Keynote on Culturomics: Quantitative analysis of culture using millions of digitized books

Erez Lieberman-Aiden & JB Michel from Harvard have been working on the Google Books corpus and developed the Google NGram viewer as a result.

They talked about research practices: we can read a few books very carefully or we can read a lot of books algorithmically. They have been thinking about cultural evolution and change. They realized they could look at language change. They showed how irregular verbs tend to regularize over time, especially if they are used less frequently.

They showed some very interesting graphs of the take-up of inventions (how is "radio" talked about after its invention?). They tracked fame over time (people get famous faster and get forgotten faster). They tracked the careers that make one famous (political figures, authors, and actors do best). They did a lot of work on censorship by the Nazis and how certain people were suppressed in Germany.

They concluded by talking about culturomics: "the application of high throughput data collection and analysis to the study of culture." Reminds me of Lev Manovich's Cultural Analytics, though he is looking at non-textual data in many cases.

Towards Dynamic Variorum Editions

Greg Crane talked about what a variorum edition is. A colleague talked about why 2000 years of Latin is great for studying variation. They have crawled 1.2 M books from the Internet Archive of which 25 K are catalogued as Latin but many of them are not. He talked about the problem of polysemy when using large text databases. They trained a broad-coverage word sense disambiguation tool using parallel texts (English/Latin). Where you have a Latin work and its English translation you can train a disambiguation tool which can then be run on the rest of the corpus.
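The parallel-text idea can be illustrated with a toy most-frequent-gloss baseline. The Latin words and English glosses below are illustrative examples of mine, and the project's actual disambiguation tool is of course far more sophisticated; the sketch just shows how aligned translations can serve as sense labels.

```python
# Toy sketch: treat the English gloss aligned to each occurrence of a
# Latin word as a sense label, then back off to the most frequent sense.

from collections import Counter, defaultdict

def train_sense_counts(aligned_pairs):
    """aligned_pairs: list of (latin_word, english_gloss) from aligned texts."""
    counts = defaultdict(Counter)
    for latin, gloss in aligned_pairs:
        counts[latin][gloss] += 1
    return counts

def most_frequent_sense(counts, latin_word):
    senses = counts.get(latin_word)
    return senses.most_common(1)[0][0] if senses else None

pairs = [("acies", "battle-line"), ("acies", "battle-line"),
         ("acies", "sharpness"), ("liber", "book")]
model = train_sense_counts(pairs)
print(most_frequent_sense(model, "acies"))  # → battle-line
```

A real tool would condition on context rather than picking one sense per word, but the training signal comes from the same place: the parallel corpus.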

Bruce Robertson then talked about work with Greek. He talked about his workflow for processing Greek texts. Because OCR doesn't work well on Greek they used students to correct stuff.

John Darlington talked about creating high-throughput infrastructure for OCR and text-based feature extraction for Greek and Latin. He talked about e-science frameworks and how they can be developed for supporting projects like this. Another colleague talked about how e-science could be applied to large-scale OCR. The key is minimizing the need for human intervention. To do this one needs ground truth that can be used to train the OCR.

Greg Crane closed the presentation. He talked about how we need the participation of youth to transform our intellectual culture. We have to invert the hierarchical virtuoso culture of classics so that student researchers can do meaningful work (instead of being told that they can't contribute until they have a PhD.) Citizen scholarship is the future.

Respondent: Cynthia Damon

Cynthia chose to address the team. She talked about mongrel texts and how the first generation of printed texts were problematic. Many of the important texts have no modern edition and the digitization of what is there will mean that many texts go from their medieval mongrel phase to electronic form without modern editing.

She asked what you do with a million texts. It turns out that you can break them down into their words. What do you do with a billion words then? She finds it harder to swallow the distant reading techniques and questions that come when you have billions of words. She asked questions about what she might be able to do. She asked about words and their range of meanings. She had a general request for a more thorough examination of the effects of OCR errors. She closed with a plea for both the large scale tools and provision for the input of scholars (in addition to citizen participants).

Digging into Image Data to Answer Authorship Related Questions

Peter Ainsworth started by talking about the complexity of authorship. It is in the 15th century that authorship emerges as a significant designation. We care about authorship because it is key to understanding cultural production. Their challenge was to look at authorship through 3 very different image corpora (manuscripts, quilts and maps). In all three cases they don't know who produced the items, though they think they might. They designed image analysis algorithms to extract features and then classify images.

Peter talked about some of the challenges of collaboration across multiple sites. They chronicled their journey in a First Monday article. One thing that helped was a memorandum of understanding at the beginning. This covered permissions and credit.

Peter then talked about the medieval manuscripts. The manuscripts, while produced in one spot, are now dispersed. We can bring them together in the cloud. Their main quarry are the artists and scribes that created the manuscripts. They created an image tool for examining the pages. The art historians try to define traits of a master artist. They are trying to assist in the tracking of artists. Part of the issue is segmentation, so one has smaller shapes (heads and helmets, for example). They also used colour space analysis. Likewise they are interested in the scribes and their particular orthography. Doing it by hand/eye is difficult. Could digital techniques help? Could they help with the identity of the shadowy figures who copied manuscripts? They applied Sobel edge detection.
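For readers unfamiliar with it, the Sobel operator is a simple convolution. A minimal plain-Python sketch of the operator itself (the project will have used real image libraries): two 3x3 kernels estimate horizontal and vertical intensity gradients, and the gradient magnitude is large exactly where edges are.

```python
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel_magnitude(image):
    """image: 2D list of grayscale values. Returns gradient magnitudes
    for interior pixels (borders are skipped for simplicity)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out

# A tiny image with a sharp vertical edge between columns 1 and 2:
img = [[0, 0, 255, 255]] * 4
edges = sobel_magnitude(img)   # interior pixels near the edge score high
```

On the manuscript pages, responses like these would feed the segmentation into smaller shapes (heads, helmets) described above.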

Dean Rehberger talked about 19th and 20th century quilts. He commented on how important the young scholars were to the success of the project, as were computer scientists who cared. In their quilt database they have tens of thousands of items with lots of metadata. One question they addressed was whether they could identify which were "crazy" quilts - a Victorian invention. They were produced by women and are not as regular as other quilts. They have a crazy explosion of shapes and colour. Segmentation was again an issue. Their algorithm got to about 70% accuracy. What was interesting was the group of false positives. Now they are trying to see if they can determine whether something is an Amish quilt. They are also interested in how quilters take up ideas from each other.

Peter Bajcsy then talked about dealing with maps of the 17th and 18th century. One thing they did was to use the neatlines for scale rather than scale indicators (which the computer can't find easily). The neatline is the frame around the edge that often has ticks that indicate scale. He showed a table of the Great Lakes across different maps, comparing the map area for each lake to the actual area. Can this be used to tell how accurate the map as a whole is? Can they tell things about maps from different countries (English vs French)?
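The neatline trick is back-of-envelope arithmetic once the ticks are found. The sketch below uses invented numbers purely for illustration (the tick spacing, pixel area and "actual" lake area are not the project's data): tick spacing gives a km-per-pixel scale, which turns a measured pixel area into an estimated ground area that can be compared to the modern surveyed value.

```python
KM_PER_DEG_LAT = 111.0  # approximate km per degree of latitude

def km_per_pixel(tick_pixel_spacing, degrees_per_tick=1.0):
    """Scale from the pixel distance between neatline latitude ticks."""
    return degrees_per_tick * KM_PER_DEG_LAT / tick_pixel_spacing

def estimated_area_km2(pixel_area, scale_km_per_px):
    return pixel_area * scale_km_per_px ** 2

def percent_error(estimated, actual):
    return 100.0 * abs(estimated - actual) / actual

scale = km_per_pixel(tick_pixel_spacing=222.0)       # 0.5 km per pixel
lake_estimate = estimated_area_km2(240_000, scale)    # 60,000 km^2 from the map
err = percent_error(lake_estimate, actual=59_600)     # vs a (hypothetical) surveyed area
```

Aggregating errors like this across lakes and maps is one way to ask how accurate a map as a whole is.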

Peter then talked about the computer science side of things. The memorandum of agreement was important when negotiating across disciplines. He also showed how they recorded Skype sessions for future reference. They also shared a software repository, which was used to share algorithms. Segmentation tended to differ for different types of images, but the algorithms for edge detection could be similar. They are developing a statistical framework that can tell you how much confidence you should have in the statistics about the segments.

Peter Ainsworth concluded with results from the subprojects. This project identified a unique colour space palette for the manuscript artist/master known as the follower of the Rohan Master. For the quilts they found that Victorian crazy quilts are similar to modern quilts, along with insights into colour. For the maps they have results that help them understand the differences between French and British maps. They found that French mapmakers had a better grasp of how climatic features affect the topography - an unexpected insight that they hadn't thought of.

Humanists haven't had the chance to think of such things before, but they could help with authorship. Authorship remains a useful concept for anonymous works. Collaboration with non-humanists forces humanists to make explicit their methodology of visual inspection.

See for more.

Respondent: Sha Xin Wei

Wei took the chance to talk about disciplinary practice. He feels that the next step is to co-develop tools with interpretation - to move from human-in-the-loop machine processing to humanist-in-the-loop. He commented on the need to interpret and critique the tools too. We should turn our interpretative skills on our own tools. What you see might be what you expect to see.

He talked about tertiary orality. There are a lot more people with mobile phones than with internet access on the planet.

He talked about the huge problem of signal analysis versus semantic analysis. There are a lot of assumptions about what is signal and what is noise. He showed a video of a responsive environment with shallow semantics. He argued for performative approaches: meaning is constructed in performance. How can we use that to guide the development of tools?

He sees interpretation as a promulgation of scholarly dialogue. Are graphs really sufficient to the phenomena?

Finally, what is the unit of analysis? Maybe there are no primitives. Maybe there is experimental practice as a form of performance. Can we imagine an experimental form of humanities where we build the very things we are studying?

Patrick Juola asked the team whether what they were doing was closer to genre analysis than authorship attribution.




Page last modified on June 10, 2011, at 03:54 PM - Powered by PmWiki