Digital Humanities 2016

These are my conference notes on DH 2016 in Kraków. Like all my conference notes these are being written live and will be full of bizarre run-on or run-off thoughts. Send me corrections!

The Twitter hashtag is #dh2012 and there is an archive at July 13th Archive thanks to Ernesto Priego.

The abstracts are at: http://dh2016.adho.org/abstracts/

Sunday, July 10: New Scholars Network

On Sunday I was part of the New Scholars Symposium which was supported by CHCI and centerNet. These two organizations allowed us to bring to Kraków 15 graduate students, postdocs, and new scholars for an unconference on the digital humanities. Some of the subjects we talked about included:

Ubiquitous Computing: VR, Augmented Reality, Surveillance, Privacy
DH in the Humanities
Pedagogy

Monday, July 11: Innovations in Digital Humanities Pedagogy: Local, National, and International Training

I was lucky to be included in a mini-conference and member meeting sponsored by the International Digital Humanities Training Network / ADHO Training Group.

The day started with historical and organizational comments by Ray Siemens Diane K. Jakacki, and Katherine M. Faull. DH Training is a netowkring group that is growing organically.

Building a European DH Pedagogical Network

Walter Scholger (U Graz) talked about the Dariah-EU working group on Training and Education. They are doing a lot of work coordinating things. Walter showed the DH Course Registry <https://dh-registry.de.dariah.eu/> where you can find out about courses and programmes, mostly in Europe. Anyone can use it, and you can get a login if you want to add your courses.

Stef Scagliola gave a Dutch Overview of Digital Humanities. She talked about why we should track the emergence of a new discipline like digital humanities. She talked about how people aren't using all of the richness of TaDiRAH. This is a problem with metadata - people aren't used to using classification systems in disciplined ways.

Toma Tasovac (BCDH) talked about how There is Something Rotten in the State of Training Materials. He talked about how we don't share, we don't archive, and we don't give credit. The symbolic value of training materials is low. He is part of #dariahTeach that is funded by DARIAH-EU. Their project is much more about dynamic learning objects than a MOOC. They use NeMO (NeDiMAH Methods Ontology.) He then talked about what happens when the grant money runs out. They are trying to develop social sustainability - a neat idea that involves making it into a journal that then provides the symbolic capital.

Lightning Talks

Jennifer Edmond talked about PARTHENOS, a cluster project of European infrastructure. It brings the projects together under a programme of activities. Training is important to infrastructure. They are training about infrastructure - looking at concepts like interoperability and sustainability. They also look at epistemic cultures and misunderstandings across cultures. For all of these there are short online and longer F2F versions of the training right up to a Research Infrastructure Boot Camp.

Francesca Benatti and Paul Gooding talked about CHASE (Consortium of the Humanities and Arts South-East England). They don't have any formal DH centers but they are collaborating with GLAM institutions and each other to embed DH into other training structures.

Franz Fischer talked about DiXiT – An Innovative Marie Skłodowska-Curie Training and Research Programme in DH. Digital scholarly editing is one of the most mature fields in DH. DiXiT is the Digital Scholarly Editions Initial Training Network. They offer a core, modules and support for fellows. They have camps to meet the needs of fellows and students. They then support additional events endorsed by the partners.

I'm impressed by all the European training initiatives. The EU is obviously putting money into research training across institutions.

Paul Spence talked about Cultural Diversity in the Digital Humanities Classroom. He talked about internationalizing the curriculum and mentioned Leask, 2013 on campus culture of internationalization. Paul is starting a process to audit their MA in DH as to its internationalization. He noted that their curriculum is very Euro-centric. It is a programme in English and based in European textual traditions. And yet they have lots of students from China where there are different traditions at all levels. The same for India. There are constant issues around assessment and critical writing. There cultural issues around interactions between students and between students and instructors.

Gimena del Rio Riande and others talked about DH Training in the Spanish Speaking World: When Digital Humanities Become Humanidades Digitales. They talked about LiNHD in Madrid. They are creating expert certificates. They look at English materials, but have found that just translating them doesn't work. They need to be adapted and contextualized for their needs. Most of their students can read English, but they need to be careful about which materials they assign.

Orla Cork talked about The Pragmatics of Teaching DH as a Discipline at UCC. Orla talked about creating something out of nothing with a shrinking budget. They have an MA, an online MA, and now a BA. They have brought in all sorts of disciplines from philosophy to geography. It is a big and inclusive tent. She talked about collaborations with the GLAM sector. They now have some of their own space and are trying a Bring Your Own Device technology. She talked about how they are not assigning essays, but digital outcomes. Every course has three aspects - a personal, a presentation, and a collaboration. The outcomes will include critical writing, but not necessarily essays. Their philosophy rejects the binary of humanities and sciences.

Susanna Allés Torrent talked about DH Integration in a Modern Languages Department. At Columbia their integrating DH into the courses and programmes of the Department of American and Iberian Cultures. This integration responds to the reality that language departments are hiring people that have experience with digital methods. How does one integrate DH into language and culture curriculum? What are the students interested in? One of the interesting things in her seminar plan is the teaching of metadata for GLAM community.

Elisa Beshero-Bondar presented on Training Faculty and Students to Learn and to Teach “Coding Across the Curriculum.” She is the head of a new center. Her institution is small, but this allows them to be digital studies (broader than DH in that it includes digital media.) Students build projects on the public web. Students like to build real things that others can see. Students, colleagues, and others are being trained together in a institutional bootstrapping project.

Anouk Lang presented on Breaking the Mould of the Essay: Using Digital Projects in the English Literature Classroom. She talked about the projects she has students work on. Again they work in public. One of the challenges is getting students not to write linear essays on the web. Her challenge is sceptical of anything that is not an essay or exam. She mentioned a peer assessment method that her department borrowed from her course.

Katherine M. Faull and Diane K. Jakacki presented on Reaching Across the Divide: Building Curricular Bridges to Meet Undergraduate DH (Learning) Goals. They talked about challenges reach out to CS. They work at an undergrad institution that is STEM centric. They developed a minor and promoted it on social media that got them good attention. They talked about how there is a perception that DH integration takes a lot of work and therefore shouldn't be tried by junior colleagues. Therefore they had to create separate courses. They talked about how their minor is also interesting to CS as the humanities bring real problems to help students learn about real projects.

An issue that came up was translating technical terms and code like the TEI.

Ray Siemens closed talking about All Ships Rise with the Tide: Partnership in DH Training.

Member Meeting

Some of the issues mentioned included:

Problems with standardization of credits and therefore their transferability
How can we collaborate when our training is so different (from workshops to MAs)
How can we handle change coming from other areas (students and other programmes)
How can we communicate about learning objects
Could we develop an ADHO seal of approval for learning modules? What would it take to organize something like that?
We talked about sharing and exploitation. How to avoid drawing of people without rewarding them in some way.
We talked about how to share actual teaching strategies

Panel: Publication Approaches Supporting DH Pedagogy

I was part of a final panel on a collaborative publication. Natalie Houston (U Massachusetts Lowell) started us off and gave an overview. The project is Digital Pedagogy in the Humanities edited by Matt Gold and others. It is a curated collection of downloadable and reusable and remixable pedagogical artifacts. There are 50 keywords and I am working on Visualization with Stéfan Sinclair. It will be published by MLA Commons and is in the middle of open review. Any one can review. All the editing is done in github.

Natalie then talked about the keyword she is curating, Text Analysis. She talked about how everyone in the humanities talks about "reading" texts for all sorts of practices. She broke her curated collection into 4 categories starting with 1) Digital Pedagogy Unplugged where we can have . 2) Text Analysis Tools and Methods covers various tools. 3) Text Editing as Text Analysis category looks at how annotation is part of the analysis. Finally 4) Communicating Text Analysis looks at ways to teach students to use multimedia to communicate.

I talked about the Visualization keyword that I am working on with Stéfan. Piotr Michura talked about the Prototype keyword that he is doing with other including Stan Ruecker. He talked about what a prototype is. They are often tangible and temporary (in the sense they are not designed to last or be kept.)

Production-driven prototypes
Experimental prototypes - designed to address a research question
Provocative prototypes - designed to change thinking

He gave some examples like a set of prototypes from The Stigma Project where students worked with AIDS data. Another project was a set of physical structures that are provided to students for modeling a text or idea. Stan Ruecker talked about this in his 3DH talk The Digital Is Gravy . Piotr then showed hope people choreographed use of prototypes with figures and shooting video with a smart phone. The general idea is to find ways to hold more than one opinion.

Tuesday, July 12

CWRC & Voyant Tools: Text Repository Meets Text Analysis

Susan Brown, Stéfan Sinclair and I ran a workshop showing CWRC and Voyant. We showed the beta of the CWRC collaboratory and then Voyant Tools 2.0. My script for what we presented on Voyant is at http://hermeneuti.ca/intro-workshop . We talked about the new features in Voyant 2.0 including:

How you can switch languages. See http://voyant-tools.org/docs/#!/guide/languages
How we have a new and improved help: http://voyant-tools.org/docs/#!/guide
How you can use something like regular expressions when selecting words or collocates
How you can manage a corpus within Voyant
The new corpus uploading/management options: http://voyant-tools.org/docs/#!/guide/corpuscreator

We then showed how one can launch Voyant on texts in CWRC. You can see a version of this here: http://voyant-tools.org/catalogue/cwrc/?facet=facet.extra.collection,facet.author

Opening Ceremony

We were welcomed by people including:

Karina van Dalen-Oskam, ADHO Steering Committee Chair
A Dean from the Jagiellonian University, Boguslaw Skowronek
Dean of the Fraculty of Philology, Pedagogical University
Manfred Thaller, Chair of the Programme Committee
Maciej Eder and Jan Rybicki, the local organizers
Seth Cayley, Director of Research Publishing for Gale.

Seth talked about how Gale is now sharing data for any of their collections with researchers at libraries who have subscribed. He gave examples of projects using the data. He talked about their n-gram viewer and showed their sandbox which is coming.

We have 900 registered participants, the most of any DH conference.

Jan Rybyicki then introduced the keynote speaker.

Agnieszka Zalewska: Can CERN serve as a model for Digital Humanities?

During its 62 years history, CERN has grown to become the biggest and most globalised sceince organization. CERN was founded in 1954 by 12 European countries with the idea of "Science for Peace". The idea was that the atom could be used for peaceful and scientific purposes. The effort also aimed to stop the brain drain to the US.

There are now 2300 staff and another 1400 others including many students. There are more than 12000 users.

CERN was created in difficult times by visionaries, both scientists and diplomats. They have concentrated on ambitious projects with well defined deadlines. Important was sustainable support and collaborations with industry. Diversity of scientists and engineers led to a creative environment. They also promote their knowledge and technology.

She then gave us a tour of the different types of devices CERN has to accellerate and decellerate matter. They are going deeper and deeper into the structure of matter. She talked about the filtering of interactions to get at the very few bosons. CERN experiments

The hicks boson is a unique particle. They filled the whole universe right after the big bang. They now undestand 5% of the mass of the university. The remaining 95% is dark matter and dark energy. There are fundamental questions that need to be answered about dark matter and energy.

If you have living infrastructure then new ideas appear like projects looking at how clouds form.

She then switched to Dan Brown's Angels and Demons where a gram of antimatter is stolen that could be used as a weapon. She pointed out how CERN's creating antihydrogen is not dangerous.

Then she talked about technology. CERN shows how if you develop technologies then you can transfer them to society. The history of synchrotons and cyclotrons shows that they are now used widely in industry. She gave other examples of transfer. They are now thinking about how to reduce power consumption.

Of course, one of the best examples of transfer is the Web that was developed at CERN. Zalewska was one of the first hundred users. The web was not patented and is freely available thanks to CERN. Now they have the LHC Computing Grid which has 500,000 CPU cores and runs millions of jobs.

CERN education activities are important and they now have a down-up process for particle physicists to decide on a European strategy in a global context. What to do next? The idea is to have a binding strategic process that involves research councils and physicists.

Instead of conclusions she suggested that absolutely essential for good science is outstanding visionary scientists mentoring young people starting their adventure. If people start their adventure with good guidance they will go so much farther.

She hoped that CERN can serve as a model for the digital humanities.

Wednesday, July 13

Maciej Maryl and Maciej Piasecki: Where Close and Distant Readings Meet: Text Clustering Methods in Literary Analysis of Weblog Genres

They started by talking about weblogs and how they can be categorized. They showed a visualization of clusters of categories. This paper tries to take it further and ask what genres there are in blogs. Their scheme is to treat blogs as social action follwing Miller. From this they proposed a conceptual typology of blog genres:

Diaries
Reflection - subjective discourse on universal matters - poems
Criticism - subjective and expert discourse on general issues - collumns
Information
Filter
Advice
Modelling
Fictionality

They developed this through close reading. Then they decided to use a distant reading approach that might validate the genres. They used linguistic approaches like most frequent words (Jannidis and Lauer 2014). There is a muddy issue about genre vs register.

They ran a number of studies fromm qualitative analysis, coding, validation through clustering, indentification of characteristic features. Polish offers challenges being a highly inflected language which suggest that they should go beyond most frequent words. They identified features from punctuation marks, lemmas, grammatical classes, and sequences of classes. They used the Polish National Corpus to identify appropriate words to use that didn't have semantic associations. They selected 250 blogs that had been manually categorized for automated processing. They tested different measures of similarity between documents. For clustering they used Cluto and Stylo.

They couldn't find good match to human categorization. Purity was 58%. When they added grammatical information it got higher and the cooking blogs clustered nicely. They showed a nice summary of the linguistic features that associate with different genres. Diaries had a lot of 1st person reflection.

They then tried to an automated clustering as an alternative set of categories:

Essayistic langauge
Cooking
Narrative
Expert/Columnist
Personal

I liked the iterative and potentially circular way they went from human categorization to automated.

Geoffrey Rockwell: Curating Just-In-Time Datasets from the Web

I then gave a paper. Obviously I could take notes on it.

Thomas George Padilla and Devin Higgins: Data Praxis in the Digital Humanities: Use, Production, Access

They looked at journal articles to see which ones were data driven. They looked at DSH, DHQ and JDH. The commented in how certain institutions are more likely to produce data-driven papers. Interestingly full profs and grad students seem to be publish the most.

They then addressed the question of why librarians shoud care. Thinking about developing resources means thinking about the users not only the materials. There is, nonetheless, a gap between curatorial practices and researchers. Libraries will have users that jump all over the place and they need to be supported. They talked about differences in how data is cited and so on. Most are citing text sources! Very few citing audio, video, images. Conversely all sorts of resources are being used only once. Often the data isn't really available. In theory materials should be available, but often they aren't really accessible. They found only 10 instances of data available through a research depository.

They then used their data to look at collaborations. The collaborative universe is Professors, Associate Profs, Assistant Profs and Graduate Students. Then there are various itinerant researchers. Each caste seems to be collaborating with their own. Grad students with grad students, profs with profs. English to english, computer science to computer science. There was more collaboration in data-driven papers than non data-driven.

We had an interesting conversation about collaboration and data.

Salvador Ros and Gimena Del Rio: Researchers’ perceptions of DH trends and topics in the English and Spanish-speaking community. DayofDH data as a case study.

The authors talked about the Day of DH project and how it has evolved. The last edition was hosted by LINHD in Spain and was on a bilingual platform. The Spanish was called DiaHD and this can cause confusion. When they gathered materials they found lost of materials beyond just the Day of DH archives.

The data scraped was not consistent and not all data was available. In some archives the comments are not available. They showed some very interesting statistics about the different days.

They talked about the language issues. For a year or two it was transitional. Now it seems that it is becoming more multilingual. 2013 was important in the transition and important to communities like those writing in Portuguese.

Time stamps are an interesting problem, but they could show us when people post. They also did some social network analysis showing that there is . They did sentiment analysis and found most posts are positive.

Some of the high frequency words are Reflecting, Teaching and Travel. In the Spanish version the event itself appears as a topic. Philosophy seems important to the Spanish edition along with social sciences.

Gimena concluded with a call for standardized data and a more flexible and multilingual platform.

Whitney Trettien and Frances McDonald: Thresholds: Valuing the Creative Process in Digital Publishing

The presenters talked about the contrast between the messiness of the process of writing to the polished structure of the final product. Projects usually hide their messiness and the texts in dialogue. Thresholds is a journal for criticism in the spaces between. The idea is to have the essay on one side and all the messy stuff on the other. Authors for the first issue are using the recto side in ways not anticipated so they are adapting to support them. People are using the two parallel columns to:

Translate
To trouble genres of discourse
To perform dialogues
To combine memoir and philosophy

Another intervention they are designing is on citations. They want to foment a different citational ethics. Authors and those cited are collaborating. People can create their own messy walls. Readers can recombine fragments in their own walls and upload.

Michael Eberle-Sinatra: Le Futur Du Livre Électronique En Accès Libre : L’exemple De La Collection "Parcours Numériques"

The digital is everyone and it remediates the analogue. He talked about the online editorial line Parcours Numériques. They are developing institutional politics to change the publishing of ebooks in the hope of bypassing the monopoly of certain academic publishers. Michael reminded us of the ways we researchers pay over and over for writing and reading our own publications. We need new models. He talked érudit and revues.org.

They are publishing shorter works that can still support complexes theses. They publish a paper book, a linear ebook for readers, and then a complex hypertext version that is free online. Their hypertexts have all sorts of non-linear threads. The reader of these doesn't read linearly, but becomes a "flaneur". The digital doesn't destroy our attention but we pay a different type of attention. Their online versions of the editions are free online, but you can also get a paper version or a PDF of the paper version. Open access online allows a form of international visibility. There are lots of English digital teaching materials, but fewer in French. This series allows for the publication of teaching materials.

Michael described the editorial process where editors work with authors to add a hypertextual layer.

Posters

I visited a lot of the posters. I've put up pictures of some of the interesting posters (by no means all) on my flickr account at: https://www.flickr.com/photos/geoffreyrockwell/albums/72157668340588253

Here are some of the titles:

Browsing, Sharing, Learning and Reviewing the Haine du théâtre Corpus through Insightful Island proposes cartographic maps of abstractions of literature.
EVT 2.0: a new architecture for critical editions in digital form has a very nice TEI publishing tool for editions with facsimile pages.
Making George Washington's Financial Documents Accessible: Transcription, Data, and the Drupal Solution

Thursday, July 14th

Panel: The Trace of Theory: Extracting Subsets from Large Collections

I chaired a panel on a project we did with the HathiTrust Research Center. Here is part of the abstract:

Can we find and track theory, especially literary theory, in very large collections of texts using computers? This panel discusses a pragmatic twostep approach to trying to track and then visually explore theory through its textual traces in large collections like those of the HathiTrust.

1. Subsetting: T he first problem we will discuss is how to extract thematic subsets of texts from very large collections like those of the HathiTrust. We experimented with two methods for identifying “theoretical” subsets of texts from large collections, using keyword lists and machine learning. The first two panel presentations will look at developing two different types of theoretical keyword lists. The third presentation will discuss a machine learning approach to extracting the same sorts of subsets.

2. TopicModelling:T hesecondproblemwetackledwaswhattodowithsuchsubsets, especially since they are likely to still be too large for conventional text analysis tools like Voyant (voyanttools.org) and users will want to explore the results to understand what they got. The fourth panel presentation will therefore discuss how the HathiTrust Research Center (HTRC) adapted Topic Modelling tools to work on large collections to help exploring subsets. The fifth panel talk will then show an adapted visualization tool, the Galaxy Viewer, that allows one to explore the results of Topic Modelling.

The panel brings together a team of researchers who are part of the “Text Mining the Novel” (TMN) project that is funded by the Social Sciences and Humanities Research Council of Canada (SSHRC) and led by Andrew Piper at McGill University. Text Mining the Novel (noveltm.ca) is a multiyear and multiuniversity crosscultural study looking at the use of quantitative methods in the study of literature, with the HathiTrust Research Center is a project partner. The issue of how to extract thematic subsets from very large corpora such as the HathiTrust is a problem common to many projects that want to use diachronic collections to study the history of ideas or other phenomena. To conclude the panel, a summary reflective presentation will discuss the support the HTRC offers to DH researchers and how the HTRC notion of “worksets” can help with the challenges posed by creating useful subsets. It will further show how the techniques developed in this project can be used by the HTRC to help other future scholarly investigations.

Web Historiography - A New Challenge for Digital Humanities?

I was part of a panel on web historography.

The first intervention was by Niels Brugger who showed the evolution of the city of Krakow web site. His talk had three parts.

Three types of digitality
Digitized collections
Historical web archive studies

There are different types of the digital. There is digitized versions of analogue materials. Then we have born digital materials. Then you have reborn digital materials that has been collected and re-presented.

He compared digitized colletions to archived web. With a DC you can go back to originals. DCs are usually done transparently and in a similar fashion from one archive to another. WA don't have originals - just a relationship with an ephemeral original. There are lots of decisions as to what and how to archive. The process is not as transparent. Lots of things can go wrong. You also have the dynamics of updating. The WA is something that never existed in reality - they is a temporal inconsistency. What we get are versions of something gone.

In a DC you can add hyperlinks and make a register. In a WA you have too much and too little. Links are a mess as they are inherent to the original and it is hard to haved a register. Archiving WA is based on the links and many of them will not be maintainable.

Niels showed examples of what you might want in a WA. The WA need us to help them understand what we might do with the archive - that will help them decide what to gather.

Niels then talked about historical web archive studies and gave some examples. You can have big data projects like the entire .dk domain or close reading of a small circle of pages over time.

There were questions about things like the archiving of Facebook. This is really hard to do.

Jane Winters went next. She talked Negotiating the Archive(s) of the UK. This is difficult as there are more than one archive. The British Library has three archives. An open curated collection that is free to access, but each page has been approved. There is a annual domain crawl, but that is not accessible. It is restricted to researchers at selected local centers. The Portuguese archive Archuivo.pt is one of the few

Big UK Domain Data for the Arts and Humanities (BUDDAH) project has mixed access. They have a ngram viewer. It is limited to keywords.

Most WAs are not set up for text mining.

The National Archives has an open and comprehensive archive for the UK government. The UK Parliament has another for parliament.

All these archives exist in isolation and they are just in one nation. It is a problem of abundance and redundancy that masks gaps. There might be many copies of one site and none of another.

There are serious challenges around the reliability of the dates of archiving. The scraping is not consistent so you don't have equal data per year. That makes ngram graphs flacky.

Ian Milligan spoke next on Web Archives are Great - But How Do You Use Them? Now we are limited by the WayBackMachine. It is like browsing the web not like studying a corpus. Now there are new approaches:

Computational approaches that depend on command line cloud access
We need lighter-weight environments for researchers

We know we need new search engines, but how do we want to use them? What we need is distant reading environments that let one zoom in and back out. The challenge is how to build good portals. Ian talked about his work on pivotal changes in Canadian politics. The U of Toronto has teamed up with the Internet Archive to gather a collection of important political web sites, but it is hard to use.

Ian is working with Shine (also used by BUDDAH) to create a portal at http://webarchives.ca . One can do much better research with this portal. You can separate by party and see that public transit gets talked by the Liberals for a while and then disappears.

Because they can do distant reading then close reading they can learning new things. They saw how parties flirted with commenting and then stopped. You can also learn when you let others try questions.

He talked about the WALK portal that will give Canadians access to their web archives.

I then talked about the Ethics of Scraping Twitter.

Federico Nanni talked about his work on the history of universities. He wanted to look at the recent history of the University of Bologna. One of the problems is that there is no Italian national archive. The University of Bologna was excluded from WayBackMachine. Has the UB sites of the past been completely lost? Federico then had to take other approaches. He conducted interviews with the old web masters, but that didn't help. Some of the other archives had UB sites because of links. People also suggested using newspaper archives because the newspapers often wrote about web sites.

The UB web site was excluded because of a letter from UB, but the WBM had actually crawled. It turns out there is a lot of information about courses and programmes. DART-Europe provides access to PhD theses, Dan Cohen released A Million Syllabi which has UB materials.

He now has lots of materials, but there is little metadata about discplines. He trained a classifier to detect disciplines for each page/document. They can recognize not only discplines, but also interdisciplinarity.

Now that we have new sources and new methods - can we now pose new questions? He proposed a new question: Is it true that the academy is experiencing a computational turn?

He ended by reflecting on the training of graduate students. He feels it is important to offer DH training early.

Short Papers: Text Mining 2

Stefano Perna and Alessandro Maisto presented in Italian on RAPSCAPE – un’esplorazione dell’universo linguistico del rap attraverso il text-mining e la data-visualization. They created a platform for studying rap music - an important phenomenon in Italian popular music. Why rap? Because it is one of the most influential musical genres internationally of our times. Rap has a high level of linguistic creativity. In Italy there haven't been many studies of rap. In their project they focus on the linguistic aspects of rap. Their project had 3 phases:

Corpus building
NLP - data mining
Visualization

The corpus creation was complex because the lyrics aren't typically published or gathered. The main resource for creating the corpora are the fan sites that gather lyrics like http://raptxt.it - the quality is mixed. They created a crawler for these and they scraped sites with metadata. They have about 2400 songs. The had to clean up the data, tag it, lemmatize, and then analyze it. They tried author similarity and other mining techniques. They will soon be putting online what looks like a nice visualization tool. They can look at word networks or collaboration networks. They hope to next explore relationship with music.

They talked about issues with slang and dialects. They also talked about swear words and copyright.

Johannes Hellrich presented on Measuring the Dynamics of Lexico-Semantic Change Since the German Romantic Period. Johannes is interested in language change. There are several ways of measuring language change. Word frequencies is one way. Cooccurance is another. word2vec is a tool that looks at embeddings over time. He showed a graph of how the collocates of "gay" change dramatically in the 1980s. He took this model and applied it to German. He used the Google ngram corpus to study words and their collocates. He showed a shift in the world "heart" to anatomical senses. Some problems include the sampling the do. He proposed a future "dragnet" historical linguistics.

Stefan Pernes presented on Metaphor Mining in Historical German Novels: Using Unsupervised Learning to Uncover Conceptual Systems in Literature. Stefan started by talking about metaphor as a mapping. His approach stays within the sentence for the metaphor - he looks at word pairs. He is clustering and then trying to graphing the results.

Martijn Naaijer talked on the subject of Linguistic Variation In The Hebrew Bible: Digging Deeper Than The Word Level. He started talking about the history of biblical Hebrew. We don't know a lot about the history of the Herbew language. It is difficult to separate composition and transmission.

Janet Delve and Sven Schlarb discussed Using Big Data Techniques For Searching Digital Archives: use cases in Digital Humanities. They began by describing the problem of all these archives that suddenly have to archive digital data (as opposed to print.) All sorts of organizations got together to address this challenge of e-government. This is important as we all need to have good solutions to government archiving. Their goal is to produce standards and tools. Then they gave an overview of the architecture, E-ARK. They do "package" transformations and can do this in parallel. The architecture is python, django with Celery. There is Solr for faceted search. Sven showed a remarkable staircase of steps that a package goes through. Then he talked about some use cases. One used Stanford NER to find locations and then visualization using Peripleo. They seem to be able to do very sophisticated time and place NLP and visualization. They do text classification too. Prototype of E-ARK|Prototype available online.

Helen Agüera: Early Funding of Humanities Computing: A Personal History

Helen Agüera was the Busa award winner this year. She worked on the TLG, promoted the TEI, fostered all sort of projects, and supported DH in the NEH.

Agüera started by talking how she was astounded when informed that she had been awarded as she was a grant administrator. She also feels this recognizes the NEH's support of DH. Her talk, however, is her personal history from 1979. She commented how they didn't have at the NEH the relevant records about earlier awards. It is only recently that they digitized their records. Early annual reports and magazines describe the expectations of the NEH of the field of DH. Use of computer technology was important to the NEH from the early years. The NEH was established in 1966 after a report that called for the use of computing in research. The first annual reports discuss teh funding of experimental work. In 1967 there were several grants, including grants to explore the use of computers in teaching.

She gave some examples. She talked about an early and successful institute in Kansas that I think was run by Sally Sedelow. She described a Stanford bibliographic project that was influential. There were two grants in 1975-7 to the MLA to develop a new system for the MLA bibliography. This should be researched more. In 1974 Andy van Damm at Brown got a grant for a hypertext system. It was used in an English class. There is a recent documentary on this project.

Then she focused on projects after 1978 when she joined the agency. She was in the scholarly editions unit that supported projects that needed long term funding. A few projects that used computers were funded including one by Shillingsburg (CASE). Chesnutt had a project too. The development of many research tools like dictionaries were facilitated by computers, even if they were published in print. There were a few projects that published electronically. She had to negotiate with the NEH library about to do with the digital outputs. Agüera talked about how dictionary projects had to migrate their data to computers as the amount of data overwhelmed filing cabinets.

In the 1980s and 1990s editors began to use PCs for projects like encyclopedia projects and more and more published online. Projects that had hundreds of scholars worldwide needed computing to manage. She mentioned an encylopedia of the Iranian world.

One of the most notable projects they funded was the TLG. From 1974 NEH grants supported the digitization of texts (offshore). She talked about how the TLG went from being available from a mainframe to being available on a Ibycus workstation to then becoming a web service. Projects like this showed the need for standards. Agüera encouraged Nancy Ide and others to submit a proposal even though they had never funded something like this. She and two other NEH staff attended the Poughkeepsie meeting. The TEI was funded repeatedly to develop guidelines. The NEH has also funded activities that promote the guidelines. The TEI is now widely used and has influenced other markup schemes. They also funded the development of historical minority language scripts and fonts for unicode.

For over 15 years the NEH has participated in international initiatves like the Digital Libraries Phase 2 that led to things like Fedora. They have supported the National Digital Newspaper programme with the Library of Congress. They have supported endangered languages projects. Recently the NEH has announced a challenge to people to try using the data and APIs of the Chronicle of America project.

The NEH itself has been significantly changed by the use of digital tool. Agüera lived through this changed. Digital technology made a number of their processes easier.

Now the NEH has become a leader in the Digging into Data international platform.

When asked for advice for applicants, she advised that we send questions and draft ideas to many programmes in the NEH to find the right fit.

Friday, July 15th

Saul Martinez Bermejo: "El Atambor de Plata Suena como Cascaveles de Turquesa". Reconstrucción de la Experiencia Sonora de la Colonización Europea (c. 1480-1650) a Través de un Glosario y un Tesauro Digital.

Bermejo is trying to reconstruct historic sounds of the European colonization. There are a lot of problems with such reconstructions. There is no thesaurus to use in cataloguing. How to establish a syntax to make sense of the evidence accumulated (Smith 21002). We have to deal with modern ideas of what a sound is and what is important.

We have guides to sound effects, but these are not connected othe other ideas. We can use dictionaries, encyclopedias, images - and then try to define equivalents. There is a SKOS primer W3C that provides guidelines for RDF of informal knowledge.

His limit case is "silver drum jingles like turquoise jingle bells" - a text from a grammar that deals with Nahuatl translating it into Spanish (and now English.) How does one recover what was meant in the original poetry (which has its own music.)

The aim of the project is to increase awareness of the importance of sound. He wants to create tools for describing and analyzing sound both aural and non-aural evidence. Then he wants to link to other glossaries to build a web of language.

Christian Wolff: Tool-based Identification of Melodic Patterns in MusicXML Documents

Wolff presented a tool that is being developed to recognize melodies. At his library they have 140000 sheets of handwritten folksongs collected from 1914-1943. The collection is being digitized. It has melody and typed lyrics. They have various projects to recognize different parts of the sheets. This project is trying to recognize the musical melody, metadata, and lyrics. Music recognition isn't working that well so they are looking at a project where someone plays the melody and that is recognized.

If they had the data they could ask about characteristic patterns, regional patterns, or changes in melodies over time. Music information retrieval isn't new and there are many approaches. Different approaches suit different ways of representing music.

Their tool uses Music XML (an emerging standard) which then can analyze it. They store stuff in a database and they can render scores from the data. They have Melodic search which is now "exact match" and they hope to allow more fuzziness. They want to use the Parsons code to represent melody and analyze it.

One can see it at Music XML Analyzer.

Stephen Ramsay and Brian Pytlik-Zillig: Picture to Score: Driving Vector Animations with Music in the XML Ecosystem

This short paper was about making music not analyzing it. The title reverses the usual way music is made to fit a picture. Here they are going from picture to score. They built a tool where one starts with sound and drive the visual. One can do this with Max MSP, but it is expensive. The authors wanted to escape reliance on expensive tools. Brian uses SVG and Indigo to generate the art visualization. For this project they wanted a cheaper route and settled on Music XML which is a standard for interchange.

He played a peice called DH 2016. He composed it on a score. There is a lot going on from pitch, dynamics, tempo and various changes. Music XML can capture most of it. Once you have the Music XML one can use XSLT to transform the XML into SVG animations. Steve played the animation which connected the musical events to visual events. It showed instruments as gears.

He concluded by pointing out how this environment is free as it uses all XML technologies.

You can see some of Steve's videos at https://vimeo.com/user1776782

Magdalena Turska: A lesson in applied minimalism: adopting the TEI processing model

Turska presented a project that she is working on with James Cummings. This project aims to minimize the effort when publishing TEI. Most approaches take lots of scripts to process TEI XML. Their answer is the TEI Processing Model. This is an abstract model for describing processing expressing in ODD. You write your ODD (which itself is TEI?) and that guides the transformations. There are two implementations - she recommends the XQuery one.

She gave some examples starting with the Foriegn Relations of the United States journal. Then she showed SARIT that shows how they can handle non-European languages. They can also handle different types of output from PDF to html and custom outputs.

ODD saves time writing complex XSLT. The target is to give power to the editors. The TEI Processing Model Toolbox is available.

Tanya Clement: ARLO (Adaptive Recognition with Layered Optimization): a Prototype for High Performance Analysis of Sound Collections in the Humanities

Tanya Clement talked about the HiPSTAS project that is working with archives of spoken word (poetry readings, storytelling ...).

Clement talked about all the types of sound in the recordings they have. There are all sorts of features you can pull out. She talked about how hard it is to handle sound in research and teaching and one goal of the project is to make it easier for researchers and students to be able to access sound.

They have developed a tool ARLO (Adaptive Recognition with Layered Optimization) that works with spectograms of sound recordings. I am amazed how much information there can be in recordings. For example, the hum of background machines can tell you about time and place. With ARLO you can train a machine to recognize a pattern like applause. You can then classify snippets and ask for other similar collections. You can visualize applause happening in other recording.

Why would they look for applause in PennSound? How much of the research we do is based on the technologies at hand? How do you get people to think about what they don't know? Applause is often a delimiter between readings so it can be used to split longer recordings. It is also a signifier of the engagement between poets and their audience. She also talked about mistakes where, for example, a bag pipe was seen as applause.

She talked about how questions about men and women don't make for good cultural analysis. There are far more important variables. Reading series where poets and audiences form and maintain tastes. The venues make a difference. They seem to be more of a signifier of cultural changes. They found real difference in applause based on site.

She has a paper out in the new Journal of Cultural Analytics. See http://culturalanalytics.org/2016/05/measured-applause-toward-a-cultural-analysis-of-audio-collections/

Raffaele Viglianti: Music notation addressability

Viglianti talked about how text is a massively addressible data format. Dependent on sequentiality of text. We can address units like characters, but also more abstract units from words to speeches (especially when you have markup.)

Music is very different. It need lots of markup to be represented in a way that can addressed. To be able to addressed music we need to deconstruct the non-linear aspect of music into something linear that can be addressed the way text can. His project Enhancing Music Notation Addressability is thinking about this problem of addressing music notation useful. The result was a Music Addressibility API. He showed some examples of how one can address a particular part of a score with this. He discussed the International Image Interoperatbility Framework which is a similar project.

Why would this be useful? It can be used in citation. It can be used in a proces change in some larger analysis system. The idea is to allow us to do with music what we do with text.

To test this they are partnering with the Digital Du Chemin project which has incomplete scores. They were able to develop nonopublication assertions about the missing chunks through their addressing. They are also developing an implementation called Open MEI addressability service. He also talked about a project where scores deliberately quote each other. Using their addressing scheme they can accurately reference the quotations.

See http://bit.ly/EMAwhite for a white paper.

centerNet AGM

Katherine Walter chaired our Annual General Meeting.

Chelsea Miya presented about the New Scholars Symposium that took place before the conference (see top of this.)

We heard about DH centers in India and the challenges they face especially the challenges of Indian languages. We were asked if there would be support for a virtual center or distributed center.