philosophi.ca : Pathways To SEASR

In January of 2009 I attended the Tools for Data-Driven Scholarship meeting funded by Mellon and hosted by the NCSA.

Note: These notes were written during the conference. They are biased and only what I had the time to type. As a result, when things got interesting I stopped taking notes.

Day 1

Thursday, January 15, 2009.

Introductions

The first session included three introductions. Chris Mackie gave us a picture of what he hoped for SEASR and how it is different from other tool projects. For Chris the difference is sustainability and the sustainability of SEASR will come from its adoption by others. Hence the reason for Pathways is to get others (us) to learn about SEASR, try our ideas out, and then be willing to install SEASR locally and extend it. Mackie encouraged us to think of ourselves as co-owners, co-navigators, and co-developers. I suspect that eventually we will see a consortial model, possibly as part of Bamboo, but for the moment it seems less formal.

I have to say that sustainability is not yet the difference with other tool projects. SEASR has not yet proven itself sustainable, though they seem to have a good pathway mapped out. Others have talked about sustainability, but that doesn't make it so. Chris is right that this is an important issue and SEASR has many of the structural features needed, but it is now up to the community to dig in, use it, invest in it, and make it work. I should add that I am also not fond of the story that sees a landscape of moribund projects that were not sustainable (when they should have been). That is the concern of funders, but not true to the scholarship in humanities computing. I can trace a history from OCP to TAPoR where ideas and technologies evolved, often in response to new expectations of users. Others could probably do the same. Perhaps we have oversold tool projects as "the last tool project because mine will solve the tool problem for all time." Increasingly I am thinking of tools projects as a form of interpretation. That said, SEASR has the welcome feature that it can, like Jason's boat, be replaced bit by bit and still be a boat. It can reduce the reinvention for interpretation.

Michael Welge gave an overview of SEASR and the workshop. He expanded on the sustainability model talking about the three social elements in SEASR.

Meandre - the visual development environment in which people can build flows (tools?) out of components. Flows can themselves become components in other flows.
Community Hub - the online place where people can share components and flows to build their own.
ZigZag - a scripting language for developing flows.

Michael also described SEASR as a data-driven environment, by which they mean that the arrival of data triggers things (as in a data-flow environment.) I think there is a weak association to the hermeneutical principle that the text should drive the interpretation.
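To make the data-driven idea concrete for myself, here is a minimal Python sketch. It is not Meandre's API: the Component and Flow classes, the port names, and the word-count example are all invented for illustration. The only points it tries to show are that a component fires when its data arrives, and that a flow can be wired into another flow as if it were a single component.

```python
from collections import Counter
from typing import Callable, Dict, List


class Component:
    """Hypothetical processing step that fires once all its inputs have data."""

    def __init__(self, name: str, inputs: List[str], func: Callable):
        self.name = name
        self.inputs = inputs
        self.func = func
        self.pending: Dict[str, object] = {}
        self.listeners: List[Callable[[object], None]] = []

    def connect(self, target, port: str) -> None:
        # Route this component's output to one input port of another component.
        self.listeners.append(lambda value: target.push(port, value))

    def push(self, port: str, value: object) -> None:
        # The arrival of data triggers execution; there is no central scheduler.
        self.pending[port] = value
        if all(p in self.pending for p in self.inputs):
            args = [self.pending[p] for p in self.inputs]
            self.pending = {}
            result = self.func(*args)
            for listener in self.listeners:
                listener(result)


class Flow:
    """Hypothetical chain of components that behaves like one component."""

    def __init__(self, name: str, steps: List[Component]):
        self.name = name
        self.first, self.last = steps[0], steps[-1]
        self.inputs = self.first.inputs
        # Wire each step's output to the first input port of the next step.
        for upstream, downstream in zip(steps, steps[1:]):
            upstream.connect(downstream, downstream.inputs[0])

    def connect(self, target, port: str) -> None:
        self.last.connect(target, port)

    def push(self, port: str, value: object) -> None:
        self.first.push(port, value)


tokenize = Component("tokenize", ["text"], lambda text: text.lower().split())
count = Component("count", ["tokens"], lambda tokens: Counter(tokens))
word_count = Flow("word_count", [tokenize, count])

results = []
sink = Component("sink", ["counts"], results.append)
word_count.connect(sink, "counts")

# Nothing runs until data arrives at the head of the flow.
word_count.push("text", "the text should drive the interpretation")
print(results[0].most_common(1))
```

Because Flow exposes the same push and connect methods as Component, a flow can be dropped into a larger flow without the rest of the wiring knowing the difference, which is my reading of what "flows can become components" means.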

Zotero

Xavier Llora gave an interesting demo of a proof-of-concept using Zotero to connect to SEASR. I love the idea of plug-ins to Zotero. The idea is that you can create an application in SEASR that can be a plug-in to Zotero that you can then run on any text (or collection) saved. Zotero can thus become an interface to other tools and one could build a set of favorite tools from various places.

Xavier also talked about a connection with Fedora which I believe allows one to save a collection from Zotero to a repository that could then be processed by SEASR.

UIMA

UIMA stands for Unstructured Information Management Architecture and is an IBM analytical engine for unstructured data from phone conversations to e-mail. They gave an example of a flow of UIMA modules (chains) for part-of-speech tagging. They describe the flow (chain?) in XML. It seems to me similar to SEASR, but without the visual flow programming environment. I'm assuming that UIMA has a whole mess of modules that are useful to SEASR like "sentiment analysis". The presenter gave an example using Mark Twain. He joked that it was, "How to cheat English literature with computer science."

Next we saw an example of sentiment analysis using SYNnet, a tool that uses synonym connections using a thesaurus. I like the idea of using a thesaurus to map words so you can find a sentiment like "joy" by finding the synonyms.
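As a rough illustration of the thesaurus idea (and not of how the tool we were shown actually works), here is a small Python sketch. The tiny thesaurus and the sentiment_counts function are my own inventions; a real system would load a full thesaurus and would need to handle inflected forms and negation.

```python
import re
from collections import Counter

# Toy thesaurus invented for illustration; each sentiment maps to its synonyms.
THESAURUS = {
    "joy": {"joy", "delight", "happiness", "glee", "pleasure"},
    "sorrow": {"sorrow", "grief", "sadness", "woe", "misery"},
}


def sentiment_counts(text, thesaurus=THESAURUS):
    """Count how many words in the text are synonyms of each sentiment."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for sentiment, synonyms in thesaurus.items():
        counts[sentiment] = sum(1 for word in words if word in synonyms)
    return counts


print(sentiment_counts("Her delight turned to grief, then glee."))
```

The sentence above never uses the word "joy", yet it scores two hits for joy (delight, glee) and one for sorrow (grief), which is the appeal of mapping words through a thesaurus.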

The point was to show how one can integrate other tools into SEASR just as the Zotero demo showed how SEASR can be integrated into other things.

NESTER

Steve Downie demoed NESTER (Networked Environment Sonic-Toolkits for Exploratory Research) and NEMA (Networked Environment for Music Analysis). Before that he gave some background on the MIREX evaluation exchange and the model of having virtual labs that can analyze proprietary datasets (i.e., lots of copyrighted music) without giving researchers direct access. The idea is really smart: people can submit algorithms that can be plugged into a SEASR framework that are then run behind the copyright firewall here and the results sent to the researchers. I think Steve has demonstrated the value of M2K (built on D2K, the predecessor to SEASR). I have seen Steve's demos, for example at the SHARCNET workshop on humanities and HPC, and the link between the society (IMIRSEL), the exchanges, and the tools is compelling. Here he showed slides from a bird song project where the system can be trained to recognize songs. SEASR gives them the ability to put together web services in the visual programming environment.

Meandre

In the category of great names for software is "Meandre", which doesn't stand for anything, but does sound like what you can do with it: meander through ideas of text. It is first of all a data-flow visual programming environment for SEASR (and other) components. It implements the idea John Bradley and I had in Eye-ConTACT. What is important is the standardization of how we define a component and how we define a flow. In principle Meandre could disappear over time if the standardization is done right. RDF is used to describe flows and share them. Then one can do reasoning on top of this.

Meandre's metadata is an important development that builds on Dublin Core (for texts) adding component flow descriptions. Stéfan was working on TAML and TARL to do things like this.

You can experiment with Meandre at http://seasr.org/download/ . This worked quite smoothly on my Mac and I was impressed. I was up and generating word clouds before the demo was over.

ZigZag is a scripting language based on Python that lets one write flows and then run them elsewhere. ZigZag has automatic parallelization if you have access to a cluster.

Both Meandre and ZigZag output the RDF descriptors of a flow. One way to think of these is that one is a visual programming environment and one is a scripting programming environment. The RDF descriptor is presumably then run by some engine. They have a MAU file (Meandre Archive Unit) that bundles the flow and components together into executables.

Community Hub

Loretta Auvil talked about the community hub where components and flows can be discovered, shared and executed. The community hub has some ManyEyes features, but isn't quite working yet.

Then things got really technical as we were walked through running Meandre.

Eclipse Plug-In

Amit Kumar introduced an Eclipse Plug-In for managing components on the server and creating new ones.

Adoption

John Unsworth talked about adoption. The Pathways project is to help us adopt. John went on to talk about the HathiTrust. They hope to have at UIUC a repository of texts from the HathiTrust, Google Books and so on (which would be millions of books). This captive collection may be usable only at UIUC or through some trusted mechanism. The idea is that this captive collection will need various tools to be accessible and SEASR could be the way cool tools are developed.

Day 2

MONK

Stéfan Sinclair demonstrated MONK (Metadata Offers New Knowledge). MONK offers a more accessible interface to SEASR for academics. They have also been adapting SEASR so that it can be combined with other projects into applications.

Tools get gathered into Toolsets that can be gathered into an Application. The applications look like what you would share with colleagues, but even the Workbench is usable.

Part of MONK is a preprocessing system that gets the XML texts into a common format and "adorns" them with part-of-speech tags. The philosophy of MONK is to pay attention to metadata and encoding of texts in order to get better knowledge from analysis.

MONK demonstrates the sustainability model of SEASR which is to define the articulation of components and applications. SEASR proposes a model for how research tools for rich media should be thought through. Any one element can be replaced over time. SEASR could even disappear over time, but the model and the interfaces remain. Over the two days I've seen different projects use different parts of SEASR. SEASR seeds the model with components, a server engine and different application building environments. Any of these could probably be replaced over time and by particular projects. The question is whether the articulation model is right. SEASR will have got it right if tool developers can work with it, easily reuse the components they need (rather than writing them) and quickly build new components that work for others. That's what we are here for, to use SEASR and see how easy it is to use and to then help them improve it. I think we are in the formative phase of helping SEASR

FutureLens

FutureLens is based on FeatureLens, a visualization tool from Maryland. One neat feature is the ability to combine terms. If you see two patterns that are the same phenomenon you can combine them and then see the distribution of the combination. Thus it lets you create clusters of patterns into themes.
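The combining idea is easy to sketch in Python. This is only my guess at the logic, not FutureLens code: the function names are invented, the "patterns" are reduced to single words, and the distribution is simply a count per document.

```python
import re
from typing import Dict, List, Sequence


def term_distribution(documents: Sequence[str], terms: Sequence[str]) -> List[int]:
    """Count, per document, how many tokens match any of the given terms."""
    wanted = {term.lower() for term in terms}
    distribution = []
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        distribution.append(sum(1 for token in tokens if token in wanted))
    return distribution


def combine(themes: Dict[str, List[str]], name: str, *parts: str) -> None:
    """Merge terms (or previously combined themes) into one named theme."""
    merged: List[str] = []
    for part in parts:
        # A part is either an existing theme name or a plain term.
        merged.extend(themes.get(part, [part]))
    themes[name] = merged


docs = [
    "The ship sailed at dawn.",
    "A boat and a ship left port.",
    "No vessels today.",
]
themes: Dict[str, List[str]] = {}
combine(themes, "watercraft", "ship", "boat")

print(term_distribution(docs, ["ship"]))
print(term_distribution(docs, ["boat"]))
print(term_distribution(docs, themes["watercraft"]))
```

Since combine accepts existing theme names as parts, themes can be merged again into broader themes, which is how I imagine clusters of patterns growing into themes.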

VUE

Anoop Kumar of the Visual Understanding Environment team showed their project which allows one to create visual concept maps that can then be used for presentations. His presentation was created with VUE and he basically ran an animated step through his map. Very neat.