
Mind The Gap: A Multidisciplinary Workshop Bridging The Gap Between High Performance Computing And The Humanities

These notes are for a workshop organized at the University of Alberta around humanities research and high-performance computing. They are being written on the fly, so the notes are not complete and will be full of typos. The Twitter tag is #mindgap. The web site for the workshop is at http://ra.tapor.ualberta.ca/mindthegap .

The University of Alberta has an Express News story up.

Day 1, Monday May 10th.

Introductions

Here is what I wrote to introduce the purpose of the workshop.

  1. First of all, we should think of this as an unworkshop. It has not been organized to train you, but to share the expertise we have across various boundaries. I hope this will be a chance for folk in the HPC world to get a sense of what our research questions are, and for us to get a sense of the opportunities of HPC. To that end we have lots of time to try alternative things. If there is something we want to discuss as a whole group or in smaller groups, or if there is training we want, there is lots of time and we can organize it.
  2. A second purpose is to bring us together with generous time to talk. For too much of the year we are too busy to have sustained conversations. I hope you will use the breaks and the team times to develop research questions and ideas that can be articulated through innovative computing. My hope is that you will use this week to advance your projects and possibly to articulate new ones.
  3. A third purpose is to have time to hack together, even for those who don't program. This is a time to prototype. To that end I hope the teams, whatever else you do, will think about how to prototype something that can be shown and discussed on Thursday afternoon.
  4. A fourth purpose is to see if we can develop a shared agenda for research in the humanities and HPC. We have the attention and support of two of the mature HPC consortia in Canada. In my conversations it is clear that they genuinely want to support research in the humanities, but we have to learn what HPC facilities can do and begin to talk about collaboration. I hope by the end of the week we can have a conversation about what opportunities there are and what is needed, so that we can work toward larger initiatives. To that end the university CIO Jonathan Schaeffer will talk to us on Friday, and Stan Ruecker will facilitate. My goal is to be able to write an open white paper that can be shared later.

Masao Fujinaga: Introduction to HPC at Alberta

Fujinaga took us through connecting to the wireless and to Checkers (one of the WestGrid machines). His slides are at http://hypatia.cs.ualberta.ca/~msmit/MtGIntro.pdf

Teams

Much of the time is being spent in research teams working on projects:

  • Criminal Intent: Datamining the Old Bailey
  • Linking Censuses
  • CWRC and Orlando: Searching large tagged data
  • The Epidemiology of Ideas
  • Text Analysis in the Cloud

Day 2

What to do with too many facts?

The "shirt" cluster of words is used more often in 1890s than in the 1840s and more often in the 1840s than in the 1800s.

Patrick Juola and Steve Ramsay generated 87,000 new facts about Victorian literature using the WestGrid HPC Checkers cluster. The question now is what to do with them. What do the facts look like? Here are some about "shit" and "shirt":

38.479209370444146,1.0,shit,1850-1859,1820-1829,1.0
38.479209370444146,1.0,shit,1850-1859,1820-1829,-1.0
24.317031460379905,1.0,shit,1850-1859,1840-1849,1.0
125.76602495551391,1.0,shit,1840-1849,1810-1819,-1.0
147.90469814579023,1.0,shit,1860-1869,1830-1839,-1.0

67.49110217920969,1.0,shirt,1840-1849,1800-1809,-1.0
67.07130296769121,1.0,shirt,1890-1900,1840-1849,-1.0

What are these categories? Here are the thesaurus definitions for the two word categories:

shirt,O,basque,blouse,bodice,body shirt,body suit,coat shirt,corsage,dickey,doublet,dress shirt,evening shirt,gipon,habit shirt,hair shirt,halter,hickory shirt,jupe,linen,polo shirt,pullover,shift,tank top,top,waist
shit,Amytal,Amytal pill,BM,Demerol,Dolophine,H,Luminal,Luminal pill,M,Mickey Finn,Nembutal,Nembutal pill,SOB,Seconal,Seconal pill,Tuinal,Tuinal pill,a continental,a curse,a damn,a darn,a hoot,alcohol,amobarbital sodium,analgesic,anodyne,asshole,bagatelle,baloney,barb,barbiturate,barbiturate pill,bastard,bauble,bean,bibelot,bilge,bit,black stuff,blah,blah-blah,bloody flux,blue,blue angel,blue devil,blue heaven,blue velvet,bop,bosh,bowel movement,brass farthing,buffalo chips,bugger,bull,bullshit,bunk,bunkum,button,ca-ca,calmative,catharsis,cent,chloral hydrate,codeine,codeine cough syrup,coprolite,coprolith,cow chips,cow flops,cow pats,crap,creep,cur,curio,defecate,defecation,dejection,depressant,depressor,diarrhea,dingleberry,dog,dolly,downer,droppings,dung,dysentery,evacuate,evacuation,farce,fart,farthing,feather,feces,feculence,fig,flapdoodle,fleabite,flux,folderol,fribble,frippery,gas,gaud,gewgaw,gimcrack,goofball,guano,guff,gup,hair,halfpenny,hard stuff,heel,heroin,hill of beans,hogwash,hokum,hood,hooey,hooligan,hop,horse,hot air,hypnotic,jakes,jerk,jest,joke,junk,kickshaw,knickknack,knickknackery,knockout drops,laudanum,lientery,liquor,loose bowels,lotus,louse,malarkey,manure,meanie,meperidine,methadone,minikin,mockery,molehill,moonshine,morphia,morphine,mother,movement,narcotic,night soil,number two,opiate,opium,ordure,pacifier,pain killer,paregoric,pen yan,peppercorn,phenobarbital,phenobarbital sodium,picayune,piffle,pill,pin,pinch of snuff,pinprick,poppycock,prick,purgation,purge,purple heart,quietener,rainbow,rap,rat,red,red cent,rot,row of pins,rubbish,runs,rush,scag,scat,scum,secobarbital sodium,sedative,sewage,sewerage,shithead,shitheel,shits,skunk,sleep-inducer,sleeper,sleeping draught,sleeping pill,smack,snake,snap,sneeshing,sodium thiopental,somnifacient,soother,soothing syrup,soporific,sou,stinkard,stinker,stool,straw,take a shit,tar,toad,tommyrot,toy,tranquilizer,trifle,trinket,tripe,triviality,trots,tuppence,turd,turistas,turps,two cents,twopence,void,voidance,whim-wham,white stuff,wind,yellow,yellow jacket

Obviously the category for "shit" is a broad one that includes drugs. How then can we drill down on these "facts" to figure out what they mean and whether they are significant? And ... what do we do with 87,000 such "facts"?
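One starting point is to triage the pile mechanically. Here is a minimal Python sketch that parses records like the ones above and sorts them by the leading statistic so the strongest "facts" surface first. The field meanings (a test statistic, a flag, the word cluster, two decades, a direction) are my guesses from the samples, and "facts.csv" is a hypothetical dump of the 87,000 records.

    import csv
    from collections import namedtuple

    # Guessed field layout, based on the sample records above
    Fact = namedtuple("Fact", "statistic flag cluster decade_a decade_b direction")

    def load_facts(path):
        """Parse one comma-separated fact per line into Fact tuples."""
        with open(path) as f:
            for row in csv.reader(f):
                yield Fact(float(row[0]), float(row[1]), row[2],
                           row[3], row[4], float(row[5]))

    # Triage: look at the strongest effects first
    facts = sorted(load_facts("facts.csv"), key=lambda fact: fact.statistic, reverse=True)
    for fact in facts[:20]:
        print(fact)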

Robyn Taylor: Exploring Human-Computer Interaction through Performance Practice

Robyn Taylor of the Advanced Man-Machine Lab in CS at the University of Alberta is interested in generating new design ideas for participatory performance in public spaces. She wants to generate ideas for shared visual spaces like store fronts. The logic of her research is:

  • How can we learn from performance?
  • How can we apply performance ideas to participatory performance?
  • How can we apply these ideas to participatory interfaces?

All performance is to some extent interactive, but there is a real difference between an audience that observes but can't make much of a difference, and participants who can change the performance.

Taylor then talked about flow as a goal. She is following Gaver in designing for ludic aspects of everyday life.

She then showed a series of experiments:

  • Dream Medusa was created for Nuit Blanche in Toronto. The metaphor was lucid dreaming. Up to four participants had Wiimotes (in silver tubes) that affected the visualization (of jellyfish). Participants weren't told what their tube would affect - they had to try things. In the video she showed, you can see participants trying to figure out what they control.

The issue then was to figure out pragmatically how participants experienced the performances. They wanted to preserve the artistic performance while still evaluating engagement and experience. Taylor had a nice theoretical framework and then asked questions after authentic performances. They found that participants were quite willing to talk about their insecurities, performers' anxieties, and camaraderie. They were aware of being observed. They wanted to collaborate but didn't know how.

  • humanaquarium resulted from this first set of tests and starts moving towards a responsive window. Two performers sit in a large aquarium-like box. Touching the box then controls aspects of the performance. Participants could control a lot more in this experiment. They will show it at the Banff Centre in June.

They make it very obvious what touching affects, and this seems to work. The box reduces stage fright as it presents itself as a busking experience. They used an iterative design process, with a series of performances and time to change things between them.

Patrick Juola: Computers, Conjectures, and Creativity

Patrick started with "It was a dark and stormy night" and the question of where this line originated. When did the line become a cliché? What types of authors use it?

How do we answer these questions? We read a lot and then write a paper. This obviously won't work with a million books. Computers, as many have commented, can do things with a million books. Search is one, but search doesn't help unless you know exactly what you want. You can't find a hypothesis you aren't looking for: when we do analysis we search for what we expect, so we probably won't find what we are not looking for.

There are examples of tools in the sciences that use the computer for automatic hypothesis generation. Graffiti is one example; it generates mathematical conjectures that can then be tested.

Patrick's Conjecturator tool, like Graffiti, generates hypotheses for literary work. It takes a large corpus of texts and a thesaurus of clusters of words, and then tests different combinations (as in words about shirts in the 1840s vs. the 1850s; see the sketch after the Twitter link below). Some examples:

Male Authors use Animal Terms more/less than Female Authors
Poems use "From" more than Prose

You can see conjectures being generated on Twitter at http://www.twitter.com/conjecturator
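To make the generate-and-test loop concrete, here is a toy sketch (my own illustration, not Juola's code): pick a word cluster and two decades at random, compare the cluster's rate per million words, and emit a conjecture when the difference looks large. The thesaurus, the counts, and the "interesting" threshold are all placeholder assumptions.

    import random

    # Placeholder data: cluster -> member words; decade -> word counts plus a total
    thesaurus = {"shirt": ["shirt", "blouse", "doublet"], "shit": ["dung", "ordure"]}
    counts = {"1840-1849": {"shirt": 120, "_total": 1000000},
              "1890-1900": {"shirt": 310, "_total": 1200000}}

    def rate(cluster, decade):
        """Occurrences of a cluster per million words in a decade."""
        c = counts[decade]
        hits = sum(c.get(w, 0) for w in thesaurus[cluster])
        return hits * 1000000.0 / c["_total"]

    def conjecture():
        """Generate one random conjecture and test it; None if nothing interesting."""
        cluster = random.choice(list(thesaurus))
        d1, d2 = random.sample(list(counts), 2)
        r1, r2 = rate(cluster, d1), rate(cluster, d2)
        if max(r1, r2) > 2 * min(r1, r2) + 1:  # crude stand-in for a real test
            high, low = (d1, d2) if r1 > r2 else (d2, d1)
            return "'%s' words are used more in %s than in %s" % (cluster, high, low)
        return None

A real run tests many statistical hypotheses with proper corrections; this sketch only shows the generate-and-test shape.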

As mentioned above, Patrick adapted the Conjecturator to run on WestGrid overnight and it generated 87,000 "facts" about 19th century literature after testing many more hypotheses. The problem is now what to do with that many facts.

These facts are not necessarily interesting. What do you do to find the hermeneutically interesting facts in the result set?

From there Patrick talked about our project on the epidemiology of ideas. The conjecturator, if we had a large corpus of articles across humanities disciplines, might be able to generate interesting "facts" about the history of ideas.

Afterwards we got an email that the Conjecturator used 184 days of processing!

Day 3

Visit to AICT facilities

We visited the visualization labs, the machine room and the 3D printing area to give us ideas of how we can use WestGrid facilities beyond just HPC. We saw Checkers and other servers (like TAPoR) in the machine room.

See photos on Flickr: http://www.flickr.com/photos/geoffreyrockwell/sets/72157623924208575/

Paul Lu on MapReduce and Hadoop

MapReduce is a specialized programming model/approach. Hadoop is an implementation of MapReduce. The key ideas are:

  • Data is partitioned across a cluster
  • First phase of computation is a map() function
  • The map is pipelined into a reduce() function

The idea is that you hand the system a very large amount of data; the system figures out how to split it up, applies the map function to each partition (map phase), and then combines the results into an output (reduce phase).
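Production Hadoop jobs are usually written in Java, but the shape of the model fits in a few lines. Here is the canonical word-count job sketched with the Python mrjob library (one common way to write Hadoop jobs; the input file is whatever you pass on the command line):

    import re
    from mrjob.job import MRJob

    WORD_RE = re.compile(r"[\w']+")

    class MRWordCount(MRJob):

        def mapper(self, _, line):
            # Map phase: runs in parallel on each partition of the input
            for word in WORD_RE.findall(line):
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Reduce phase: all the counts for one word arrive together
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Saved as wordcount.py, this runs locally with "python wordcount.py corpus.txt" and on a Hadoop cluster with the -r hadoop option.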

Paul Lu, Cloud Computing and HPC

Paul Lu gave an excellent overview of what cloud computing is.

Computing or data resources provided by a third party and usually accessed over a network. Examples would be Gmail, Hotmail, Google Docs, Facebook, and Salesforce.

A cloud service is also an "application service provider." In some ways it is computing over the web. (The web is becoming an operating system.)

There are three kinds of cloud computing:

  • Software-as-a-service (SaaS): Gmail and Google Docs would be examples.
  • Platform-as-a-service (PaaS): Google App Engine would be an example. (I wonder if we should be building text analysis tools on Google Apps.)
  • Infrastructure-as-a-service (IaaS): Amazon Elastic Compute Cloud (EC2) or Amazon S3 (Simple Storage Service) would be examples.

We had an interesting conversation about whether researchers should be moving to getting funding for cloud computing rather than buying hardware.

Paul then asked the question, "Isn't this just client-server?" Yes, but there is more, including service-level agreements, web services, virtual machines, and virtual networks.

Another way of thinking of a cloud is that at the back end it has virtualization. Then there are web services built over that.

Paul then compared Traditional HPC and Cloud Computing:

Traditional HPC is batch scheduled. It is big computation, big memory, big storage and big I/O. Cloud computing is interactive and on demand. It has variable amounts of computation, memory and networking.

Traditional HPC is good for simulations, data streams and big-question research. Cloud computing is suited to the humanities, portals and iterative questioning.

Paul then talked about Humanities and the Cloud. Here are some of the issues:

  • Skills: Software development is a bottleneck for humanists
  • Usage Modes: The humanities typically need more cloud-like services
  • Software Tools: Humanists often need different software components

Paul then talked about how to add cloud to HPC. Some ideas:

  • HPC facilities can add a cloud-like service by setting aside some servers for it. Sharcnet did this, but people worry about letting cycles go unused, which irritates traditional users.
  • Go with Amazon. This might have problems around sensitive data and, of course, funding issues.
  • Develop new infrastructure (VMs, metascheduler) - come up with better schedulers that can mix cloud and batch allocation.

Paul believes that academic consortia ultimately cannot compete with Amazon EC2 on cost. There are pros and cons to the all-you-can-eat model vs. the pay-as-you-go model.

Day 4

Stephen Ramsay, Knowing It When You See It: Humanistic Inquiry in an Age of Big Data

Steve started with some slides, including one of a "Wunderkammer". He presented these Wonder Rooms as an image of the arrangement of knowledge with man at the centre. He then connected that to what we know and don't know (and don't know we don't know). See Schwartz on what we don't know. For Schwartz, education is about moving things from what we don't know we don't know to the category of what we don't know.

Steve then connected this to reflections on Matthew Arnold's Culture and Anarchy and Arnold's idea that "we" are concerned with the "best which has been thought and said". Imagine if the wunderkammer had everything (not just the representative or the best) - that is the problem of big data. What do we do with all of human thought and expression? HPC in the humanities is about learning what we don't know we don't know from all the data.

Steve proposed three things that we need:

  1. Playful Systems. When we have over 10 million books we need to embrace the playful. We need systems that help us trip.
  2. Asking the river to stop so we can step in it twice. The anxieties of the humanities about too much data are nothing compared to the anxieties of librarians about the stability of data. We need to develop systems that address the stability of the data we work with.
  3. Backward Systems. We don't know what we are looking for. We are supposed to develop a hypothesis and then test it; we are not supposed to invent instruments and then ask what they can tell us. But any problem you can state is outside the realm of what you don't know you don't know, which is why we need to be open to new methods and instruments that we play with until we see something new.

The gap that should concern us is that between the rhetoric of solving problems and what is happening with HPC, for both humanists and others. It is not just humanists who are struggling with the utilitarian rhetoric of big problems. In many ways we need to acknowledge the importance of playing with supercomputing.

The goal of the humanities is to keep the conversation going, not to get results. You can see how the humanities will never satisfy those who ask what use our learning is. Our learning just leads to awareness of more ignorance. It is worse than not advancing knowledge - it critiques what we thought we knew to the point where there is nothing left. Why fund that?

Megan Meredith-Lobay, GIS

A geographic information system captures, stores, and analyzes data linked to location. It lets you study patterns in the form of maps: it shows you where things are taking place. With GIS you can have both raster information (gridded data, like scanned maps or satellite imagery) and vector information (geometric data, like points, lines and polygons).

The key to GIS and the humanities is "contextualization". GIS lets us see context for data.
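As a small illustration of the vector side, here is a hedged Python sketch using the geopandas library (one toolkit among many; "trials.shp" is a hypothetical layer of georeferenced records):

    import geopandas as gpd
    import matplotlib.pyplot as plt

    # "trials.shp" is a hypothetical vector layer, e.g. georeferenced trial records
    gdf = gpd.read_file("trials.shp")

    # Reproject to a common coordinate reference system so layers line up
    gdf = gdf.to_crs(epsg=4326)

    # Plotting the points is the contextualization step: where is this happening?
    gdf.plot(markersize=5)
    plt.show()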

We discussed when to use GIS and when to use mashups with services like Google Maps.

There seems to be a difference between the GIS community and the literary computing community. The GIS folk are using commercial software on PCs; the literary computing community has moved to open source on the web.

Presentations

Linking Censuses

The LC project (Andrew Ross) is trying to link people between one census and another even though names change, people move, and so on. They have many censuses from Canada and the US. They want to link across them so we can see, for example, people immigrating from Canada to the US. Automatic linkage runs between two census datasets, going through all names and trying to classify whether each pair is a match or not. This is very computationally expensive. They get 8-22% linkage depending on the datasets. They are getting too many false positives. One question is whether they can map the results with GIS to see if there are a lot of holes.
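A minimal sketch shows why this is expensive and how "blocking" tames it: without blocking, linking two censuses is a full cross-product of records. This is my illustration, not the project's code; the field names, similarity measure, and threshold are all assumptions.

    from collections import defaultdict
    from difflib import SequenceMatcher

    def sim(a, b):
        """Crude string similarity in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def block_key(rec):
        # Only compare records that share a surname initial and birth decade
        return (rec["surname"][:1].lower(), rec["birth_year"] // 10)

    def link(census_a, census_b, threshold=0.85):
        """Yield candidate links between two censuses for human review."""
        blocks = defaultdict(list)
        for rec in census_b:
            blocks[block_key(rec)].append(rec)
        for a in census_a:
            for b in blocks.get(block_key(a), []):
                score = (sim(a["surname"], b["surname"]) +
                         sim(a["given"], b["given"])) / 2
                if score >= threshold:
                    yield a, b, score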

Lessons Learned

Dave McCaughan from Sharcnet shared some thoughts about lessons learned:

  • Humanists have a compelling argument for speed based on interpretative time.
  • Time and Space. When is there a need for HPC? Is it parallelism or just speed that is needed? The demands for space can be far more complex, as humanists have lots of data.
  • Humanities is process focused rather than results focused
  • Flexibility is key - HPC folk need to listen to humanists
  • There is a subtle issue about the tools that support human interaction - respect the tools of the humanists

Dave also showed a tool that lets them check questionable links and approve them.

Text Analysis in the Cloud

Mike Smit presented on a project to make analytics more efficient on the web. They experimented with indexing and using cloud computing. They found that after two queries, things are faster on the indexed cloud than when running each tool again. They are now working on very large text datasets (gigabyte-sized).
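The "faster after two queries" result is the classic indexing trade-off: you pay the indexing cost once, and every later query is a cheap lookup. A toy inverted index in Python shows the shape of the idea (my illustration, not Smit's implementation):

    from collections import defaultdict

    def build_index(docs):
        """docs: {doc_id: text}. One pass over the corpus, paid once."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def query(index, *terms):
        """Documents containing every term: cheap set intersections."""
        postings = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    index = build_index({1: "Old Bailey trial", 2: "trial record"})
    print(query(index, "trial"))  # {1, 2}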

One of the things they are doing is building simple operations that can then be composed in different ways. He then showed a recommendation engine demo that is one way of composing operations. Extremely neat!

Old Bailey

Jamie McLaughlin was the first to present on the Criminal Intent project that is datamining the Old Bailey corpus. He talked about the dataset and what we are doing with it and Voyeur and Zotero. He talked about the place of an API in this project so that Voyeur and Zotero can be passed data.

Stéfan Sinclair then talked about computing Normalized Compression Distance (NCD) between documents. NCD is one measure of similarity between documents. He showed what was done in Voyeur and showed a visualization of a small subset of the documents. Old Bailey has 200,000 trial records, which means 40 million NCD tuples, which would take a long, long time. He is now implementing parallelized code to run on 100 cores.
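NCD has a standard definition: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. Here is a minimal Python version using zlib (the choice of compressor is my assumption; the notes don't say which one was used):

    import zlib

    def clen(data):
        """Compressed length of a byte string: the C() in the NCD formula."""
        return len(zlib.compress(data, 9))

    def ncd(x, y):
        """Normalized Compression Distance: near 0 = similar, near 1 = unrelated."""
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    print(ncd(b"the prisoner stole a silver watch", b"the prisoner stole a silver spoon"))

Since each pairwise NCD is independent of the others, the computation parallelizes trivially, which is why a 100-core run makes sense.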

Sajib Barua has been doing a number of experiments with ways to datamine the data. One approach he prototyped was to use a data warehouse paradigm that uses the tag information to build comparative graphs. He showed examples using the Old Bailey data in Tableau. A second approach he tried is Frequent Itemsets (FI). Again he adapted the data and ran it to generate some example FIs like "silver peices money called king". The last approach is to try clustering trials using the FIs.
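For intuition, here is a brute-force sketch of frequent itemset counting; a real implementation would use Apriori or FP-growth, and the sample data here is made up.

    from collections import Counter
    from itertools import combinations

    def frequent_itemsets(transactions, size=2, min_support=2):
        """Count item sets of a given size that co-occur across transactions."""
        counts = Counter()
        for items in transactions:
            for combo in combinations(sorted(set(items)), size):
                counts[combo] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # Each "transaction" might be the set of tagged terms in one trial record
    trials = [{"silver", "watch", "theft"}, {"silver", "theft"}, {"watch", "coat"}]
    print(frequent_itemsets(trials))  # {('silver', 'theft'): 2}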

Epidemiology of Ideas

Patrick Juola talked about the application of the Conjecturator to various datasets. The Conjecturator is easy to parallelize. He used about a year of compute time on Orlando, Victorian Novels, and on the Old Bailey trial records. The big problem is the large number of assertions that the Conjecturator generates. What do we do with the pile of facts? He showed a visualization that we can use to look at the assertions.

Orlando

Susan Brown talked about experiments to improve XML (actually SGML) searching of an 8-million-word corpus. They are trying to provide users with a much smarter interface to Orlando. One thing they are doing is trying clustering tools to see if they can help people navigate results. They got a nice clustering of novelists, which was visualized in the Mandala browser.
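The notes don't say which clustering method was used, so here is only a generic sketch of the approach (TF-IDF vectors plus k-means, via scikit-learn):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def cluster_texts(texts, k=10):
        """Group documents by word-usage similarity; one cluster label per text."""
        vectors = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(texts)
        return KMeans(n_clusters=k, random_state=0).fit_predict(vectors)

    # Hypothetical usage: one Orlando entry per novelist
    # labels = cluster_texts(orlando_entries, k=10)

The labels could then feed a visualization like the Mandala browser.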

Day 5

Jonathan Schaeffer on HPC and Humanities

Jonathan talked about where we are today and what is right and wrong about Compute Canada. Compute Canada (Calcul Canada) is a national HPC platform.

  • 1997: C3.ca created; MACI (Alberta)
  • 1999: First CFI awards
  • 2002: WestGrid 1 (Alberta and BC)
  • 2005: C3.ca Long Range Plan for HPC
  • 2005: WestGrid 2 (Western Canada)
  • 2006: NPF-1 application successful; Compute Canada formed

Compute Canada comprises 61 institutions across Canada, with over 1,100 PIs and their associated research assistants. Usage is dominated by the sciences: for example, Physics 24%, Biochemistry 19%, Chemistry 18%, and Engineering 16%.

Compute Canada got about $150 million, which seems like a lot, but there hasn't been refresh funding. CC is operating-poor: good projects should have another 50% to 100% of their funding for operations, and CC doesn't have that - they are underfunded at the moment. CC has all this terrific equipment but not enough operating funds to really take advantage of it.

CFI projects are very hard to run with so many players, as progress bottlenecks around the slowest.

There are many successes:

  • Creation of CC is a milestone
  • Unprecedented HPC funding (for Canada)
  • National and international collaboration
  • Great research being done
  • A sense of community

The call for NPF-2 is coming. It will probably take until 2014 before anything can be bought. Planning is underway.

CFI has been blunt about things that need to be done better:

  • CC is not as "national" as it could be.
  • Need to engage "non-traditional" disciplines from medicine to the humanities.
  • Diluting money over multiple sites. Do we really need a site at every institution?
  • Too much money spent on renovations. Do we really need geographical proximity to equipment resources?

NPF-2 will not be limited to HPC. CC will be in competition with many other national projects. Digital humanists should articulate our needs so that CC can work with us.

Computing needs to be talked about from source to destination - from acquisition of data to dissemination of research. We need to invest in people, in cloud computing, and in visualization.

Final Discussion

We had a final discussion on the needs of computing humanists and ways of engaging the CC community. The final report with recommendations will be put up once written.


Possible recommendations

Here are some possible recommendations coming out of the workshop.

  • Many digital humanities projects are programmed in Java which is considered inefficient for HPC systems. Is it worth our while to develop a library of basic operations in C for HPC applications?
  • Many projects are interested in moving to Hadoop. How can an HPC consortium provide a Hadoop cluster for DH projects? Sharcnet is offering a cluster for humanists (and others) to use. The question is whether there is enough real need for similar functionality in WestGrid.
  • We should be developing training with AICT that bridges the language gap between the communities.
  • We should be looking at Eucalyptus - open alternative middleware that offers the same API as Amazon's, so one can start on a local server and then move to Amazon.
  • Every interface is an argument as Laura Mandell has argued. Interfaces to HPC results are thus important which is probably why visualization is often connected with HPC.
  • We want to bring back the term "supercomputer". It is more accurate and playful than HPC.
