Mind The Gap: A Multidisciplinary Workshop Bridging the Gap Between High Performance Computing and the Humanities

These notes are for a workshop organized at the University of Alberta around humanities research and high-performance computing. They are being written on the fly, so the notes are not complete and will be full of typos. The Twitter tag is #mindgap. The web site for the workshop is at http://ra.tapor.ualberta.ca/mindthegap . The University of Alberta has an Express News story up.

Day 1, Monday May 10th

Introductions

Here is what I wrote to introduce the purpose of the workshop.
Masao Fujinaga: Introduction to HPC at Alberta

Fujinaga took us through connecting to the wireless and to Checkers (one of the WestGrid machines). His slides are at http://hypatia.cs.ualberta.ca/~msmit/MtGIntro.pdf

Teams

Much of the time is being spent in research teams working on projects:
Day 2

What to do with too many facts?

The "shirt" cluster of words is used more often in the 1890s than in the 1840s, and more often in the 1840s than in the 1800s.
Patrick Juola and Steve Ramsay generated 87,000 new facts about Victorian literature using the WestGrid HPC Checkers cluster. The question now is what to do with them. What do the facts look like? Here are some about "shit" and "shirt":

    38.479209370444146,1.0,shit,1850-1859,1820-1829,1.0
    38.479209370444146,1.0,shit,1850-1859,1820-1829,-1.0
    24.317031460379905,1.0,shit,1850-1859,1840-1849,1.0
    125.76602495551391,1.0,shit,1840-1849,1810-1819,-1.0
    147.90469814579023,1.0,shit,1860-1869,1830-1839,-1.0
    67.49110217920969,1.0,shirt,1840-1849,1800-1809,-1.0
    67.07130296769121,1.0,shirt,1890-1900,1840-1849,-1.0

What are these categories? Here are the thesaurus definitions for the two word categories:

shirt,O,basque,blouse,bodice,body shirt,body suit,coat shirt,corsage,dickey,doublet,dress shirt,evening shirt,gipon,habit shirt,hair shirt,halter,hickory shirt,jupe,linen,polo shirt,pullover,shift,tank top,top,waist
shit,Amytal,Amytal pill,BM,Demerol,Dolophine,H,Luminal,Luminal pill,M,Mickey Finn,Nembutal,Nembutal pill,SOB,Seconal,Seconal pill,Tuinal,Tuinal pill,a continental,a curse,a damn,a darn,a hoot,alcohol,amobarbital sodium,analgesic,anodyne,asshole,bagatelle,baloney,barb,barbiturate,barbiturate pill,bastard,bauble,bean,bibelot,bilge,bit,black stuff,blah,blah-blah,bloody flux,blue,blue angel,blue devil,blue heaven,blue velvet,bop,bosh,bowel movement,brass farthing,buffalo chips,bugger,bull,bullshit,bunk,bunkum,button,ca-ca,calmative,catharsis,cent,chloral hydrate,codeine,codeine cough syrup,coprolite,coprolith,cow chips,cow flops,cow pats,crap,creep,cur,curio,defecate,defecation,dejection,depressant,depressor,diarrhea,dingleberry,dog,dolly,downer,droppings,dung,dysentery,evacuate,evacuation,farce,fart,farthing,feather,feces,feculence,fig,flapdoodle,fleabite,flux,folderol,fribble,frippery,gas,gaud,gewgaw,gimcrack,goofball,guano,guff,gup,hair,halfpenny,hard stuff,heel,heroin,hill of beans,hogwash,hokum,hood,hooey,hooligan,hop,horse,hot air,hypnotic,jakes,jerk,jest,joke,junk,kickshaw,knickknack,knickknackery,knockout drops,laudanum,lientery,liquor,loose bowels,lotus,louse,malarkey,manure,meanie,meperidine,methadone,minikin,mockery,molehill,moonshine,morphia,morphine,mother,movement,narcotic,night soil,number two,opiate,opium,ordure,pacifier,pain killer,paregoric,pen yan,peppercorn,phenobarbital,phenobarbital sodium,picayune,piffle,pill,pin,pinch of snuff,pinprick,poppycock,prick,purgation,purge,purple heart,quietener,rainbow,rap,rat,red,red cent,rot,row of pins,rubbish,runs,rush,scag,scat,scum,secobarbital sodium,sedative,sewage,sewerage,shithead,shitheel,shits,skunk,sleep-inducer,sleeper,sleeping draught,sleeping pill,smack,snake,snap,sneeshing,sodium thiopental,somnifacient,soother,soothing syrup,soporific,sou,stinkard,stinker,stool,straw,take a shit,tar,toad,tommyrot,toy,tranquilizer,trifle,trinket,tripe,triviality,trots,tuppence,turd,turistas,turps,two cents,twopence,void,voidance,whim-wham,white stuff,wind,yellow,yellow jacket
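Facts in this form are easy to triage mechanically. Here is a minimal sketch of how one might filter them, assuming the comma-separated layout shown above (an effect-size number, a weight, the word cluster, two decades being compared, and a direction flag). The field names and the file name are my guesses, not part of Juola and Ramsay's tooling:

    import csv
    from collections import namedtuple

    # Guessed field names for the fact tuples shown above.
    Fact = namedtuple("Fact", "effect weight cluster decade_a decade_b direction")

    def load_facts(path):
        """Yield one Fact per comma-separated line."""
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield Fact(float(row[0]), float(row[1]), row[2],
                           row[3], row[4], float(row[5]))

    # Triage: keep the strongest "shirt" facts and print them, biggest first.
    shirt_facts = [f for f in load_facts("conjectures.csv")
                   if f.cluster == "shirt" and f.effect > 50]
    for f in sorted(shirt_facts, key=lambda f: -f.effect):
        print(f"{f.cluster}: {f.decade_a} vs {f.decade_b} (effect {f.effect:.1f})")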
Obviously the category for "shit" is a broad one that includes drugs. How then can we drill down on these "facts" to figure out what they mean and whether they are significant? And ... what do we do with 87,000 such "facts"?

Robyn Taylor: Exploring Human Computer Interaction through Performance Practice

Robyn Taylor of the Advanced Man Machine Lab in CS at the University of Alberta is interested in generating new design ideas for participatory performance in public spaces. She wants to generate ideas for shared visual spaces like store fronts. The logic of her research is:
All performance is to some extent interactive, but there is a real difference between an audience that observes but can't change much, and participants who can change the performance. Taylor then talked about flow as a goal. She is following Gaver in designing for the ludic aspects of everyday life. She then showed a series of experiments:
The issue then was to figure out pragmatically how participants experienced the performances. They wanted to preserve the artistic performance while still evaluating engagement and experience. Taylor had a nice theoretical framework and then asked questions after authentic performances. They found that participants were quite willing to talk about their insecurities, performers' anxieties, and camaraderie. They were aware of being observed. They wanted to collaborate but didn't know how.
They make it very obvious what touching affects, and this seems to work. The box reduces stage fright because it presents itself as a busking experience. They used an iterative design process with a series of performances and time to change things in between.

Patrick Juola: Computers, Conjectures, and Creativity

Patrick started with "it was a dark and stormy night" and the question of where this line originated. When did the line become a cliché? What types of authors use it? How do we answer these questions? Traditionally, we read a lot and then write a paper. This obviously won't work with a million books. Computers, as many have commented, can do things with a million books. Search is one, but search doesn't help unless you know exactly what you want: you can't find a hypothesis you aren't looking for. When we do analysis we look for what we want, which means we probably won't find what we are not looking for. There are examples of tools in the sciences that use the computer for automatic hypothesis generation; Graffiti is one that generates mathematical conjectures which can then be tested. Patrick's Conjecturator tool, like Graffiti, generates hypotheses for literary work. It takes a large corpus of texts and a thesaurus of clusters of words, and then tests different combinations (as in words about shirts in the 1840s vs. the 1850s). Some examples:

Male Authors use Animal Terms more/less than Female Authors
Poems use "From" more than Prose
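A minimal sketch of the kind of loop the Conjecturator runs. This is my own reconstruction from Patrick's description, not his actual code, and the toy clusters and corpora are invented: pick a random word cluster and two subcorpora, compare the cluster's relative frequency, and emit a conjecture when they differ.

    import random
    from collections import Counter

    # Toy stand-ins for the thesaurus clusters and decade subcorpora.
    clusters = {"shirt": {"shirt", "blouse", "doublet"},
                "animal": {"dog", "horse", "cat"}}
    corpora = {"1840s": "his shirt and his dog barked".split(),
               "1850s": "a blouse a horse a horse ran".split()}

    def cluster_rate(tokens, words):
        """Relative frequency of a word cluster in a token list."""
        counts = Counter(tokens)
        return sum(counts[w] for w in words) / len(tokens)

    def conjecture():
        """Generate one random, testable assertion about the corpora."""
        name, words = random.choice(list(clusters.items()))
        a, b = random.sample(list(corpora), 2)
        ra, rb = cluster_rate(corpora[a], words), cluster_rate(corpora[b], words)
        if ra == rb:
            return None  # nothing to assert this round
        more, less = (a, b) if ra > rb else (b, a)
        return f'"{name}" words are used more in the {more} than in the {less}'

    for _ in range(5):
        c = conjecture()
        if c:
            print(c)

A real run would replace the bare frequency comparison with a significance test and write each assertion out as a fact tuple like the ones shown above.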
You can see conjectures being generated on Twitter at http://www.twitter.com/conjecturator . As mentioned above, Patrick adapted the Conjecturator to run on WestGrid overnight, and it generated 87,000 "facts" about 19th-century literature after testing many more hypotheses. The problem now is what to do with that many facts. These facts are not necessarily interesting. What do you do to find the hermeneutically interesting facts in the result set? From there Patrick talked about our project on the epidemiology of ideas: the Conjecturator, if we had a large corpus of articles across humanities disciplines, might be able to generate interesting "facts" about the history of ideas. Afterwards we got an email that the Conjecturator run used 184 days of processing!

Day 3

Visit to AICT facilities

We visited the visualization labs, the machine room and the 3D printing area to give us ideas of how we can use WestGrid facilities beyond just HPC. We saw Checkers and other servers (like TAPoR) in the machine room. See photos on Flickr at http://www.flickr.com/photos/geoffreyrockwell/sets/72157623924208575/

Paul Lu on MapReduce and Hadoop

MapReduce is a specialized programming model/approach, and Hadoop is an implementation of MapReduce. The key ideas are:
The idea is to take a very large amount of data and have the system figure out how to split it up for processing (the map phase) and then combine the results into an output (the reduce phase); a word-count sketch follows after the next paragraph.

Paul Lu, Cloud Computing and HPC

Paul Lu gave an excellent overview of what cloud computing is: computing or data resources provided by a third party and usually accessed over a network. Examples would be Gmail, Hotmail, Google Docs, Facebook, and Salesforce.
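As promised above, here is a word-count sketch of the two MapReduce phases in plain Python, with no Hadoop machinery. The chunking and function names are mine; in Hadoop the framework distributes the chunks across machines and performs the shuffle for you:

    from collections import defaultdict
    from itertools import chain

    def map_chunk(chunk):
        """Map phase: each chunk independently emits (word, 1) pairs."""
        return [(word.lower(), 1) for word in chunk.split()]

    def shuffle(pairs):
        """Group values by key; Hadoop does this between the two phases."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_group(key, values):
        """Reduce phase: combine one key's values into a single result."""
        return key, sum(values)

    chunks = ["it was a dark and stormy night", "a dark night"]
    pairs = chain.from_iterable(map_chunk(c) for c in chunks)
    counts = dict(reduce_group(k, v) for k, v in shuffle(pairs).items())
    print(counts["dark"], counts["night"])  # 2 2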
A cloud service is also an "application service provider." In some ways it is computing over the web. (The web is becoming an operating system.) There are three kinds of cloud computing:
We had an interesting conversation about whether researchers should be moving to getting funding for cloud computing rather than buying their own equipment. Paul then asked, "Isn't this just client-server?" Yes, but there is more, including service-level agreements, web services, virtual machines, and virtual networks. Another way of thinking of a cloud is that at the back end it has virtualization, with web services built over that. Paul then compared traditional HPC and cloud computing:

Traditional HPC is batch scheduled, with big computation, big memory, big storage and big I/O; it is good for simulations, data streams and big-question research.

Cloud computing is interactive and on demand, with variable amounts of computation, memory and networking; it is suited to the humanities, portals and iterative questioning.

Paul then talked about the humanities and the cloud. Here are some of the issues:
Paul then talked about how to add cloud to HPC. Some ideas:
Paul believes that ultimately academic consortia can no longer compete with Amazon EC2 on cost. There are pros and cons to the all-you-can-eat model vs. the pay-as-you-go model.

Day 4

Stephen Ramsay, Knowing It When You See It: Humanistic Inquiry in an Age of Big Data

Steve started with some slides, including one of a "Wunderkammer". He talked about these Wonder Rooms as an image of the arrangement of knowledge with man at the centre. He then connected that to what we know and don't know (and don't know we don't know); see Schwartz on what we don't know. For Schwartz, education is about moving things from what we don't know we don't know into the category of what we know we don't know. Steve then connected this to reflections on Matthew Arnold's Culture and Anarchy and Arnold's idea that "we" are concerned with the "best which has been thought and said". Imagine if the Wunderkammer had everything (not just the representative or the best): that is the problem of big data. What do we do with all of human thought and expression? HPC in the humanities is about learning what we don't know we don't know from all the data. Steve proposed three things that need to happen:
The gap that should concern us is the one between the rhetoric of solving problems and what is actually happening with HPC, for both humanists and others. It is not just humanists who are struggling with the utilitarian rhetoric of big problems. In many ways we need to acknowledge the importance of playing with supercomputing. The goal of the humanities is to keep the conversation going, not to get results. You can see how the humanities will never satisfy those who ask what use our learning is: our learning just leads to awareness of more ignorance. It is worse than not advancing knowledge; it critiques what we thought we knew to the point that there is nothing left. Why fund that?

Megan Meredith-Lobay, GIS

A geographic information system captures, stores and analyzes data linked to location. It lets you study patterns in the form of maps: it shows you where things are taking place. With GIS you can have both raster and vector information. The key to GIS and the humanities is "contextualization"; GIS lets us see the context for data. We discussed when to use GIS and when to use mashups with things like Google Maps. There seems to be a difference between the GIS community and the literary computing community: the GIS folk are using commercial software on PCs, while the lit community has moved to open source on the web.

Presentations

Linking Censuses

The LC project (Andrew Ross) is trying to link people from one census to another even though names change, people move, and so on. They have many censuses from Canada and the US, and they want to link across them so we can see, for example, people immigrating from Canada to the US. Automatic linkage between two census datasets goes through all the names, trying to classify whether each pair is a match or not. This is very computationally expensive. They get 8-22% linkage, depending on the datasets, and they are getting too many false positives. One question is whether they can map the results with GIS to see if there are a lot of holes.

Lessons Learned

Dave McCaughan from Sharcnet shared some thoughts about lessons learned:
Dave also showed a tool that lets them check questionable links and approve them.

Text Analysis in the Cloud

Mike Smit presented on a project to make analytics more efficient on the web. They experimented with indexing and with cloud computing, and found that after two queries things are faster on the indexed cloud than running each tool from scratch again. They are now working on very large text datasets (gigabyte sized). One of the things they are doing is building simple operations that can then be composed in different ways. He then showed a recommendation engine demo that is one way of composing operations. Extremely neat!

Old Bailey

Jamie McLaughlin was the first to present on the Criminal Intent project that is data mining the Old Bailey corpus. He talked about the dataset and what we are doing with it, Voyeur, and Zotero. He talked about the place of an API in this project so that Voyeur and Zotero can be passed data. Stéfan Sinclair then talked about computing the Normalized Compression Distance (NCD) between documents; NCD is one measure of similarity between documents. He showed what was done in Voyeur, along with a visualization of a small subset of the documents. Old Bailey has 200,000 trial records, which means 40 million NCD tuples, which would take a long, long time. He is now implementing parallelized code to run on 100 cores. Sajib Barua has been doing a number of experiments on ways to data mine the records. One approach he prototyped was a data warehouse paradigm that uses the tag information to build comparative graphs; he showed examples using the Old Bailey data in Tableau. A second approach he tried is Frequent Itemsets (FI): again he adapted the data and ran it to generate some example FIs like "silver pieces money called king". The last approach is to try clustering trials using the FIs.

Epidemiology of Ideas

Patrick Juola talked about the application of the Conjecturator to various datasets. The Conjecturator is easy to parallelize. He used about a year of compute time on Orlando, Victorian novels, and the Old Bailey trial records. The big problem is the large number of assertions that the Conjecturator generates: what do we do with the pile of facts? He showed a visualization that we can use to look at the assertions.

Orlando

Susan Brown talked about experiments to improve XML (actually SGML) searching of an 8 million word corpus. They are trying to provide users with a much smarter interface to Orlando. One thing they are doing is trying clustering tools to see if they can help people navigate results. They got a nice clustering of novelists, which was visualized in the Mandala browser.

Day 5

Jonathan Schaeffer on HPC and Humanities

Jonathan talked about where we are today and what is right and wrong about Compute Canada. Compute Canada (Calcul Canada) is a national HPC platform.
Compute Canada is 61 institutions across Canada, with over 1,100 PIs and their associated research assistants. Usage is dominated by the sciences: for example, Physics 24%, Biochemistry 19%, Chemistry 18%, and Engineering 16%. Compute Canada got about $150 million, which seems like a lot, but there hasn't been refresh funding, and CC is operating-poor. Good projects should have another 50% to 100% of their funding for operating; CC doesn't have that, and they are underfunded at the moment. CC has all this terrific equipment but not enough operating funds to really take advantage of it. CFI projects are very hard to run with so many players, as everything bottlenecks around the slowest. There are many successes:
The call for NPF-2 is coming. It will probably take until 2014 before anything can be bought. Planning is underway. CFI has been blunt about things that need to be done better:
NPF-2 will not be limited to HPC, and CC will be in competition with many other national projects. Digital humanists should articulate our needs so that CC can work with us. Computing needs to be talked about from source to destination, from acquisition of data to dissemination of research. We need to invest in people, in cloud computing, and in visualization.

Final Discussion

We had a final discussion on the needs of computing humanists and ways of engaging the CC community. The final report with recommendations will be put up once written.

Possible recommendations

Here are some possible recommendations coming out of the workshop.